hi again,
I forgot to ask: what does _repr_ mean?

> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: Content of redirected urls empty
> Date: Mon, 15 Mar 2010 15:29:48 +0000
>
> Oh sorry, I was mistaken again, and yes, you are completely right:
> 1- The HTTPS URL has content in my segment.
> 2- The HTTP URL has empty content.
>
> In my index I have the HTTPS URL with the empty content (it's exactly
> what you said: it's just mixing the HTTPS URL with the content of the
> HTTP one), and I expected the other way round: the HTTPS content *with*
> the HTTP URL.
>
> I don't know if I have the HTTP URL in my index; I don't know how to see
> all the indexed URLs in Solr. But I'm sure that when I perform a search
> using RMS I obtain only the HTTPS URL with an empty content (I guess it's
> the empty content of the HTTP one).
> But again, in the segment the content of the HTTPS URL is not empty.
>
> > Date: Mon, 15 Mar 2010 13:44:33 +0000
> > Subject: Re: Content of redirected urls empty
> > From: lists.digitalpeb...@gmail.com
> > To: nutch-user@lucene.apache.org
> >
> > > and as i said the last day, on my segment the https has an empty content.
> >
> > Hmm, it's not what you said in your previous message, and I can see it
> > has a signature in the crawlDB, so it must have a content.
> >
> > I expect that the content would be indexed under the http:// URL thanks to
> > _repr_: http://myDNS/index.html
> >
> > See BasicIndexingFilter for details.
> >
> > > it's just mixing the HTTPS url with the content of the HTTP one.
> >
> > It should be the other way round: the HTTPS content *with* the HTTP URL.
> > Actually the http:// document is not sent to the index at all (see around
> > line 86 in IndexerMapReduce), so what you are seeing in the index must be
> > the https doc with _repr_ used as a URL.
> >
> > Can you please confirm that:
> > 1/ the segment has a content for the https:// doc
> > 2/ you can find the http:// URL in the index and it has no content
> >
> > HTH
> >
> > Julien
> >
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> > On 15 March 2010 13:00, BELLINI ADAM <mbel...@msn.com> wrote:
> >
> > > Hi
> > > thx for your help,
> > >
> > > this is a fresh crawl of today:
> > >
> > > 1- HTTP:
> > > bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html
> > >
> > > URL: http://myDNS/index.html
> > > Version: 7
> > > Status: 4 (db_redir_temp)
> > > Fetch time: Mon Mar 15 12:15:52 EDT 2010
> > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > Retries since fetch: 0
> > > Retry interval: 36000 seconds (0 days)
> > > Score: 0.018119827
> > > Signature: null
> > > Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html
> > >
> > > 2- HTTPS:
> > > bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html
> > >
> > > URL: https://myDNS/index.html
> > > Version: 7
> > > Status: 2 (db_fetched)
> > > Fetch time: Mon Mar 15 12:32:34 EDT 2010
> > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > Retries since fetch: 0
> > > Retry interval: 36000 seconds (0 days)
> > > Score: 0.00511379
> > > Signature: 5f84dcec905c24e3e2af902ad9ad7398
> > > Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html
> > >
> > > and as I said the other day, in my segment the https has an empty content.
> > >
> > > thx
> > >
> > > > Date: Mon, 15 Mar 2010 11:39:46 +0000
> > > > Subject: Re: Content of redirected urls empty
> > > > From: lists.digitalpeb...@gmail.com
> > > > To: nutch-user@lucene.apache.org
> > > >
> > > > Adam,
> > > >
> > > > Could you please tell us what the http and https entries look like in
> > > > the crawlDB (using readdb -url)?
> > > >
> > > > J.
> > > >
> > > > --
> > > > DigitalPebble Ltd
> > > > http://www.digitalpebble.com
> > > >
> > > > On 13 March 2010 04:29, BELLINI ADAM <mbel...@msn.com> wrote:
> > > >
> > > > > Does no one have an answer?
> > > > >
> > > > > > From: mbel...@msn.com
> > > > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > > > Subject: RE: Content of redirected urls empty
> > > > > > Date: Wed, 10 Mar 2010 21:01:54 +0000
> > > > > >
> > > > > > I read a lot of posts regarding redirected URLs but didn't find a
> > > > > > solution!
> > > > > >
> > > > > > > From: mbel...@msn.com
> > > > > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > Date: Tue, 9 Mar 2010 16:59:05 +0000
> > > > > > >
> > > > > > > hi,
> > > > > > >
> > > > > > > I don't know if you found a few minutes to look at my problem :)
> > > > > > >
> > > > > > > but I want to explain it again, maybe it wasn't clear:
> > > > > > >
> > > > > > > I have HTTP pages redirected to HTTPS (but it's the same URL):
> > > > > > >
> > > > > > > HTTP://page1.com redirected to HTTPS://page1.com
> > > > > > >
> > > > > > > The content of my HTTP page is empty.
> > > > > > > The content of my HTTPS page is not empty.
> > > > > > >
> > > > > > > In my segment I found both URLs (HTTP and HTTPS); the content
> > > > > > > of the HTTPS page is not empty,
> > > > > > >
> > > > > > > but in my index I found the HTTP one with the empty content.
> > > > > > >
> > > > > > > Is there a way to tell Nutch to index the URL with the non-empty
> > > > > > > content? Or why doesn't Nutch index the target URL rather than
> > > > > > > indexing the empty (origin) one?
> > > > > > >
> > > > > > > thx a lot
> > > > > > >
> > > > > > > > From: mbel...@msn.com
> > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > > Date: Mon, 8 Mar 2010 17:08:06 +0000
> > > > > > > >
> > > > > > > > I'm sorry... I just checked twice, and in my index I have the
> > > > > > > > original URL, which is the HTTP one with the empty content, but
> > > > > > > > it doesn't index the HTTPS one. And I am using the Solr index.
> > > > > > > > thx
> > > > > > > >
> > > > > > > > > From: mbel...@msn.com
> > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > > > Date: Mon, 8 Mar 2010 17:01:34 +0000
> > > > > > > > >
> > > > > > > > > Hi, I've just dumped my segments and found that I have both
> > > > > > > > > URLs: the original one (HTTP) with an empty content and the
> > > > > > > > > REDIRECTED TO or DESTINATION URL (HTTPS) with NON EMPTY content!
> > > > > > > > >
> > > > > > > > > But in my search I found only the HTTPS URL with an empty
> > > > > > > > > content!! Logically the content of the HTTPS URL is not empty!
> > > > > > > > > It's just mixing the HTTPS URL with the content of the HTTP one.
> > > > > > > > >
> > > > > > > > > Our redirect is done by Java code, response.sendRedirect(…),
> > > > > > > > > so it seems to be an HTTP redirect, right?
> > > > > > > > >
> > > > > > > > > thx for helping me :)
> > > > > > > > >
> > > > > > > > > > Date: Mon, 8 Mar 2010 15:51:34 +0100
> > > > > > > > > > From: a...@getopt.org
> > > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > > Subject: Re: Content of redirected urls empty
> > > > > > > > > >
> > > > > > > > > > On 2010-03-08 14:55, BELLINI ADAM wrote:
> > > > > > > > > > >
> > > > > > > > > > > is there any idea guys ??
> > > > > > > > > > >
> > > > > > > > > > >> From: mbel...@msn.com
> > > > > > > > > > >> To: nutch-user@lucene.apache.org
> > > > > > > > > > >> Subject: Content of redirected urls empty
> > > > > > > > > > >> Date: Fri, 5 Mar 2010 22:01:05 +0000
> > > > > > > > > > >>
> > > > > > > > > > >> hi,
> > > > > > > > > > >> the content of my redirected URLs is empty... but they
> > > > > > > > > > >> still have the other metadata...
> > > > > > > > > > >> I have an http URL that is redirected to https.
> > > > > > > > > > >> In my index I find the http URL but with an empty content...
> > > > > > > > > > >> Could you explain it plz?
> > > > > > > > > >
> > > > > > > > > > There are two ways to redirect - one is with protocol, and
> > > > > > > > > > the other is with content (either meta refresh, or
> > > > > > > > > > javascript).
> > > > > > > > > >
> > > > > > > > > > When you dump the segment, is there really no content for
> > > > > > > > > > the redirected url?
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > > Andrzej Bialecki     <><
> > > > > > > > > >  ___. ___ ___ ___ _ _   __________________________________
> > > > > > > > > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > > > > > > > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > > > > > > > > http://www.sigram.com  Contact: info at sigram dot com
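
On the _repr_ question at the top of the thread: going only by Julien's description above (the https doc is indexed "with _repr_ used as a URL", and the redirecting http:// entry is not sent to the index at all), the behaviour can be sketched roughly as below. This is an illustration in plain Python, not Nutch's actual code; the function names `should_index` and `url_for_index` and the dictionary layout are made up for this example, and only the `_repr_` key, the statuses, and the URLs come from the readdb output quoted above.

```python
# Hypothetical sketch of the behaviour described in the thread; NOT Nutch code.
REPR_KEY = "_repr_"  # metadata key as it appears in the readdb output


def should_index(status):
    """Assumption from the thread: only successfully fetched entries are
    indexed, so a db_redir_temp entry is skipped."""
    return status == "db_fetched"


def url_for_index(fetched_url, metadata):
    """Return the URL a document would be indexed under: the _repr_
    (representative) URL if the metadata carries one, else the fetched URL."""
    repr_url = metadata.get(REPR_KEY)
    return repr_url if repr_url else fetched_url


# The two crawlDB entries from the thread, reduced to the fields used here.
entries = [
    {"url": "http://myDNS/index.html", "status": "db_redir_temp", "metadata": {}},
    {"url": "https://myDNS/index.html", "status": "db_fetched",
     "metadata": {REPR_KEY: "http://myDNS/index.html"}},
]

indexed_urls = [
    url_for_index(e["url"], e["metadata"])
    for e in entries
    if should_index(e["status"])
]
print(indexed_urls)
```

Under these assumptions only one document reaches the index: the content fetched from https://, listed under the http:// URL stored in _repr_. Whether the running Nutch version behaves exactly this way should be checked against BasicIndexingFilter and IndexerMapReduce, as Julien suggests.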