> And as I said the other day, on my segment the https has an empty content.
Hmm, that's not what you said in your previous message, and I can see the https entry has a signature in the crawlDB, so it must have content.

I expect the content to be indexed under the http:// URL thanks to _repr_: http://myDNS/index.html. See BasicIndexingFilter for details.

> it's just mixing the HTTPS url with the content of the HTTP one.

It should be the other way round: the HTTPS content *with* the HTTP URL. Actually the http:// document is not sent to the index at all (see around line 86 in IndexerMapReduce), so what you are seeing in the index must be the https doc with _repr_ used as the URL.

Can you please confirm that:
1/ the segment has content for the https:// doc
2/ you can find the http:// URL in the index and it has no content

HTH

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

On 15 March 2010 13:00, BELLINI ADAM <mbel...@msn.com> wrote:

> Hi
> thx for your help,
>
> this is a fresh crawl from today:
>
> 1- HTTP:
> bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html
>
> URL: http://myDNS/index.html
> Version: 7
> Status: 4 (db_redir_temp)
> Fetch time: Mon Mar 15 12:15:52 EDT 2010
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 36000 seconds (0 days)
> Score: 0.018119827
> Signature: null
> Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html
>
> 2- HTTPS:
> bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html
>
> URL: https://myDNS/index.html
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Mon Mar 15 12:32:34 EDT 2010
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 36000 seconds (0 days)
> Score: 0.00511379
> Signature: 5f84dcec905c24e3e2af902ad9ad7398
> Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html
>
> and as i said the other day, on my segment the https has an empty content.
>
> thx
>
> > Date: Mon, 15 Mar 2010 11:39:46 +0000
> > Subject: Re: Content of redirected urls empty
> > From: lists.digitalpeb...@gmail.com
> > To: nutch-user@lucene.apache.org
> >
> > Adam,
> >
> > Could you please tell us what the http and https entries look like in the
> > crawlDB (using readdb -url)?
> >
> > J.
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> > On 13 March 2010 04:29, BELLINI ADAM <mbel...@msn.com> wrote:
> >
> > > no one has an answer !?
> > >
> > > > From: mbel...@msn.com
> > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > Subject: RE: Content of redirected urls empty
> > > > Date: Wed, 10 Mar 2010 21:01:54 +0000
> > > >
> > > > i read a lot of posts regarding redirected urls but didn't find a solution !
> > > >
> > > > > From: mbel...@msn.com
> > > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > > Subject: RE: Content of redirected urls empty
> > > > > Date: Tue, 9 Mar 2010 16:59:05 +0000
> > > > >
> > > > > hi,
> > > > >
> > > > > i don't know if you found a few minutes to look at my problem :)
> > > > >
> > > > > but i want to explain it again, maybe it wasn't clear:
> > > > >
> > > > > i have HTTP pages redirected to HTTPS (but it's the same URL):
> > > > >
> > > > > HTTP://page1.com redirected to HTTPS://page1.com
> > > > >
> > > > > the content of my HTTP page is empty.
> > > > > the content of my HTTPS page is not empty
> > > > >
> > > > > in my segment i found both URLs (HTTP and HTTPS), and the content of the HTTPS page is not empty
> > > > >
> > > > > but in my index i found the HTTP one with the empty content.
> > > > >
> > > > > is there a way to tell nutch to index the url with the non-empty content? or why doesn't nutch index the target URL rather than the empty (origin) one ??
> > > > >
> > > > > thx a lot
> > > > >
> > > > > > From: mbel...@msn.com
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > Subject: RE: Content of redirected urls empty
> > > > > > Date: Mon, 8 Mar 2010 17:08:06 +0000
> > > > > >
> > > > > > i'm sorry... i just checked twice, and in my index i have the original URL, which is the HTTP one with the empty content... but it doesn't index the HTTPS one... and i'm using the solr index
> > > > > > thx
> > > > > >
> > > > > > > From: mbel...@msn.com
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > Date: Mon, 8 Mar 2010 17:01:34 +0000
> > > > > > >
> > > > > > > Hi, i've just dumped my segments and found that i have both URLs, the original one (HTTP) with an empty content and the REDIRECTED TO or DESTINATION URL (HTTPS) with NON EMPTY content !
> > > > > > >
> > > > > > > but in my search i found only the HTTPS URL with an empty content !! logically the content of the HTTPS URL is not empty !
> > > > > > > it's just mixing the HTTPS url with the content of the HTTP one.
> > > > > > >
> > > > > > > our redirect is done by java code response.sendRedirect(…), so it seems to be an http redirect, right ??
> > > > > > >
> > > > > > > thx for helping me :)
> > > > > > >
> > > > > > > > Date: Mon, 8 Mar 2010 15:51:34 +0100
> > > > > > > > From: a...@getopt.org
> > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > Subject: Re: Content of redirected urls empty
> > > > > > > >
> > > > > > > > On 2010-03-08 14:55, BELLINI ADAM wrote:
> > > > > > > > >
> > > > > > > > > is there any idea guys ??
> > > > > > > > >
> > > > > > > > >> From: mbel...@msn.com
> > > > > > > > >> To: nutch-user@lucene.apache.org
> > > > > > > > >> Subject: Content of redirected urls empty
> > > > > > > > >> Date: Fri, 5 Mar 2010 22:01:05 +0000
> > > > > > > > >>
> > > > > > > > >> hi,
> > > > > > > > >> the content of my redirected urls is empty... but they still have the other metadata...
> > > > > > > > >> i have an http url that is redirected to https.
> > > > > > > > >> in my index i find the http URL but with an empty content...
> > > > > > > > >> could you explain it plz?
> > > > > > > >
> > > > > > > > There are two ways to redirect - one is with protocol, and the other is
> > > > > > > > with content (either meta refresh, or javascript).
> > > > > > > >
> > > > > > > > When you dump the segment, is there really no content for the redirected
> > > > > > > > url?
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Andrzej Bialecki
> > > > > > > > Information Retrieval, Semantic Web
> > > > > > > > Embedded Unix, System Integration
> > > > > > > > http://www.sigram.com Contact: info at sigram dot com
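
For readers following the thread, here is a minimal sketch of the behaviour Julien describes: the redirecting http:// entry is skipped at indexing time, and the fetched https:// entry is indexed under the URL stored in its _repr_ metadata. This is a simplified model, not Nutch's actual code; `Entry`, `to_index_doc`, the status strings as plain text, and the placeholder content are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    """Hypothetical stand-in for one crawled URL and its crawlDB data."""
    url: str
    status: str                      # e.g. "db_fetched" or "db_redir_temp"
    content: str = ""
    metadata: dict = field(default_factory=dict)


def to_index_doc(entry: Entry) -> Optional[dict]:
    """Return the document sent to the index, or None if the entry is skipped."""
    # Assumption: only successfully fetched entries reach the index,
    # so the redirecting http:// entry is dropped here.
    if entry.status != "db_fetched":
        return None
    # Assumption: when _repr_ is present it replaces the entry's own URL.
    indexed_url = entry.metadata.get("_repr_", entry.url)
    return {"url": indexed_url, "content": entry.content}


http = Entry("http://myDNS/index.html", "db_redir_temp")
https = Entry(
    "https://myDNS/index.html",
    "db_fetched",
    content="page text",             # placeholder content
    metadata={"_repr_": "http://myDNS/index.html"},
)

docs = [d for d in (to_index_doc(e) for e in (http, https)) if d is not None]
print(docs)
```

Under these assumptions the index holds a single document: the http:// URL paired with the https:// content. If the indexed document has no content, the segment content for the https:// entry is the first thing to check, which is what Julien asks Adam to confirm.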