>
> and as I said the other day, in my segment the https page has empty content.


Hmm, that's not what you said in your previous message. Also, I can see it has a
signature in the crawlDB, so it must have content.

I expect that the content would be indexed under the http:// URL thanks to
_repr_: http://myDNS/index.html

See BasicIndexingFilter for details.

> it's just mixing the HTTPS url with the content of the HTTP one.


It should be the other way round: the HTTPS content *with* the HTTP URL.
Actually the http:// document is not sent to the index at all (see around
line 86 in IndexerMapReduce), so what you are seeing in the index must be
the https doc with _repr_ used as a URL.
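
To make the expected behaviour concrete, here is a minimal sketch of it in plain
Java. It is not the Nutch source: the class ReprUrlSketch, its method names and
the placeholder content string are hypothetical, and the rule that only
db_fetched entries reach the index is a simplification of what the indexer does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the indexing behaviour discussed above.
// NOT the Nutch source: ReprUrlSketch and its methods are hypothetical names.
public class ReprUrlSketch {

    // URL a fetched page is indexed under: its _repr_ value if present,
    // otherwise its own URL.
    public static String indexedUrl(String fetchedUrl, String reprUrl) {
        return (reprUrl == null || reprUrl.isEmpty()) ? fetchedUrl : reprUrl;
    }

    // Assumption: only entries with status db_fetched are sent to the index;
    // redirect entries such as db_redir_temp are skipped.
    public static boolean isSentToIndex(String status) {
        return "db_fetched".equals(status);
    }

    // Each entry is {url, status, repr (may be null), content}.
    public static Map<String, String> buildIndex(String[][] entries) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String[] e : entries) {
            if (isSentToIndex(e[1])) {
                index.put(indexedUrl(e[0], e[2]), e[3]);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[][] entries = {
            // The redirecting http:// entry: no content, never indexed.
            {"http://myDNS/index.html", "db_redir_temp", null, ""},
            // The https:// target: fetched, carries _repr_ pointing at http://.
            {"https://myDNS/index.html", "db_fetched",
             "http://myDNS/index.html", "placeholder https content"},
        };
        System.out.println(buildIndex(entries));
    }
}
```

Under these assumptions the index ends up with a single document, keyed by the
http:// URL but holding the content fetched over https://.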

Can you please confirm that:
1/ the segment has content for the https:// doc
2/ you can find the http:// URL in the index and it has no content

HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com
On 15 March 2010 13:00, BELLINI ADAM <mbel...@msn.com> wrote:

>
> Hi
> thx for your help,
>
> this is a fresh crawl from today:
>
>
> 1- HTTP:
> bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html
>
> URL: http://myDNS/index.html
> Version: 7
> Status: 4 (db_redir_temp)
> Fetch time: Mon Mar 15 12:15:52 EDT 2010
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 36000 seconds (0 days)
> Score: 0.018119827
> Signature: null
> Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html
>
>
>
>
> 2- HTTPS:
> bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html
>
> URL: https://myDNS/index.html
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Mon Mar 15 12:32:34 EDT 2010
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 36000 seconds (0 days)
> Score: 0.00511379
> Signature: 5f84dcec905c24e3e2af902ad9ad7398
> Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html
>
>
>
>
>
>
> and as I said the other day, in my segment the https page has empty content.
>
> thx
>
>
> > Date: Mon, 15 Mar 2010 11:39:46 +0000
> > Subject: Re: Content of redirected urls empty
> > From: lists.digitalpeb...@gmail.com
> > To: nutch-user@lucene.apache.org
> >
> > Adam,
> >
> > Could you please tell us what the http and https entries look like in the
> > crawlDB (using readdb -url)?
> >
> > J.
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> > On 13 March 2010 04:29, BELLINI ADAM <mbel...@msn.com> wrote:
> >
> > >
> > > Does no one have an answer?
> > >
> > >
> > >
> > >
> > >
> > > > From: mbel...@msn.com
> > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > Subject: RE: Content of redirected urls empty
> > > > Date: Wed, 10 Mar 2010 21:01:54 +0000
> > > >
> > > >
> > > > I read a lot of posts regarding redirected URLs but didn't find a solution!
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > > From: mbel...@msn.com
> > > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > > Subject: RE: Content of redirected urls empty
> > > > > Date: Tue, 9 Mar 2010 16:59:05 +0000
> > > > >
> > > > >
> > > > >
> > > > > hi,
> > > > >
> > > > > > I don't know if you found a few minutes to look at my problem :)
> > > > > >
> > > > > > but I want to explain it again, maybe it wasn't clear:
> > > > > >
> > > > > >
> > > > > > I have HTTP pages redirected to HTTPS (but it's the same URL):
> > > > > >
> > > > > > HTTP://page1.com   redirected to HTTPS://page1.com
> > > > > >
> > > > > > the content of my HTTP page is empty.
> > > > > > the content of my HTTPS page is not empty.
> > > > > >
> > > > > > In my segment I found both URLs (HTTP and HTTPS); the content of the
> > > > > > HTTPS page is not empty,
> > > > > >
> > > > > > but in my index I found the HTTP one with the empty content.
> > > > > >
> > > > > > Is there a way to tell Nutch to index the URL with the non-empty
> > > > > > content? Or why doesn't Nutch index the target URL rather than
> > > > > > indexing the empty (origin) one?
> > > > >
> > > > > thx a lot
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > From: mbel...@msn.com
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > Subject: RE: Content of redirected urls empty
> > > > > > Date: Mon, 8 Mar 2010 17:08:06 +0000
> > > > > >
> > > > > >
> > > > > > > I'm sorry... I just checked twice, and in my index I have the
> > > > > > > original URL, which is the HTTP one with the empty content, but it
> > > > > > > doesn't index the HTTPS one. I am using the Solr index.
> > > > > > > thx
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: mbel...@msn.com
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > Date: Mon, 8 Mar 2010 17:01:34 +0000
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Hi, I've just dumped my segments and found that I have both
> > > > > > > > URLs: the original one (HTTP) with an empty content and the
> > > > > > > > REDIRECTED TO or DESTINATION URL (HTTPS) with NON EMPTY content!
> > > > > > > >
> > > > > > > > But in my search I found only the HTTPS URL with an empty
> > > > > > > > content!! Logically the content of the HTTPS URL is not empty!
> > > > > > > > It's just mixing the HTTPS URL with the content of the HTTP one.
> > > > > > > >
> > > > > > > >
> > > > > > > > Our redirect is done by Java code, response.sendRedirect(…), so
> > > > > > > > it seems to be an HTTP redirect, right?
> > > > > > >
> > > > > > > thx for helping me :)
> > > > > > >
> > > > > > >
> > > > > > > > Date: Mon, 8 Mar 2010 15:51:34 +0100
> > > > > > > > From: a...@getopt.org
> > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > Subject: Re: Content of redirected urls empty
> > > > > > > >
> > > > > > > > On 2010-03-08 14:55, BELLINI ADAM wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > is there any idea guys ??
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >> From: mbel...@msn.com
> > > > > > > > >> To: nutch-user@lucene.apache.org
> > > > > > > > >> Subject: Content of redirected urls empty
> > > > > > > > >> Date: Fri, 5 Mar 2010 22:01:05 +0000
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> hi,
> > > > > > > > > >> the content of my redirected URLs is empty... but they still
> > > > > > > > > >> have the other metadata...
> > > > > > > > > >> I have an http URL that is redirected to https.
> > > > > > > > > >> In my index I find the http URL but with an empty content...
> > > > > > > > > >> could you explain it plz?
> > > > > > > >
> > > > > > > > > There are two ways to redirect - one is with protocol, and the
> > > > > > > > > other is with content (either meta refresh, or javascript).
> > > > > > > > >
> > > > > > > > > When you dump the segment, is there really no content for the
> > > > > > > > > redirected url?
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Andrzej Bialecki     <><
> > > > > > > >   ___. ___ ___ ___ _ _   __________________________________
> > > > > > > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > > > > > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > > > > > > http://www.sigram.com  Contact: info at sigram dot com
> > > > > > > >
> > > > > > >
> > > > > > >
> _________________________________________________________________
> > > > > > > Live connected with Messenger on your phone
> > > > > > > http://go.microsoft.com/?linkid=9712958
> > > > > >
> > > > > > _________________________________________________________________
> > > > > > IM on the go with Messenger on your phone
> > > > > > http://go.microsoft.com/?linkid=9712960
> > > > >
> > > > > _________________________________________________________________
> > > > > Stay in touch.
> > > > > http://go.microsoft.com/?linkid=9712959
> > > >
> > > > _________________________________________________________________
> > > > Take your contacts everywhere
> > > > http://go.microsoft.com/?linkid=9712959
> > >
> > > _________________________________________________________________
> > > Stay in touch.
> > > http://go.microsoft.com/?linkid=9712959
> > >
>
> _________________________________________________________________
> IM on the go with Messenger on your phone
> http://go.microsoft.com/?linkid=9712960
>
