Oh sorry, I was mistaken again, and yes, you are completely right.
1- The HTTPS URL has content in my segment.
2- The HTTP URL has empty content.

In my index I have the HTTPS URL with the empty content (it's exactly
what you said: it's just mixing the HTTPS URL with the content of the
HTTP one), and I expected the other way round: the HTTPS content *with*
the HTTP URL.


I don't know if I have the HTTP URL in my index; I don't know how to list all
the indexed URLs in Solr. But I'm sure that when I perform a search using RMS I
get only the HTTPS URL with an empty content (I guess it's the empty content
of the HTTP one).
But again, in the segment the content of the HTTPS page is not empty.
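
For reference, one way to list the URLs held in a Solr index is a match-all query that returns only the URL field, paged with start/rows. The following is only a minimal sketch under assumptions that are not taken from this setup: Solr is assumed to answer at http://localhost:8983/solr with a /select handler, and the page address is assumed to be stored in a field named url. The function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed values, not taken from the thread: adjust to the real instance.
SOLR_BASE = "http://localhost:8983/solr"  # hypothetical Solr base URL
URL_FIELD = "url"                         # assumed name of the stored URL field


def build_list_query(base=SOLR_BASE, field=URL_FIELD, start=0, rows=100):
    """Build a match-all select URL that returns only the URL field."""
    params = {"q": "*:*", "fl": field, "start": start, "rows": rows, "wt": "json"}
    return base + "/select?" + urlencode(params)


def extract_urls(response_text, field=URL_FIELD):
    """Return the URL field of every document in a Solr JSON response body."""
    docs = json.loads(response_text)["response"]["docs"]
    return [doc[field] for doc in docs if field in doc]


def list_indexed_urls(base=SOLR_BASE, field=URL_FIELD, rows=100):
    """Page through the index and yield every stored URL (needs a live Solr)."""
    start = 0
    while True:
        with urlopen(build_list_query(base, field, start, rows)) as resp:
            text = resp.read().decode("utf-8")
        yield from extract_urls(text, field)
        start += rows
        # Stop once the paging offset has passed the total number of hits.
        if start >= json.loads(text)["response"]["numFound"]:
            break
```

Iterating over list_indexed_urls() and printing each value would show whether both http://myDNS/index.html and https://myDNS/index.html are present in the index, provided the assumed field name matches the schema in use.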



> Date: Mon, 15 Mar 2010 13:44:33 +0000
> Subject: Re: Content of redirected urls empty
> From: lists.digitalpeb...@gmail.com
> To: nutch-user@lucene.apache.org
> 
> >
> > and as I said the other day, in my segment the https has an empty content.
> 
> 
> hmm it's not what you said in your previous message + I can see it has a
> signature in the crawlDB so it must have a content.
> 
> I expect that the content would be indexed under the http://  URL thanks to
> *_repr_: **http://myDNS/index.html*
> 
> See BasicIndexingFilter for details.
> 
> it's just mixing the HTTPS url with the content of the HTTP one.
> 
> 
> it should be the other way round : the HTTPS content *with* the HTTP URL.
> Actually the http:// document is not sent to the index at all (see around
> line 86 in IndexerMapReduce) so what you are seeing in the index must be
> the https doc with _repr_ used as a URL.
> 
> can you please confirm that :
> 1/ the segment has a content for the https:// doc
> 2/ you can find the http:// URL in the index and it has no content
> 
> HTH
> 
> Julien
> 
> -- 
> DigitalPebble Ltd
> http://www.digitalpebble.com
> On 15 March 2010 13:00, BELLINI ADAM <mbel...@msn.com> wrote:
> 
> >
> > Hi
> > thanks for your help,
> >
> > this is a fresh crawl from today:
> >
> >
> > 1- HTTP:
> > bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html
> >
> > URL: http://myDNS/index.html
> > Version: 7
> > Status: 4 (db_redir_temp)
> > Fetch time: Mon Mar 15 12:15:52 EDT 2010
> > Modified time: Wed Dec 31 19:00:00 EST 1969
> > Retries since fetch: 0
> > Retry interval: 36000 seconds (0 days)
> > Score: 0.018119827
> > Signature: null
> > Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html
> >
> >
> >
> >
> > 2- HTTPS:
> > bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html
> >
> > URL: https://myDNS/index.html
> > Version: 7
> > Status: 2 (db_fetched)
> > Fetch time: Mon Mar 15 12:32:34 EDT 2010
> > Modified time: Wed Dec 31 19:00:00 EST 1969
> > Retries since fetch: 0
> > Retry interval: 36000 seconds (0 days)
> > Score: 0.00511379
> > Signature: 5f84dcec905c24e3e2af902ad9ad7398
> > Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html
> >
> >
> >
> >
> >
> >
> > and as I said the other day, in my segment the https has an empty content.
> >
> > thx
> >
> >
> > > Date: Mon, 15 Mar 2010 11:39:46 +0000
> > > Subject: Re: Content of redirected urls empty
> > > From: lists.digitalpeb...@gmail.com
> > > To: nutch-user@lucene.apache.org
> > >
> > > Adam,
> > >
> > > Could you please tell us what the http and https entries look like in the
> > > crawlDB (using readdb -url)?
> > >
> > > J.
> > > --
> > > DigitalPebble Ltd
> > > http://www.digitalpebble.com
> > >
> > > On 13 March 2010 04:29, BELLINI ADAM <mbel...@msn.com> wrote:
> > >
> > > >
> > > > Does no one have an answer?!
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > > From: mbel...@msn.com
> > > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > > Subject: RE: Content of redirected urls empty
> > > > > Date: Wed, 10 Mar 2010 21:01:54 +0000
> > > > >
> > > > >
> > > > > I read lots of posts regarding redirected URLs but didn't find a
> > > > > solution!
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > From: mbel...@msn.com
> > > > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > > > Subject: RE: Content of redirected urls empty
> > > > > > Date: Tue, 9 Mar 2010 16:59:05 +0000
> > > > > >
> > > > > >
> > > > > >
> > > > > > hi,
> > > > > >
> > > > > > I don't know if you found a few minutes to look at my problem :)
> > > > > >
> > > > > > but I want to explain it again, maybe it wasn't clear:
> > > > > >
> > > > > >
> > > > > > I have HTTP pages redirected to HTTPS (but it's the same URL):
> > > > > >
> > > > > > HTTP://page1.com   redirected to HTTPS://page1.com
> > > > > >
> > > > > > the content of my page HTTP is empty.
> > > > > > the content of my page HTTPS is not empty
> > > > > >
> > > > > > in my segment I found both URLs (HTTP and HTTPS); the content of
> > > > > > the HTTPS page is not empty
> > > > > >
> > > > > > but in my index I found the HTTP one with the empty content.
> > > > > >
> > > > > > is there a way to tell Nutch to index the URL with the non-empty
> > > > > > content? Or why doesn't Nutch index the target URL rather than
> > > > > > the empty (origin) one?
> > > > > >
> > > > > > thx a lot
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: mbel...@msn.com
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > Date: Mon, 8 Mar 2010 17:08:06 +0000
> > > > > > >
> > > > > > >
> > > > > > > I'm sorry... I just double-checked, and in my index I have the
> > > > > > > original URL, which is the HTTP one with the empty content, but it
> > > > > > > doesn't index the HTTPS one. And I am using the Solr index.
> > > > > > > thx
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: mbel...@msn.com
> > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > > Date: Mon, 8 Mar 2010 17:01:34 +0000
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Hi, I've just dumped my segments and found that I have both
> > > > > > > > URLs: the original one (HTTP) with an empty content and the
> > > > > > > > REDIRECTED TO or DESTINATION URL (HTTPS) with NON EMPTY content!
> > > > > > > >
> > > > > > > > but in my search I found only the HTTPS URL with an empty
> > > > > > > > content!! Logically the content of the HTTPS URL is not empty!
> > > > > > > > it's just mixing the HTTPS URL with the content of the HTTP one.
> > > > > > > >
> > > > > > > >
> > > > > > > > our redirect is done by Java code response.sendRedirect(…), so
> > > > > > > > it seems to be an HTTP redirect, right?
> > > > > > > >
> > > > > > > > thx for helping me :)
> > > > > > > >
> > > > > > > >
> > > > > > > > > Date: Mon, 8 Mar 2010 15:51:34 +0100
> > > > > > > > > From: a...@getopt.org
> > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > Subject: Re: Content of redirected urls empty
> > > > > > > > >
> > > > > > > > > On 2010-03-08 14:55, BELLINI ADAM wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > is there any idea guys ??
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >> From: mbel...@msn.com
> > > > > > > > > >> To: nutch-user@lucene.apache.org
> > > > > > > > > >> Subject: Content of redirected urls empty
> > > > > > > > > >> Date: Fri, 5 Mar 2010 22:01:05 +0000
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> hi,
> > > > > > > > > >> the content of my redirected URLs is empty... but they still
> > > > > > > > > >> have the other metadata...
> > > > > > > > > >> I have an http URL that is redirected to https.
> > > > > > > > > >> in my index I find the http URL but with an empty content...
> > > > > > > > > >> could you explain it plz?
> > > > > > > > >
> > > > > > > > > There are two ways to redirect - one is with protocol, and the
> > > > > > > > > other is with content (either meta refresh, or javascript).
> > > > > > > > >
> > > > > > > > > When you dump the segment, is there really no content for the
> > > > > > > > > redirected url?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > > Andrzej Bialecki     <><
> > > > > > > > >   ___. ___ ___ ___ _ _   __________________________________
> > > > > > > > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > > > > > > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > > > > > > > http://www.sigram.com  Contact: info at sigram dot com
> > > > > > > > >
> > > > > > > >
> > > > > > > >
