hi again,

I forgot to ask: what does _repr_ mean?
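
From the reply quoted below, _repr_ looks like the crawlDB metadata key that holds the "representative" URL of a redirected page, i.e. the URL the fetched document ends up indexed under. Here is a minimal sketch of that reading, written in Python rather than Nutch's Java. The helper choose_indexed_url and the URL https://myDNS/other.html are made up for illustration and are not part of Nutch; only the key name and the index.html URLs come from the readdb output quoted below.

```python
# Metadata key as it appears in the readdb output quoted below.
REPR_KEY = "_repr_"


def choose_indexed_url(fetched_url, metadata):
    """Hypothetical helper (not a Nutch API): pick the URL a page is indexed under.

    If the crawlDB metadata carries a representative URL, use it;
    otherwise fall back to the URL that was actually fetched.
    """
    repr_url = metadata.get(REPR_KEY)
    return repr_url if repr_url else fetched_url


# Metadata of the fetched https page, as shown by readdb in this thread.
https_meta = {
    "_pst_": "success(1), lastModified=0",
    REPR_KEY: "http://myDNS/index.html",
}

# The https content would be indexed under the http URL.
print(choose_indexed_url("https://myDNS/index.html", https_meta))
# prints "http://myDNS/index.html"

# A page without _repr_ (assumed example URL) keeps its own URL.
print(choose_indexed_url("https://myDNS/other.html", {}))
# prints "https://myDNS/other.html"
```

If this reading is right, the https content and the http URL are supposed to end up in the same index document.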



> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: Content of redirected urls empty
> Date: Mon, 15 Mar 2010 15:29:48 +0000
> 
> 
> 
> 
> Oh sorry, I was mistaken again, and yes, you are completely right:
> 1- The HTTPS URL has content in my segment.
> 2- The HTTP URL has empty content.
> 
> In my index I have the HTTPS URL with the empty content (it's exactly
> what you said: it's just mixing the HTTPS URL with the content of the
> HTTP one), and I expected the other way round: the HTTPS content
> *with* the HTTP URL.
> 
> 
> I don't know if I have the HTTP URL in my index; I don't know how to see all
> the indexed URLs in Solr. But I'm sure that when I perform a search using RMS
> I obtain only the HTTPS URL with empty content (I guess it's the empty
> content of the HTTP one).
> But again, in the segment the content of the HTTPS URL is not empty.
> 
> 
> 
> > Date: Mon, 15 Mar 2010 13:44:33 +0000
> > Subject: Re: Content of redirected urls empty
> > From: lists.digitalpeb...@gmail.com
> > To: nutch-user@lucene.apache.org
> > 
> > >
> > > and as I said the other day, in my segment the https has an empty content.
> > 
> > 
> > hmm it's not what you said in your previous message + I can see it has a
> > signature in the crawlDB so it must have a content.
> > 
> > I expect that the content would be indexed under the http:// URL thanks to
> > _repr_: http://myDNS/index.html
> > 
> > See BasicIndexingFilter for details.
> > 
> > > it's just mixing the HTTPS url with the content of the HTTP one.
> > 
> > 
> > it should be the other way round : the HTTPS content *with* the HTTP URL.
> > Actually the http:// document is not sent to the index at all (see around
> > line 86 in IndexerMapReduce) so what you are seeing in the index must be
> > the https doc with _repr_ used as a URL.
> > 
> > can you please confirm that :
> > 1/ the segment has a content for the https:// doc
> > 2/ you can find the http:// URL in the index and it has no content
> > 
> > HTH
> > 
> > Julien
> > 
> > -- 
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> > On 15 March 2010 13:00, BELLINI ADAM <mbel...@msn.com> wrote:
> > 
> > >
> > > Hi
> > > thx for your help,
> > >
> > > this is a fresh crawl from today:
> > >
> > >
> > > 1- HTTP:
> > > bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html
> > >
> > > URL: http://myDNS/index.html
> > > Version: 7
> > > Status: 4 (db_redir_temp)
> > > Fetch time: Mon Mar 15 12:15:52 EDT 2010
> > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > Retries since fetch: 0
> > > Retry interval: 36000 seconds (0 days)
> > > Score: 0.018119827
> > > Signature: null
> > > Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html
> > >
> > >
> > >
> > >
> > > 2- HTTPS:
> > > bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html
> > >
> > > URL: https://myDNS/index.html
> > > Version: 7
> > > Status: 2 (db_fetched)
> > > Fetch time: Mon Mar 15 12:32:34 EDT 2010
> > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > Retries since fetch: 0
> > > Retry interval: 36000 seconds (0 days)
> > > Score: 0.00511379
> > > Signature: 5f84dcec905c24e3e2af902ad9ad7398
> > > Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html
> > >
> > >
> > >
> > >
> > >
> > >
> > > and as I said the other day, in my segment the https has an empty content.
> > >
> > > thx
> > >
> > >
> > > > Date: Mon, 15 Mar 2010 11:39:46 +0000
> > > > Subject: Re: Content of redirected urls empty
> > > > From: lists.digitalpeb...@gmail.com
> > > > To: nutch-user@lucene.apache.org
> > > >
> > > > Adam,
> > > >
> > > > Could you please tell us what the http and https entries look like in the
> > > > crawlDB (using readdb -url)?
> > > >
> > > > J.
> > > > --
> > > > DigitalPebble Ltd
> > > > http://www.digitalpebble.com
> > > >
> > > > On 13 March 2010 04:29, BELLINI ADAM <mbel...@msn.com> wrote:
> > > >
> > > > >
> > > > > Does no one have an answer?!
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > From: mbel...@msn.com
> > > > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > > > Subject: RE: Content of redirected urls empty
> > > > > > Date: Wed, 10 Mar 2010 21:01:54 +0000
> > > > > >
> > > > > >
> > > > > > I read a lot of posts regarding redirected URLs but didn't find a
> > > > > > solution!
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: mbel...@msn.com
> > > > > > > To: nutch-user@lucene.apache.org; mille...@gmail.com
> > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > Date: Tue, 9 Mar 2010 16:59:05 +0000
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > hi,
> > > > > > >
> > > > > > > I don't know if you found a few minutes to look at my problem :)
> > > > > > >
> > > > > > > but I want to explain it again, maybe it wasn't clear:
> > > > > > >
> > > > > > >
> > > > > > > I have HTTP pages redirected to HTTPS (but it's the same URL):
> > > > > > >
> > > > > > > HTTP://page1.com   redirected to HTTPS://page1.com
> > > > > > >
> > > > > > > The content of my HTTP page is empty.
> > > > > > > The content of my HTTPS page is not empty.
> > > > > > >
> > > > > > > In my segment I found both URLs (HTTP and HTTPS); the content
> > > > > > > of the HTTPS page is not empty,
> > > > > > >
> > > > > > > but in my index I found the HTTP one with the empty content.
> > > > > > >
> > > > > > > Is there a way to tell Nutch to index the URL with the non-empty
> > > > > > > content? Or why doesn't Nutch index the target URL rather than
> > > > > > > indexing the empty (origin) one?
> > > > > > >
> > > > > > > thx a lot
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: mbel...@msn.com
> > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > > Date: Mon, 8 Mar 2010 17:08:06 +0000
> > > > > > > >
> > > > > > > >
> > > > > > > > I'm sorry... I just checked twice, and in my index I have the
> > > > > > > > original URL, which is the HTTP one with the empty content, but it
> > > > > > > > doesn't index the HTTPS one. I'm using the Solr index.
> > > > > > > > thx
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: mbel...@msn.com
> > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > > > Date: Mon, 8 Mar 2010 17:01:34 +0000
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi, I've just dumped my segments and found that I have both
> > > > > > > > > URLs: the original one (HTTP) with empty content and the
> > > > > > > > > redirected-to, or destination, URL (HTTPS) with non-empty content!
> > > > > > > > >
> > > > > > > > > But in my search I found only the HTTPS URL with empty
> > > > > > > > > content! Logically the content of the HTTPS URL is not empty!
> > > > > > > > > It's just mixing the HTTPS URL with the content of the HTTP
> > > > > > > > > one.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Our redirect is done by Java code, response.sendRedirect(…),
> > > > > > > > > so it seems to be an HTTP redirect, right?
> > > > > > > > >
> > > > > > > > > thx for helping me :)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Date: Mon, 8 Mar 2010 15:51:34 +0100
> > > > > > > > > > From: a...@getopt.org
> > > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > > Subject: Re: Content of redirected urls empty
> > > > > > > > > >
> > > > > > > > > > On 2010-03-08 14:55, BELLINI ADAM wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > is there any idea guys ??
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >> From: mbel...@msn.com
> > > > > > > > > > >> To: nutch-user@lucene.apache.org
> > > > > > > > > > >> Subject: Content of redirected urls empty
> > > > > > > > > > >> Date: Fri, 5 Mar 2010 22:01:05 +0000
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> hi,
> > > > > > > > > > >> the content of my redirected URLs is empty... but they still
> > > > > > > > > > >> have the other metadata...
> > > > > > > > > > >> I have an HTTP URL that is redirected to HTTPS.
> > > > > > > > > > >> In my index I find the HTTP URL but with empty content...
> > > > > > > > > > >> Could you explain it, please?
> > > > > > > > > >
> > > > > > > > > > There are two ways to redirect - one is with protocol, and the other is
> > > > > > > > > > with content (either meta refresh, or javascript).
> > > > > > > > > >
> > > > > > > > > > When you dump the segment, is there really no content for the redirected
> > > > > > > > > > url?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > > Andrzej Bialecki     <><
> > > > > > > > > >   ___. ___ ___ ___ _ _   __________________________________
> > > > > > > > > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > > > > > > > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > > > > > > > > http://www.sigram.com  Contact: info at sigram dot com
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >