When I run a fetch, Nutch only gives me depth level 0, which is the website home page. How can I get Nutch to fetch deeper than that, i.e. follow the links on the home page and fetch those pages as well? Any ideas please?

Thanks a lot,
Jenny
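The depth is controlled by how many generate/fetch/updatedb rounds you run, or by the -depth flag of the one-shot crawl command. A sketch, assuming a Nutch 0.8/0.9-style install with a urls/ seed directory (adjust paths and -topN to your setup):

```shell
# One-shot crawl: -depth controls how many link levels beyond the
# seed pages are fetched (3 = home page plus two levels of links).
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

# Equivalent manual cycle: each generate/fetch/updatedb round goes
# one level deeper into the link graph.
bin/nutch inject crawl/crawldb urls
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done
```

With a single fetch round and no updatedb, only the injected seed URLs are fetched, which matches the "depth 0 only" behaviour described above.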
Carl Cerecke <[EMAIL PROTECTED]> wrote:

I have solved this problem by opening the index using
org.apache.lucene.index.IndexReader to read all the Documents contained
therein and creating a map from url to segment and document id. I can
then use SegmentReader to get the contents for that url. Ugly, but it
works.

Carl.

Carl Cerecke wrote:
> This looks like it should work, but how can I get lucene-search to do
> an exact match for the URL?
>
> I've tried:
>   bin/nutch org.apache.nutch.searcher.NutchBean url:
> but I can't get it to work accurately no matter how I mangle and quote
> what I put in.
>
> I've tried Luke also, but I can't get that to exactly match a url
> either. Despite the fact that when searching for url:foo I can see,
> among the matches, http://www.foo.co.nz, I don't seem to be able to
> specifically match that (and only that) url in the general case.
>
> Perhaps it is because the url is parsed into bits and not indexed as a
> whole string, including punctuation? This is despite the fact that the
> punctuation seems very much preserved intact in the index file
> crawl/indexes/part-00000/_mq0.fdt
>
> To work around this, I notice that all documents indexed have a
> document ID. If I could map the url to the document ID, and from there
> get the document, then that would be suitable. Any ideas?
>
> Cheers,
> Carl.
>
> Robeyns Bart wrote:
>> The segment is recorded as a field in the Lucene index. One easy way
>> to do it would be to:
>> - do a Lucene search for the url,
>> - read the "segment" field from the resulting Lucene Document, and
>> - call SegmentReader with this value as the segment argument.
>>
>> Bart Robeyns
>> Panoptic
>>
>> -----Original Message-----
>> From: Carl Cerecke [mailto:[EMAIL PROTECTED]
>> Sent: Thu 8/30/2007 6:30
>> To: [email protected]
>> Subject: Getting page information given the URL
>>
>> Hi,
>>
>> How do I get the page information from whichever segment it is in,
>> given a URL?
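Carl's workaround can be sketched roughly as below, using Lucene 1.9/2.x-era APIs of the kind Nutch shipped with at the time. The index path, the "url" and "segment" field names, and the class name UrlToSegment are assumptions for illustration; treat this as a sketch, not tested code:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

// Walk every document in the index and build url -> segment and
// url -> doc-id maps, sidestepping the tokenized "url" field that
// defeats exact-match queries.
public class UrlToSegment {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("crawl/indexes/part-00000");
        Map urlToSegment = new HashMap();  // url -> segment name
        Map urlToDocId = new HashMap();    // url -> Lucene doc id
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;  // skip deleted docs
            Document doc = reader.document(i);
            urlToSegment.put(doc.get("url"), doc.get("segment"));
            urlToDocId.put(doc.get("url"), new Integer(i));
        }
        reader.close();

        String url = args[0];
        System.out.println(url + " -> segment " + urlToSegment.get(url)
                + ", doc " + urlToDocId.get(url));
    }
}
```

The segment name looked up this way can then be handed to SegmentReader -get as Bart suggested, with the url as the key.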
>>
>> I'm basically looking for a class to use from the command line which,
>> given an index and a url, returns me the information for that url
>> from whichever segment it is in. Similar to SegmentReader -get, but
>> without having to specify the segment.
>>
>> This seems like it should be relatively simple to do, but it has
>> evaded me thus far...
>>
>> Is the best approach to merge all the segments (hundreds of them)
>> into one big segment? Would this work? What would the performance be
>> like for this approach?
>>
>> Cheers,
>> Carl.
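On the merge question: Nutch 0.8/0.9 ships a segment merger, so one way to avoid specifying the segment is to merge them all and read from the merged one. A hedged sketch (tool names and argument order as in Nutch 0.9; run bin/nutch with no arguments to confirm the exact usage in your version, and the paths here are placeholders):

```shell
# Merge all segments under crawl/segments into one segment in crawl/MERGED
bin/nutch mergesegs crawl/MERGED -dir crawl/segments

# Then dump the entry for a single URL from the merged segment
bin/nutch readseg -get crawl/MERGED/* http://www.foo.co.nz/
```

Merging hundreds of segments is a heavy MapReduce job and must be re-run after each crawl round, so for repeated lookups the in-memory url-to-segment map described above is likely the lighter option.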
