Re: Getting page information given the URL

Carl Cerecke Thu, 30 Aug 2007 17:01:56 -0700

Hi Renaud,

I know about readdb. But, unless I am missing something, it doesn't know which segment a URL is stored in. I'm after the information stored in the segment for a URL, not the information in the crawldb.

I'm pretty sure the indexing process includes some kind of link from a URL to the data in a segment for that URL, but I'm still looking...
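
Until that link turns up, one brute-force fallback is to ask every segment in turn. The following is only a minimal sketch in Python, assuming that `bin/nutch readseg -get <segment> <url>` is available as the command-line form of SegmentReader -get and that all segments sit under one directory. The function names and the default `bin/nutch` path are hypothetical, chosen for illustration.

```python
import os
import subprocess


def build_readseg_commands(segments_dir, url, nutch_bin="bin/nutch"):
    """Return one `readseg -get` argument list per segment directory.

    Assumption: every subdirectory of segments_dir is a Nutch segment.
    """
    segments = sorted(
        name for name in os.listdir(segments_dir)
        if os.path.isdir(os.path.join(segments_dir, name))
    )
    return [
        [nutch_bin, "readseg", "-get", os.path.join(segments_dir, seg), url]
        for seg in segments
    ]


def lookup(segments_dir, url, nutch_bin="bin/nutch"):
    """Run each command in turn and yield (segment_path, stdout).

    The output is returned unparsed; deciding which segment actually
    holds the URL is left to the caller.
    """
    for cmd in build_readseg_commands(segments_dir, url, nutch_bin):
        result = subprocess.run(cmd, capture_output=True, text=True)
        yield cmd[3], result.stdout
```

Something like `for seg, out in lookup("crawl/segments", "http://example.com/"): print(seg, out)` would walk the segments in name order. Each call starts a new JVM, so with hundreds of segments this is likely to be slow; it is a stopgap, not a replacement for a proper URL-to-segment lookup.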


Cheers,
Carl.

[EMAIL PROTECTED] wrote:

hi Carl,

see http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20readdb

- Renaud


Carl Cerecke wrote:
Hi,
How do I get the page information from whichever segment it is in, given a URL?
I'm basically looking for a class to use from the command-line which, given an index and a url, returns me the information for that url from whichever segment it is in. Similar to SegmentReader -get, but without having to specify the segment.
This seems like it should be relatively simple to do, but it has evaded me thus far...
Is the best approach to merge all the segments (hundreds of them) into one big segment? Would this work? What would the performance be like for this approach?
Cheers,
Carl.
