Hi Renaud,

I know about readdb but, unless I am missing something, it doesn't know which segment a URL is stored in. I'm after the information stored in the segment for a URL, not the information in the crawldb.

I'm pretty sure the indexing process includes some kind of link from a URL to the data in a segment for that URL, but I'm still looking for it.

Cheers,
Carl.

[EMAIL PROTECTED] wrote:
hi Carl,

see http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20readdb

- Renaud


Carl Cerecke wrote:
Hi,

How do I get the page information from whichever segment it is in, given a URL?

I'm basically looking for a class to use from the command line which, given an index and a URL, returns the information for that URL from whichever segment it is in. Similar to SegmentReader -get, but without having to specify the segment.

This seems like it should be relatively simple to do, but it has evaded me thus far.

Is the best approach to merge all the segments (hundreds of them) into one big segment? Would this work? What would the performance be like for this approach?

Cheers,
Carl.


