Hi Renaud,
I know about readdb. But, unless I am missing something, it doesn't know
which segment a URL is stored in. I'm after the information stored in
the segment for a URL, not the information in the crawldb.
I'm pretty sure the indexing process records some kind of link from a
URL to the data in a segment for that URL, but I'm still looking for it.
Cheers,
Carl.
[EMAIL PROTECTED] wrote:
Hi Carl,
see http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20readdb
- Renaud
Carl Cerecke wrote:
Hi,
How do I get the page information from whichever segment it is in,
given a URL?
I'm basically looking for a class to use from the command-line which,
given an index and a url, returns me the information for that url from
whichever segment it is in. Similar to SegmentReader -get, but without
having to specify the segment.
This seems like it should be relatively simple to do, but it has
eluded me thus far...
Is the best approach to merge all the segments (hundreds of them) into
one big segment? Would this work? What would the performance be like
for this approach?
Cheers,
Carl.