This looks like it should work, but how can I get a Lucene search to do an
exact match on the URL?
I've tried:
bin/nutch org.apache.nutch.searcher.NutchBean url:<url-here>
but I can't get it to match accurately, no matter how I mangle and quote
what I put in <url-here>.
I've tried Luke as well, but I can't get that to match a URL exactly
either. When I search for url:foo I can see http://www.foo.co.nz among
the matches, but I don't seem to be able to match that URL (and only
that URL) specifically, in the general case.
Perhaps it is because the url is parsed into bits and not indexed as a
whole string, including punctuation? This is despite the fact that the
punctuation seems very much preserved intact in the index file
crawl/indexes/part-00000/_mq0.fdt
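
A toy sketch of why this can happen, assuming the url field is tokenized at index time. The tokenizer below is a hypothetical stand-in (lowercase, split on anything that is not a letter or digit), not the analyzer Nutch actually uses, and the class and method names are made up for illustration. Note also that in Lucene the .fdt file holds stored field values, so seeing the punctuation there does not by itself show that the indexed terms contain it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy illustration only: this is NOT Nutch's or Lucene's analyzer.
class UrlTokenDemo {

    // Hypothetical analyzer stand-in: lowercase the text, then split on
    // every run of characters that is not an ASCII letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // A single-term lookup matches only if the query string is itself
    // one of the indexed terms.
    static boolean termMatches(List<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    // A phrase lookup matches if the query tokens occur consecutively
    // and in order among the indexed terms.
    static boolean phraseMatches(List<String> indexedTerms, List<String> phrase) {
        return !phrase.isEmpty()
                && Collections.indexOfSubList(indexedTerms, phrase) >= 0;
    }

    public static void main(String[] args) {
        String url = "http://www.foo.co.nz";
        List<String> indexed = tokenize(url);
        System.out.println(indexed);                   // [http, www, foo, co, nz]
        System.out.println(termMatches(indexed, url)); // whole URL is not a term
        System.out.println(termMatches(indexed, "foo"));

        // A phrase built from the URL also matches longer URLs, so it is
        // still not a match for "that (and only that)" URL.
        List<String> longer = tokenize("http://www.foo.co.nz/about");
        System.out.println(phraseMatches(longer, tokenize(url)));
    }
}
```

Under this assumption, the whole URL never exists as a single term, so no amount of quoting makes a term query hit it, and a phrase query still matches every URL that contains the same token sequence. An exact match would generally need the value indexed as one untokenized term, or a comparison against the stored value after a looser search.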
To work around this, I notice that all documents indexed have a document
ID. If I could map the url to the document ID, and from there get the
document, then that would be suitable. Any ideas?
Cheers,
Carl.
Robeyns Bart wrote:
The segment is recorded as a field in the Lucene index. One easy way to do it
would be to:
- do a Lucene search for the URL,
- read the "segment" field from the resulting Lucene Document, and
- call SegmentReader with this value as the segment argument.
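
The steps above can be sketched roughly as follows. This is a minimal, library-free illustration rather than working Nutch code: search hits are modelled as plain maps of stored field values instead of Lucene Document objects, the class and method names are hypothetical, and the crawl/segments directory layout and the "-get <segment_dir> <url>" argument order are assumptions that should be checked against the SegmentReader usage message of the Nutch version in use. Comparing the stored "url" value for equality is an addition of this sketch, meant to pick the one exact URL out of a looser set of search hits.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch only: hits are modelled as maps of stored field values,
// standing in for the Lucene Documents a real search would return.
class SegmentLookupSketch {

    // Build a stand-in for a stored document with "url" and "segment" fields.
    static Map<String, String> doc(String url, String segment) {
        Map<String, String> fields = new HashMap<>();
        fields.put("url", url);
        fields.put("segment", segment);
        return fields;
    }

    // Steps 1 and 2: among the hits of a (possibly loose) url search, keep
    // the one whose stored url equals the wanted URL and read its segment.
    static Optional<String> findSegment(List<Map<String, String>> hits, String url) {
        for (Map<String, String> hit : hits) {
            if (url.equals(hit.get("url"))) {
                return Optional.ofNullable(hit.get("segment"));
            }
        }
        return Optional.empty();
    }

    // Step 3: arguments to hand to SegmentReader (assumed argument order).
    static List<String> segmentReaderArgs(String segmentsDir, String segment, String url) {
        return Arrays.asList("-get", segmentsDir + "/" + segment, url);
    }

    public static void main(String[] args) {
        // Hypothetical hits and segment names, for illustration only.
        List<Map<String, String>> hits = Arrays.asList(
                doc("http://www.foo.co.nz/about", "20070829120000"),
                doc("http://www.foo.co.nz", "20070830063000"));

        String url = "http://www.foo.co.nz";
        Optional<String> segment = findSegment(hits, url);
        if (segment.isPresent()) {
            System.out.println(segmentReaderArgs("crawl/segments", segment.get(), url));
        } else {
            System.out.println("No indexed document with url " + url);
        }
    }
}
```

In a real setup the list of hits would come from the Lucene search and the resulting arguments would be passed to SegmentReader, which avoids both guessing the segment and merging hundreds of segments into one.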
Bart Robeyns
Panoptic
-----Original Message-----
From: Carl Cerecke [mailto:[EMAIL PROTECTED]]
Sent: Thu 8/30/2007 6:30
To: [email protected]
Subject: Getting page information given the URL
Hi,
How do I get the page information from whichever segment it is in, given
a URL?
I'm basically looking for a class to use from the command-line which,
given an index and a url, returns me the information for that url from
whichever segment it is in. Similar to SegmentReader -get, but without
having to specify the segment.
This seems like it should be relatively simple to do, but it has evaded
me thus far...
Is the best approach to merge all the segments (hundreds of them) into
one big segment? Would this work? What would the performance be like for
this approach?
Cheers,
Carl.