This looks like it should work, but how can I get lucene-search to do an exact match for the URL?

I've tried:
bin/nutch org.apache.nutch.searcher.NutchBean url:<url-here>
but I can't get it to match accurately, no matter how I mangle and quote what I put in <url-here>.

I've also tried Luke, but I can't get that to exactly match a URL either. When searching for url:foo I can see http://www.foo.co.nz among the matches, yet I don't seem to be able to match that URL (and only that URL) in the general case.

Perhaps it is because the url is parsed into bits and not indexed as a whole string, including punctuation? This is despite the fact that the punctuation seems very much preserved intact in the index file crawl/indexes/part-00000/_mq0.fdt
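
To make that hypothesis concrete, here is a toy model in Python. It is not Lucene's or Nutch's actual analyzer; tokenize and build_index are names made up for this sketch. It only shows why a lookup for the whole URL string can miss when the field was split into terms at index time, while a lookup for a fragment such as foo hits several documents. As far as I know, the .fdt file holds stored field values rather than the indexed terms, which would explain why the punctuation looks intact there.

```python
import re


def tokenize(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_index(urls, tokenized):
    """Map each indexed term to the set of document IDs that contain it."""
    index = {}
    for doc_id, url in enumerate(urls):
        # Tokenized: one term per URL fragment. Untokenized: the whole URL is one term.
        terms = tokenize(url) if tokenized else [url]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


# Example URLs; the second one is invented for illustration.
urls = ["http://www.foo.co.nz", "http://www.bar.co.nz/foo.html"]

tokenized_index = build_index(urls, tokenized=True)
whole_index = build_index(urls, tokenized=False)

# A fragment matches every document whose URL contains it.
print(tokenized_index.get("foo", set()))

# The full URL is not a term in the tokenized index, so an exact lookup finds nothing.
print(tokenized_index.get("http://www.foo.co.nz", set()))

# With the URL indexed as a single term, the exact lookup finds exactly one document.
print(whole_index.get("http://www.foo.co.nz", set()))
```

If this model is roughly right, an exact match needs either a field that was indexed untokenized or a query that requires all of the URL's terms in order.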

To work around this, I notice that all documents indexed have a document ID. If I could map the url to the document ID, and from there get the document, then that would be suitable. Any ideas?

Cheers,
Carl.

Robeyns Bart wrote:
The segment is recorded as a field in the Lucene index. One easy way to do it would be to:
- do a Lucene search for the URL,
- read the "segment" field from the resulting Lucene Document, and
- call SegmentReader with this value as the segment argument.
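
The three steps above can be sketched roughly as follows. This is a toy outline in Python, not the Nutch or Lucene API: toy_index stands in for the Lucene index, the segment name is invented, and the crawl/segments directory and the "-get <segment_dir> <url>" argument order for SegmentReader are assumptions that should be checked against the usage message of your Nutch version.

```python
import shlex

# Hypothetical stand-in for the Lucene index: URL -> stored fields of the matching document.
# The segment name below is invented for illustration.
toy_index = {
    "http://www.foo.co.nz": {"segment": "20070830063000"},
}


def find_segment(index, url):
    """Steps 1 and 2: look the URL up and read the "segment" field, or return None."""
    doc = index.get(url)
    return None if doc is None else doc["segment"]


def segment_reader_command(segments_dir, segment, url):
    """Step 3: build the SegmentReader call (argument order is an assumption)."""
    return [
        "bin/nutch",
        "org.apache.nutch.segment.SegmentReader",
        "-get",
        f"{segments_dir}/{segment}",
        url,
    ]


url = "http://www.foo.co.nz"
segment = find_segment(toy_index, url)
if segment is not None:
    # Print the command instead of running it, so the sketch works without Nutch installed.
    command = segment_reader_command("crawl/segments", segment, url)
    print(" ".join(shlex.quote(part) for part in command))
```

The lookup in step 1 still depends on being able to match the URL exactly in the real index.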

Bart Robeyns
Panoptic




-----Original Message-----
From: Carl Cerecke [mailto:[EMAIL PROTECTED]
Sent: Thu 8/30/2007 6:30
To: [email protected]
Subject: Getting page information given the URL
Hi,

How do I get the page information from whichever segment it is in, given a URL?

I'm basically looking for a class to use from the command-line which, given an index and a url, returns me the information for that url from whichever segment it is in. Similar to SegmentReader -get, but without having to specify the segment.

This seems like it should be relatively simple to do, but it has evaded me thus far...

Is the best approach to merge all the segments (hundreds of them) into one big segment? Would this work? What would the performance be like for this approach?

Cheers,
Carl.


