Re: Getting page information given the URL

Carl Cerecke Sun, 02 Sep 2007 21:30:03 -0700

This looks like it should work, but how can I get lucene-search to do an exact match for the URL?


I've tried:
bin/nutch org.apache.nutch.searcher.NutchBean url:<url-here>

but I can't get it to work accurately no matter how I mangle and quote what I put in <url-here>.

I've tried Luke also, but I can't get that to exactly match a url either. Despite the fact that when searching for url:foo I can see, among the matches, http://www.foo.co.nz, I don't seem to be able to specifically match that (and only that) url in the general case.

Perhaps it is because the url is parsed into bits and not indexed as a whole string, including punctuation? This is despite the fact that the punctuation seems very much preserved intact in the index file crawl/indexes/part-00000/_mq0.fdt
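
To illustrate that hypothesis, here is a toy inverted index in Python. It is not Nutch or Lucene code: `tokenize` and `build_index` are made-up helpers, and the tokenizer is only a rough stand-in for whatever analyzer is applied to the url field. It shows how a field that is split into tokens can match url:foo broadly while offering no term equal to the whole URL.

```python
import re

def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (a rough stand-in for an analyzer)."""
    return [t for t in re.split(r"[^0-9a-z]+", url.lower()) if t]

def build_index(urls, analyzed):
    """Build a term -> set-of-doc-ids map, tokenizing each URL or keeping it whole."""
    index = {}
    for doc_id, url in enumerate(urls):
        terms = tokenize(url) if analyzed else [url]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

# Hypothetical sample URLs, used only for this illustration.
urls = [
    "http://www.foo.co.nz",
    "http://www.foo.co.nz/about",
    "http://bar.org/foo",
]

analyzed_index = build_index(urls, analyzed=True)
verbatim_index = build_index(urls, analyzed=False)

# The token "foo" occurs in every sample URL, so it cannot single one out.
print(sorted(analyzed_index.get("foo", set())))
# The whole URL is not a term in the tokenized index at all.
print(sorted(analyzed_index.get("http://www.foo.co.nz", set())))
# Indexed as a single term, the URL matches exactly one document.
print(sorted(verbatim_index.get("http://www.foo.co.nz", set())))
```

If the url field really is tokenized this way, an exact match would need either a field that indexes the URL as one term or a phrase-style query over the tokens, and the latter could still match longer URLs that contain the same token sequence.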

To work around this, I notice that all documents indexed have a document ID. If I could map the url to the document ID, and from there get the document, then that would be suitable. Any ideas?
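
A minimal sketch of that mapping idea, in Python rather than against the real index: it assumes the stored url value of every document can be read back intact, which the .fdt observation suggests, and walks all documents once to build a url to document ID table. `build_url_map`, the `stored_docs` list and the segment names are hypothetical. Note that Lucene document IDs can change when an index is merged or optimized, so such a table would need rebuilding after the index changes.

```python
def build_url_map(stored_docs):
    """Map each stored URL string to the position (document ID) of its document."""
    url_to_id = {}
    for doc_id, fields in enumerate(stored_docs):
        url_to_id[fields["url"]] = doc_id
    return url_to_id

# Hypothetical stored fields, standing in for what an index reader would return.
stored_docs = [
    {"url": "http://www.foo.co.nz", "segment": "20070830063000"},
    {"url": "http://www.foo.co.nz/about", "segment": "20070830063000"},
    {"url": "http://bar.org/foo", "segment": "20070831101500"},
]

url_to_id = build_url_map(stored_docs)

# Exact lookup by URL, then fetch the document by its ID.
doc_id = url_to_id["http://bar.org/foo"]
print(doc_id, stored_docs[doc_id]["segment"])
```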


Cheers,
Carl.

Robeyns Bart wrote:

The segment is recorded as a field in the Lucene-index. One easy way to do it 
would be to:
- do a Lucene-search for the url,
- read the "segment"-field from the resulting Lucene-Document and
- call SegmentReader with this value as the segment-argument.
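
The three steps above can be sketched as follows. This is a Python illustration under stated assumptions, not Nutch code: the index is modelled as a list of dicts with stored "url" and "segment" fields, `segment_for_url` and `segment_reader_command` are hypothetical helpers, and the class name org.apache.nutch.segment.SegmentReader, the crawl/segments layout and the `-get <segment_dir> <url>` argument order are assumptions to check against your Nutch version.

```python
def segment_for_url(index_docs, url):
    """Steps 1 and 2: find the single document for url and read its segment field."""
    matches = [doc for doc in index_docs if doc["url"] == url]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match for {url!r}, got {len(matches)}")
    return matches[0]["segment"]

def segment_reader_command(segments_dir, segment, url):
    """Step 3: build the argument list for a SegmentReader -get call (assumed syntax)."""
    return [
        "bin/nutch",
        "org.apache.nutch.segment.SegmentReader",  # assumed class name
        "-get",
        f"{segments_dir}/{segment}",
        url,
    ]

# Hypothetical index contents; real values would come from the Lucene search.
index_docs = [
    {"url": "http://www.foo.co.nz", "segment": "20070830063000"},
    {"url": "http://bar.org/foo", "segment": "20070831101500"},
]

url = "http://www.foo.co.nz"
segment = segment_for_url(index_docs, url)
command = segment_reader_command("crawl/segments", segment, url)
print(" ".join(command))
```

The argument list could be handed to `subprocess.run` from the Nutch home directory; the sketch stops at building it so that nothing is executed against a real crawl.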
Bart Robeyns
Panoptic




-----Original Message-----
From: Carl Cerecke [mailto:[EMAIL PROTECTED]
Sent: Thu 8/30/2007 6:30
To: [email protected]
Subject: Getting page information given the URL
Hi,
How do I get the page information from whichever segment it is in, given a URL?
I'm basically looking for a class to use from the command-line which, given an index and a url, returns me the information for that url from whichever segment it is in. Similar to SegmentReader -get, but without having to specify the segment.
This seems like it should be relatively simple to do, but it has evaded me thus far...
Is the best approach to merge all the segments (hundreds of them) into one big segment? Would this work? What would the performance be like for this approach?
Cheers,
Carl.



