Jed Reynolds wrote:

> Why am I getting about a 10% loss here? I'm getting a lot more misses than
> expired retry attempts. Can I go through the segments directories and look
> to see what kind of keywords are accepted for each url it crawled? Is it
> possible that these unique terms in the url are just not getting indexed?
> 
> What's the fastest way to find which segment a crawled url gets saved to?

Grep, of course.

 $ grep -r site324172 ../segments
Binary file ../segments/20051220154513/parse_data/data matches
Binary file ../segments/20051220134613/parse_data/data matches
Binary file ../segments/20051220122624/content/data matches
Binary file ../segments/20051220122624/fetchlist/data matches
Binary file ../segments/20051220122624/fetcher/data matches
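If a binary match isn't enough and you want to see what surrounds the term, `strings` can pull the printable runs out of a matching segment file. Just a sketch, using one of the matches above, assuming `strings` and `grep` are on the path:

```shell
# Extract printable runs from the binary segment file and show a little
# context around the term; the path is one of the grep hits from above.
strings ../segments/20051220122624/fetcher/data | grep -B1 -A2 site324172
```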
> Do I have to go back through the fetch log? I've got seven segment
> directories. Is this something I have to do with Luke, going through my
> seven segment directories one segment at a time?

So I opened the segments that grep flagged in Luke. Unfortunately, I
can't find the search term in the indexes of the segments that the grep
matched. In Luke, I restrict the top-terms view to the url field and ask
for the top 400,000 of 490,611 terms, but Luke displays only the first
21,919. Is this a Luke limitation, or an index limitation?
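One crude cross-check that doesn't involve Luke at all: run the same grep against the index directory itself (the `../index` path here is an assumption about your layout). A caveat that matters: Lucene stores its term dictionary prefix-compressed, so a term won't necessarily appear as one contiguous string in the `.tis` files and a miss proves nothing; a hit, though, shows at least part of the term made it into the index files.

```shell
# List index files containing the (possibly prefix-truncated) term bytes.
# A miss is inconclusive because of prefix compression; a hit is suggestive.
grep -rl site324172 ../index
```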

I know that I can find any document that's actually in an index by
searching for its document number as a url term. site59184, for example,
shows up, and a url search in either Luke or Nutch finds it exactly. I
tried some wildcard searching across all the segments (opening each one
in Luke and searching for url:site324*) and there were hits in the
vicinity, but not that document.

So where do I go next? How can I make sure that the documents I feed
Nutch are actually discoverable? How can I tell when Nutch/Lucene
doesn't like a document? And how do I force it in there?
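On the "how can I tell when Nutch doesn't like a document" question, a first pass is to grep whatever log your fetch run produced for failures; the `fetcher.log` filename and the patterns below are assumptions, not something Nutch guarantees:

```shell
# Show numbered log lines that look like failures; both the log name and
# the match patterns are guesses -- adjust them to your actual fetch log.
grep -niE 'fail|exception|error' fetcher.log | head -20
```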

Jed

-- 
Jed Reynolds
System Administrator, PRWeb International, Inc. 360-312-0892



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general