Jed Reynolds wrote:

> Why am I getting about 10% loss here? I'm getting a lot more misses than
> expired retry attempts. Can I go thru the segments directories, and look
> to see what kind of keywords are accepted for each url it crawled? Is it
> possible that these unique terms in the url are just not getting indexed?
>
> What's the fastest way to find what segment a crawled url gets saved to?
Grep, of course:

$ grep -r site324172 ../segments
Binary file ../segments/20051220154513/parse_data/data matches
Binary file ../segments/20051220134613/parse_data/data matches
Binary file ../segments/20051220122624/content/data matches
Binary file ../segments/20051220122624/fetchlist/data matches
Binary file ../segments/20051220122624/fetcher/data matches

> Do I have to go back thru the fetch log? I've got seven segment
> directories. Is this something I have to do with Luke, and go thru my
> seven segment directories with Luke, one segment at a time?

So I look at the segments that grep flagged, using Luke. Unfortunately, I can't find the search term in the indexes for those segments, even though grep found it in the segment data. In Luke, I look at the top terms, limited to the url field, and ask for the top 400,000 of the 490,611 terms, but Luke displays only the first 21,919. Is this a Luke limitation, or an index limitation?

I know that I can find any document that's actually in an index by searching for its document number as a url term. I know site59184 shows up, and I can do a url search in Luke or Nutch to find it exactly. I tried some wildcard searching across all the segments (opening each one in Luke and searching for url:site324*), and there were some things in the vicinity, but no document.

So where do I go next? How can I make sure that the documents I feed Nutch are actually discoverable? How can I tell when Nutch/Lucene doesn't like a document? And how do I force it in there?

Jed

-- 
Jed Reynolds
System Administrator, PRWeb International, Inc.
360-312-0892
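P.S. For anyone scripting the "which segment has this url" check across all seven segment directories, a minimal loop over the segments tree might look like the sketch below. The directory layout and the site324172 token come from the thread; the mktemp setup is only there to make the sketch runnable anywhere, and stands in for a real ../segments tree.

```shell
#!/bin/sh
# Sketch: report which Nutch segment directories contain a given site id,
# by grepping each segment in turn (grep will match inside the binary
# parse_data/content/fetcher files, as seen in the thread).

# --- mock setup: stand-in for a real ../segments tree ---
SEGMENTS=$(mktemp -d)
mkdir -p "$SEGMENTS/20051220154513/parse_data" "$SEGMENTS/20051220122624/content"
printf 'site324172' > "$SEGMENTS/20051220154513/parse_data/data"
printf 'something-else' > "$SEGMENTS/20051220122624/content/data"

# --- the actual check: one line per segment that mentions the token ---
matches=""
for seg in "$SEGMENTS"/*; do
    if grep -q -r 'site324172' "$seg"; then
        matches="$matches $(basename "$seg")"
        echo "match in segment: $(basename "$seg")"
    fi
done

rm -rf "$SEGMENTS"
```

This narrows the follow-up work in Luke to just the segments that actually hold the url, instead of opening all seven indexes one at a time.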
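P.P.S. grep only says "Binary file ... matches", which doesn't show what keywords surround the url inside the segment data. One way to peek at that (a sketch, not a Nutch tool: the record layout below is fabricated, and a real segment's data file would be e.g. ../segments/20051220154513/parse_data/data) is to pull the printable text out with strings(1) and grep with context:

```shell
#!/bin/sh
# Sketch: inspect the text stored near a url inside a binary segment
# data file, using strings + grep -C. The mock file simulates binary
# data with embedded text records separated by NUL bytes.

DATA=$(mktemp)
printf 'title keyword-one\0url site324172\0body keyword-two\0' > "$DATA"

# printable strings, with one line of context around the url of interest
out=$(strings "$DATA" | grep -C 1 'site324172')
echo "$out"

rm -f "$DATA"
```

That at least shows whether the terms you expect are present in the raw segment data, which helps separate "never parsed" from "parsed but not indexed" when the term then fails to turn up in Luke.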
