I indexed about 120,000 URLs yesterday. Then I parsed the document
name out of each URL and ran searches on it (in a rather crude way
[1]). I came up with about 13,000 misses; roughly 3,800 of those were
(presumably) due to expired retry attempts.
Why am I seeing about 10% loss here? I'm getting far more misses than
expired retry attempts. Can I go through the segment directories and see
what keywords were actually indexed for each URL that was crawled? Is it
possible that these unique terms from the URL are simply not getting
indexed? What's the fastest way to find which segment a crawled URL was
saved to? Do I have to go back through the fetch log? I've got seven
segment directories. Is this something I have to do with Luke, opening
each of the seven segments one at a time?
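In case it helps, below is roughly what I had in mind for hunting a URL
across the segments. It's only a sketch: I'm assuming the segment dump
command is "bin/nutch segread -dump <segment>" and that it writes to
stdout, which may not match this Nutch version (check "bin/nutch" with no
arguments for the real command and flags), and SEGMENTS_DIR and
TARGET_URL are just placeholders.

# Sketch only: look for a URL in each segment's dump output.
# Assumes "bin/nutch segread -dump" exists here and dumps to stdout.
SEGMENTS_DIR=/path/to/crawl/segments
TARGET_URL="http://example.com/some-document.html"

for seg in "${SEGMENTS_DIR}"/*; do
    if bin/nutch segread -dump "$seg" 2>/dev/null | grep -q "$TARGET_URL"; then
        echo "found in segment: $seg"
    fi
done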
Thanks
Jed
[1]
# Query the Nutch search page for "$1" and print the hit count, i.e. the
# 7th field of the "Hits 1 to N of M ..." line in the returned HTML.
function searchHit()
{
    wget -O- --quiet "${NUTCH}${1}" \
        | awk '/Hits 1 to/ {print $7}'
}
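
For context, this is roughly how I drove that function to count misses.
docnames.txt is a hypothetical file with one parsed document name per
line, and the NUTCH value is just a placeholder for the search front
end's query URL; a query whose results page has no "Hits 1 to" line
counts as a miss.

# Sketch of the miss count. docnames.txt and the NUTCH URL are placeholders.
NUTCH="http://localhost:8080/search.jsp?query="
misses=0
while read -r name; do
    hits=$(searchHit "$name")
    # searchHit prints nothing when the results page has no "Hits 1 to" line
    if [ -z "$hits" ]; then
        misses=$((misses + 1))
    fi
done < docnames.txt
echo "misses: $misses"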
--
Jed Reynolds
System Administrator, PRWeb International, Inc. 360-312-0892