I ran the following crawl:
$ bin/nutch crawl /usr/tmp/urls.txt -dir /usr/tmp/85sites -threads 20 -depth 10
-topN 103103
From my understanding that's a little over a million potential documents (10 x
103103 = 1031030). I redirected the console output to a file (using nohup,
omitted above for clarity) and got the following:
$ grep "fetching http" nohup.out | wc -l
766306
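One caveat I realize I should rule out: some of those "fetching" lines may be
retries or redirects of the same URL rather than distinct pages. As a rough
check (this is only a sketch, and it assumes the URL is the last field on each
"fetching" line in my fetcher output), I could count distinct URLs instead of
raw lines:
$ grep "fetching http" nohup.out | awk '{print $NF}' | sort -u | wc -l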
That's about 3/4 of a million fetch attempts. Then, within Luke, I examined the
index after it was created:
Number of Documents: 445834
So 766306 - 445834 = 320472 documents are unaccounted for.
How did I lose 320,000 documents? Did dedup do this? I have another corpus
where grep "fetching http" gave me 372389 lines and Luke reported only 132000
documents--nearly a 3-to-1 discrepancy. Is this normal?
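In case it helps with diagnosis, the next thing I plan to try is comparing
these numbers against the crawl db statistics (counts of fetched, unfetched,
and gone pages). This is just a sketch; I'm assuming the crawl db ended up at
/usr/tmp/85sites/crawldb under the -dir I gave above and that readdb accepts
-stats in my Nutch version:
$ bin/nutch readdb /usr/tmp/85sites/crawldb -stats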