I ran the following crawl:
$ bin/nutch crawl /usr/tmp/urls.txt -dir /usr/tmp/85sites -threads 20 -depth 10
-topN 103103
From my understanding that's a little over a million potential documents (10 x
103103 = 1031030). I redirected the console output to a file (using nohup,
omitted above for clarity) and got the following:
$ grep "fetching http" nohup.out | wc -l
766306
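One caveat I realize I should rule out: some of those "fetching" lines may be
retries or redirects of the same URL rather than distinct pages. As a rough
check (this is only a sketch, and it assumes the URL is the last field on each
"fetching" line in my fetcher output), I could count distinct URLs instead of
raw lines:
$ grep "fetching http" nohup.out | awk '{print $NF}' | sort -u | wc -l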
That's about 3/4 of a million fetch attempts. Then, within Luke, I examined the
index after it was created:
Number of Documents: 445834
So 766306 - 445834 = 320472 documents are unaccounted for.
How did I lose 320,000 documents? Did dedup do this? I have another corpus
where grep "fetching http" gave me 372389 lines and Luke reported only 132000
documents--nearly a 3-to-1 discrepancy. Is this normal?
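In case it helps with diagnosis, the next thing I plan to try is comparing
these numbers against the crawl db statistics (counts of fetched, unfetched,
and gone pages). This is just a sketch; I'm assuming the crawl db ended up at
/usr/tmp/85sites/crawldb under the -dir I gave above and that readdb accepts
-stats in my Nutch version:
$ bin/nutch readdb /usr/tmp/85sites/crawldb -stats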