Solr does no deduplication out-of-the-box. You have to check the CrawlDB via 
readdb -stats to see how many documents actually have status db_fetched. You may 
have a lot of redirects, errors, 404s, etc. 
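
For example, assuming the default layout where the CrawlDB lives under 
crawl/crawldb (adjust the path to your own crawl directory):

    bin/nutch readdb crawl/crawldb -stats

The per-status counts in the output (db_fetched, db_gone, db_redir_temp, 
db_redir_perm, and so on) show how many of the 47,580 records were actually 
fetched successfully; only those are candidates for indexing.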

-----Original message-----
From: Prashant Shekar <[email protected]>
Sent: Wednesday 26th November 2014 19:33
To: [email protected]
Subject: Query about indexing crawled data from Nutch to Solr

Hi,

I have a question about how raw crawled data from Nutch gets indexed into Solr. 
We crawled the Acadis dataset using Nutch, and it retrieved 47,580 files. 
However, while indexing these files into Solr, only 2,929 of the documents were 
actually indexed. I have two questions:

1) What could be the reasons that only 2,929 out of 47,580 files were actually 
indexed in Solr? Does Solr do some deduplication on its end that Nutch does not?

2) While checking, I found that there were 12,201 unique URLs. We used the URL 
as the key for Solr indexing. So, if there were no errors while indexing into 
Solr, can the number of indexed documents still be less than 12,201?
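
As a sanity check, the actual indexed count can be read straight from Solr; the 
host, port, and core name below assume a default local single-core setup:

    curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json"

The numFound value in the response is the number of documents currently in the 
index.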

Thanks,
Prasanth Iyer

