Query about indexing crawled data from Nutch to Solr

Prashant Shekar Wed, 26 Nov 2014 10:34:04 -0800

Hi,

I have a question about how raw crawled data from Nutch is indexed
into Solr. We crawled the Acadis dataset using Nutch, and it retrieved
47,580 files. However, when we indexed these files into Solr, only 2,929
of them were actually indexed. I have two questions:


1) What could be the reasons why only 2,929 out of 47,580 files were
actually indexed in Solr? Does Solr do some deduplication on its end that
Nutch does not?

2) When I counted the unique URLs among the crawled files, I found 12,201.
We used the URL as the unique key for Solr indexing. So, if there were no
errors while indexing to Solr, can the number of indexed files still be
less than 12,201?
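
To make question 2 concrete, here is a minimal Python sketch of how I
understand keying on the URL. It is not Nutch or Solr code: the function
name and the sample URLs are hypothetical, and it only models a store in
which a later document with the same key replaces the earlier one.

```python
from collections import Counter


def summarize_keys(urls):
    """Count total records, unique keys, and keys that occur more than once."""
    counts = Counter(urls)
    index = {}
    for position, url in enumerate(urls):
        # Assumption: a later record with the same key replaces the earlier one.
        index[url] = position
    return {
        "total": len(urls),
        "unique": len(counts),
        "indexed": len(index),
        "duplicated": sum(1 for c in counts.values() if c > 1),
    }


# Hypothetical sample: three crawled records, two distinct URLs.
sample = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/a",
]
print(summarize_keys(sample))
```

Under that assumption, the number of stored entries equals the number of
unique URLs, which is why I expected 12,201 rather than 2,929.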

Thanks,
Prasanth Iyer
