Hi Prashant,

What version of Nutch are you using?

Regards,
Alfonso Nishikawa

2014-11-26 19:33 GMT+01:00 Prashant Shekar <[email protected]>:

> Hi,
>
> I had a question about how data from raw crawled data from Nutch is
> indexed into Solr. We crawled the Acadis dataset using Nutch and there were
> 47,580 files that it retrieved. However, while indexing these files into
> Solr, only 2929 of these documents were actually indexed. I had 2 questions:
>
> 1) What can be the reasons why only 2929 out of 47,580 files were actually
> indexed in Solr? Does Solr do some deduplication on its end that Nutch does
> not?
>
> 2) While checking the number of unique URLs, I found that there were
> 12,201 unique URLs. We had used the URL as a key for Solr indexing. So, if
> there were no errors while indexing to Solr, can the number of indexed
> files still be less than 12,201?
>
> Thanks,
> Prasanth Iyer
>

