Hi, Prasanth. Thanks for the information. Sadly, I only know Nutch 2.2.1 / 2.3-SNAPSHOT. Although I don't know Solr, I was looking for any other type of bug, especially in 2.3-SNAPSHOT.
I hope someone here on the list can help you more than I can :) Thanks!

Alfonso

2014-11-27 2:34 GMT+01:00 Prashant Shekar <[email protected]>:

> Hi Alfonso,
>
> I was using Nutch 1.9. I did check the crawldb stats and this is what it
> says:
>
> CrawlDb statistics start: crawl/crawldb/
> Statistics for CrawlDb: crawl/crawldb/
> TOTAL urls: 20387
> retry 0: 19483
> retry 1: 845
> retry 2: 59
> min score: 0.0
> avg score: 1.12277434E-4
> max score: 1.012
> status 1 (db_unfetched): 1785
> *status 2 (db_fetched): 3542*
> *status 3 (db_gone): 10705*
> status 4 (db_redir_temp): 3674
> status 5 (db_redir_perm): 681
> CrawlDb statistics: done
>
> So, I believe only 3542 files are actually being fetched. Let me check the
> reason why the site is not allowing us to crawl the rest of the files.
>
> Thanks for all your help,
> Prasanth Iyer
>
> On Wed, Nov 26, 2014 at 5:29 PM, Alfonso Nishikawa <
> [email protected]> wrote:
>
>> Hi, Prashant,
>>
>> What version of Nutch are you using?
>>
>> Regards,
>>
>> Alfonso Nishikawa
>>
>> 2014-11-26 19:33 GMT+01:00 Prashant Shekar <[email protected]>:
>>
>>> Hi,
>>>
>>> I had a question about how data from raw crawled data from Nutch is
>>> indexed into Solr. We crawled the Acadis dataset using Nutch and there were
>>> 47,580 files that it retrieved. However, while indexing these files into
>>> Solr, only 2929 of these documents were actually indexed. I had 2 questions:
>>>
>>> 1) What can be the reasons why only 2929 out of 47,580 files were
>>> actually indexed in Solr? Does Solr do some deduplication on its end that
>>> Nutch does not?
>>>
>>> 2) While checking the number of unique URLs, I found that there were
>>> 12,201 unique URLs. We had used the URL as a key for Solr indexing. So, if
>>> there were no errors while indexing to Solr, can the number of indexed
>>> files still be less than 12,201?
>>>
>>> Thanks,
>>> Prasanth Iyer
>>>
>>
>>
>
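
For anyone reading this thread later, the CrawlDb statistics quoted above can be sanity-checked with a few lines of Python. This is only a sketch: the numbers are copied from the quoted output, while the names `status_counts`, `total_urls` and `fetched_share` are made up for illustration and are not part of Nutch.

```python
# Status counts copied from the CrawlDb statistics quoted in this thread.
# The variable and function names are illustrative, not Nutch identifiers.
status_counts = {
    "db_unfetched": 1785,
    "db_fetched": 3542,
    "db_gone": 10705,
    "db_redir_temp": 3674,
    "db_redir_perm": 681,
}
total_urls = 20387  # the "TOTAL urls" line of the same output


def fetched_share(counts, total):
    """Return the fraction of CrawlDb entries whose status is db_fetched."""
    # Every URL has exactly one status, so the counts should add up to the total.
    if sum(counts.values()) != total:
        raise ValueError("status counts do not add up to the reported total")
    return counts["db_fetched"] / total


share = fetched_share(status_counts, total_urls)
print(f"db_fetched: {status_counts['db_fetched']} of {total_urls} URLs ({share:.1%})")
```

The status counts do add up to the reported total, so the only entries that were fetched are the 3542 with status db_fetched; the db_gone and redirect entries account for most of the rest.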

