Re: Query about indexing crawled data from Nutch to Solr

Alfonso Nishikawa Wed, 26 Nov 2014 17:31:29 -0800

Hi, Prashant,

What version of Nutch are you using?


Regards,

Alfonso Nishikawa

2014-11-26 19:33 GMT+01:00 Prashant Shekar <[email protected]>:

> Hi,
>
> I have a question about how raw crawled data from Nutch is indexed into
> Solr. We crawled the Acadis dataset using Nutch, and it retrieved 47,580
> files. However, when we indexed these files into Solr, only 2,929 of them
> were actually indexed. I have 2 questions:
>
> 1) Why might only 2,929 of the 47,580 files have been indexed in Solr?
> Does Solr perform some deduplication on its end that Nutch does not?
>
> 2) When I counted the unique URLs, I found 12,201. We used the URL as the
> key for Solr indexing. So, if there were no errors while indexing into
> Solr, can the number of indexed files still be less than 12,201?
>
> Thanks,
> Prasanth Iyer
>
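
A rough illustration of the unique-key point raised in question 2, offered only as a sketch: it is plain Python, not Nutch or Solr code, and the function name index_by_key and the sample records are hypothetical. It assumes an index in which a document added under an existing key replaces the earlier one, so the number of stored documents cannot exceed the number of distinct keys.

```python
def index_by_key(records, key="url"):
    """Simulate an index where a later document with the same key
    replaces the earlier one (hypothetical helper, not a Solr API)."""
    index = {}
    for record in records:
        index[record[key]] = record  # same key: overwrite, not append
    return index


# Hypothetical fetched records: 5 fetches, but only 3 distinct URLs.
records = [
    {"url": "http://example.org/a", "fetch": 1},
    {"url": "http://example.org/b", "fetch": 2},
    {"url": "http://example.org/a", "fetch": 3},
    {"url": "http://example.org/c", "fetch": 4},
    {"url": "http://example.org/b", "fetch": 5},
]

index = index_by_key(records)
print(len(records), "records fetched,", len(index), "documents indexed")
```

Under that assumption the unique-URL count is only an upper bound. Any record that is dropped before it is added would lower the indexed count further, which this sketch does not model.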
