Hi, Prasanth.

Thanks for the information. Sadly, I only know about 2.2.1 / 2.3-SNAPSHOT,
and although I don't know Solr, I was looking for some other type of bug,
especially in 2.3-SNAPSHOT.

I hope someone here on the list can help you more than I can :)
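
In case it is useful while you investigate, here is a minimal Python sketch
(my own, not part of Nutch) that parses the CrawlDb statistics you pasted and
checks that the per-status and per-retry counts add up to the total. The names
parse_stats and STATS are only illustrative assumptions; the numbers are
copied from your message.

```python
# Sanity check of the CrawlDb statistics quoted in this thread.
# parse_stats and STATS are hypothetical names used only for this sketch.

STATS = """\
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:    20387
retry 0:    19483
retry 1:    845
retry 2:    59
min score:    0.0
avg score:    1.12277434E-4
max score:    1.012
status 1 (db_unfetched):    1785
*status 2 (db_fetched):    3542*
*status 3 (db_gone):    10705*
status 4 (db_redir_temp):    3674
status 5 (db_redir_perm):    681
CrawlDb statistics: done
"""


def parse_stats(text):
    """Return {label: number} for every 'label: number' line in text."""
    stats = {}
    for line in text.splitlines():
        # Drop surrounding whitespace and the '*' emphasis markers from the mail.
        cleaned = line.strip().strip("*")
        label, sep, value = cleaned.rpartition(":")
        if not sep:
            continue
        try:
            stats[label.strip()] = float(value)
        except ValueError:
            # Non-numeric lines such as 'CrawlDb statistics: done' are skipped.
            continue
    return stats


stats = parse_stats(STATS)
total = int(stats["TOTAL urls"])
by_status = {k: int(v) for k, v in stats.items() if k.startswith("status ")}
by_retry = {k: int(v) for k, v in stats.items() if k.startswith("retry ")}

# Both breakdowns should account for every URL in the CrawlDb.
print("status counts sum to total:", sum(by_status.values()) == total)
print("retry counts sum to total:", sum(by_retry.values()) == total)

# Share of each status, to see where most URLs ended up.
for label, count in sorted(by_status.items()):
    print(f"{label}: {count} ({count / total:.1%})")
```

If both checks hold, the fetched count is consistent with the rest of the
output, and the db_gone and redirect entries are the ones worth looking at.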

Thanks!

Alfonso

2014-11-27 2:34 GMT+01:00 Prashant Shekar <[email protected]>:

> Hi Alfonso,
>
> I was using Nutch 1.9. I checked the crawldb stats, and this is what they
> show:
>
> CrawlDb statistics start: crawl/crawldb/
> Statistics for CrawlDb: crawl/crawldb/
> TOTAL urls:    20387
> retry 0:    19483
> retry 1:    845
> retry 2:    59
> min score:    0.0
> avg score:    1.12277434E-4
> max score:    1.012
> status 1 (db_unfetched):    1785
> *status 2 (db_fetched):    3542*
> *status 3 (db_gone):    10705*
> status 4 (db_redir_temp):    3674
> status 5 (db_redir_perm):    681
> CrawlDb statistics: done
>
> So, I believe only 3542 files are actually being fetched. Let me check the
> reason why the site is not allowing us to crawl the rest of the files.
>
> Thanks for all your help,
> Prasanth Iyer
>
> On Wed, Nov 26, 2014 at 5:29 PM, Alfonso Nishikawa <
> [email protected]> wrote:
>
>> Hi, Prashant,
>>
>> What version of Nutch are you using?
>>
>> Regards,
>>
>> Alfonso Nishikawa
>>
>> 2014-11-26 19:33 GMT+01:00 Prashant Shekar <[email protected]>:
>>
>>> Hi,
>>>
>>> I had a question about how raw crawled data from Nutch is indexed into
>>> Solr. We crawled the Acadis dataset using Nutch, and it retrieved 47,580
>>> files. However, when we indexed these files into Solr, only 2929 of the
>>> documents were actually indexed. I had 2 questions:
>>>
>>> 1) What can be the reasons why only 2929 out of 47,580 files were
>>> actually indexed in Solr? Does Solr do some deduplication on its end that
>>> Nutch does not?
>>>
>>> 2) While checking the number of unique URLs, I found that there were
>>> 12,201 unique URLs. We had used the URL as a key for Solr indexing. So, if
>>> there were no errors while indexing to Solr, can the number of indexed
>>> files still be less than 12,201?
>>>
>>> Thanks,
>>> Prasanth Iyer
>>>
>>
>>
>
