[ 
https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3010.
------------------------------------
    Resolution: Fixed

> Injector: count unique number of injected URLs
> ----------------------------------------------
>
>                 Key: NUTCH-3010
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3010
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> Injector uses two counters: one for the total number of injected URLs, the
> other for the number of URLs "merged", that is, already in CrawlDb. There is
> no counter for the number of unique URLs injected, which may lead to wrong
> counts if the seed files contain duplicates.
> Consider the following seed file, which contains a duplicated URL:
> {noformat}
> $> cat seeds_with_duplicates.txt 
> https://www.example.org/page1.html
> https://www.example.org/page2.html
> https://www.example.org/page2.html
> $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
> ...
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
> rejected by filters: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
> injected after normalization and filtering: 3
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls 
> injected but already in CrawlDb: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls 
> injected: 3
> ...
> {noformat}
> However, because of the duplicated URL, only 2 URLs were injected into the 
> CrawlDb:
> {noformat}
> $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
> ...
> 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
> ...
> {noformat}
> If the Injector job is run again with the same input, it erroneously
> reports that one "new URL" was still injected:
> {noformat}
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
> rejected by filters: 0
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls 
> injected after normalization and filtering: 3
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls 
> injected but already in CrawlDb: 2
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls 
> injected: 1
> {noformat}
> This is because the urls_merged counter counts unique items, while
> urls_injected does not, and the number shown is the difference between the
> two counters.
> Adding a counter for the number of unique injected URLs makes it possible
> to report the correct count of newly injected URLs.
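
The counter arithmetic described above can be reproduced with a small
stand-alone sketch. This is not the Nutch Injector code or the committed
fix: the class InjectCounterSketch, the Counts fields and the inject()
helper are hypothetical names, and a plain in-memory Set stands in for the
CrawlDb. It only illustrates why subtracting a unique count (merged) from a
non-unique count (injected) overstates the number of new URLs, and how a
unique-injected counter would correct it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: models the counters from the issue, not Nutch's Injector.
class InjectCounterSketch {

    /** Counter values of one simulated inject run (hypothetical names). */
    static final class Counts {
        final long injected;       // seed entries after filtering, duplicates included
        final long injectedUnique; // distinct seed URLs (the proposed new counter)
        final long merged;         // distinct seed URLs already present in the CrawlDb

        Counts(long injected, long injectedUnique, long merged) {
            this.injected = injected;
            this.injectedUnique = injectedUnique;
            this.merged = merged;
        }

        /** "New URLs" as the difference of a non-unique and a unique counter. */
        long newUrlsOld() {
            return injected - merged;
        }

        /** "New URLs" when both operands count unique URLs. */
        long newUrlsFixed() {
            return injectedUnique - merged;
        }
    }

    /** Simulates one inject run against an in-memory stand-in for the CrawlDb. */
    static Counts inject(Set<String> crawlDb, List<String> seeds) {
        Set<String> unique = new HashSet<>(seeds);
        long merged = 0;
        for (String url : unique) {
            if (crawlDb.contains(url)) {
                merged++;
            }
        }
        crawlDb.addAll(unique);
        return new Counts(seeds.size(), unique.size(), merged);
    }

    public static void main(String[] args) {
        // Same seed list as seeds_with_duplicates.txt in the description.
        List<String> seeds = Arrays.asList(
            "https://www.example.org/page1.html",
            "https://www.example.org/page2.html",
            "https://www.example.org/page2.html");
        Set<String> crawlDb = new HashSet<>();

        Counts first = inject(crawlDb, seeds);
        Counts second = inject(crawlDb, seeds);

        System.out.println("run 1: old=" + first.newUrlsOld()
            + " fixed=" + first.newUrlsFixed() + " db=" + crawlDb.size());
        System.out.println("run 2: old=" + second.newUrlsOld()
            + " fixed=" + second.newUrlsFixed() + " db=" + crawlDb.size());
    }
}
```

With the seed list from the description, the old difference gives 3 and
then 1 new URLs for the two runs, matching the log output quoted above,
while the unique-based difference gives 2 and then 0, matching the CrawlDb
total of 2.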



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
