[
https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771020#comment-17771020
]
ASF GitHub Bot commented on NUTCH-3010:
---------------------------------------
sebastian-nagel merged PR #783:
URL: https://github.com/apache/nutch/pull/783
> Injector: count unique number of injected URLs
> ----------------------------------------------
>
> Key: NUTCH-3010
> URL: https://issues.apache.org/jira/browse/NUTCH-3010
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> Injector uses two counters: one for the total number of injected URLs, the
> other for the number of URLs "merged", that is, already in CrawlDb. There is
> no counter for the number of unique URLs injected, which may lead to wrong
> counts if the seed files contain duplicates.
> Suppose the following seed file, which contains a duplicated URL, is injected:
> {noformat}
> $> cat seeds_with_duplicates.txt
> https://www.example.org/page1.html
> https://www.example.org/page2.html
> https://www.example.org/page2.html
> $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
> ...
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls
> rejected by filters: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls
> injected after normalization and filtering: 3
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls
> injected but already in CrawlDb: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls
> injected: 3
> ...
> {noformat}
> However, because of the duplicated URL, only 2 URLs were injected into the
> CrawlDb:
> {noformat}
> $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
> ...
> 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls: 2
> ...
> {noformat}
> If the Injector job is run again with the same input, the output erroneously
> reports that one "new URL" was still injected:
> {noformat}
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls
> rejected by filters: 0
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls
> injected after normalization and filtering: 3
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls
> injected but already in CrawlDb: 2
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls
> injected: 1
> {noformat}
> This is because the urls_merged counter counts unique items, while
> urls_injected does not, and the number shown is the difference between the
> two counters.
> Adding a counter for the number of unique injected URLs will make it
> possible to report the correct count of newly injected URLs.
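
The counter arithmetic described above can be reproduced with a small
simulation. This is only a sketch in Python, not the Injector code: the
function inject_counts, the key urls_injected_unique and the keys
new_reported / new_actual are hypothetical names chosen for illustration, and
it assumes no URL is rejected or rewritten by filters and normalizers.

```python
def inject_counts(seeds, crawldb):
    """Simulate one inject run (illustrative only, not Nutch code).

    seeds   -- list of seed URLs, possibly containing duplicates
    crawldb -- set of URLs already present in the CrawlDb
    Returns the counters and the CrawlDb contents after the run.
    """
    unique = set(seeds)                    # one entry per distinct seed URL
    urls_injected = len(seeds)             # counts every seed line, duplicates included
    urls_merged = len(unique & crawldb)    # distinct seed URLs already in CrawlDb
    counters = {
        "urls_injected": urls_injected,
        "urls_injected_unique": len(unique),             # hypothetical counter name
        "urls_merged": urls_merged,
        "new_reported": urls_injected - urls_merged,     # difference as currently logged
        "new_actual": len(unique) - urls_merged,         # difference using unique URLs
    }
    return counters, crawldb | unique


seeds = [
    "https://www.example.org/page1.html",
    "https://www.example.org/page2.html",
    "https://www.example.org/page2.html",
]

first_run, db = inject_counts(seeds, set())
second_run, db = inject_counts(seeds, db)
print(first_run)
print(second_run)
```

With the seed file from the description, the first run gives new_reported = 3
while only 2 URLs end up in the simulated CrawlDb, and the second run gives
new_reported = 1 although nothing new was added. The difference based on
unique URLs is 2 and 0, respectively.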