Sebastian Nagel created NUTCH-3010:
--------------------------------------
Summary: Injector: count unique number of injected URLs
Key: NUTCH-3010
URL: https://issues.apache.org/jira/browse/NUTCH-3010
Project: Nutch
Issue Type: Improvement
Components: injector
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.20
Injector uses two counters: one for the total number of injected URLs, the
other for the number of URLs "merged", that is, already in CrawlDb. There is no
counter for the number of unique URLs injected, which may lead to wrong counts
if the seed files contain duplicates.
Take the following seed file, which contains a duplicated URL:
{noformat}
$> cat seeds_with_duplicates.txt
https://www.example.org/page1.html
https://www.example.org/page2.html
https://www.example.org/page2.html
$> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
...
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3
...
{noformat}
However, because of the duplicated URL, only 2 URLs were injected into the
CrawlDb:
{noformat}
$> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
...
2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls: 2
...
{noformat}
If the Injector job is run again with the same input, the output erroneously
reports that one "new URL" was still injected:
{noformat}
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1
{noformat}
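This is because the urls_merged counter counts unique items, while urls_injected
counts every seed record, duplicates included. The number of new URLs is
reported as the difference of the two, so each duplicate inflates it by one.
Adding a counter for the number of unique injected URLs makes it possible to
report the correct count of newly injected URLs.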
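The arithmetic behind the two runs above can be reproduced with a small
simulation. This is only a sketch of the counting logic, not Injector's actual
code: the CrawlDb is modeled as a plain set, the function inject() and the keys
new_reported and new_actual are hypothetical, and the counter name
urls_injected_unique is an assumption for the proposed counter.

```python
def inject(crawldb, seeds):
    """Simulate the counters of one inject run (hypothetical model).

    crawldb: set of URLs already stored
    seeds:   list of seed URLs, assumed already normalized and filtered
    Returns the counters and the updated CrawlDb.
    """
    unique_seeds = set(seeds)
    urls_injected = len(seeds)                 # every seed record, duplicates included
    urls_injected_unique = len(unique_seeds)   # proposed counter (name is an assumption)
    urls_merged = len(unique_seeds & crawldb)  # counted once per unique URL
    counters = {
        "urls_injected": urls_injected,
        "urls_injected_unique": urls_injected_unique,
        "urls_merged": urls_merged,
        # what is reported today: duplicates inflate the result
        "new_reported": urls_injected - urls_merged,
        # what the unique counter would allow reporting
        "new_actual": urls_injected_unique - urls_merged,
    }
    return counters, crawldb | unique_seeds


seeds = [
    "https://www.example.org/page1.html",
    "https://www.example.org/page2.html",
    "https://www.example.org/page2.html",
]

first, crawldb = inject(set(), seeds)
second, crawldb = inject(crawldb, seeds)
print(first)
print(second)
```

Under these assumptions the first run gives new_reported = 3 and new_actual = 2,
and the second run gives new_reported = 1 and new_actual = 0, matching the log
output and the readdb statistics shown above.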