[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130178#comment-16130178 ]

Sebastian Nagel commented on NUTCH-2335:
----------------------------------------

Hi Markus,

here is a simple test. I've run it both with 1.x/master and with your version of 
the Injector. The results are the same except for the additional debug output.

{noformat}
$ cat seeds1.txt
http://host1.example.com/allowed.html
http://host1.example.com/forbidden.html

$ cat seeds2.txt
http://host2.example.com/allowed.html
http://host2.example.com/forbidden.html

# a trivial URL filter: reject URLs matching "forbidden", keep URLs matching "allow"
$ cat $NUTCH_HOME/conf/test-filter.txt 
-forbidden
+allow

$ nutch inject -Dplugin.includes=urlfilter-regex \
    -Durlfilter.regex.file=test-filter.txt crawldb seeds1.txt -noNormalize -noFilter
Injector: starting at 2017-08-17 11:41:11
Injector: crawlDb: crawldb
Injector: urlDir: seeds1.txt
Injector: Converting injected urls to crawl db entries.
normalize/filter: http://host1.example.com/allowed.html
normalize/filter: http://host1.example.com/forbidden.html
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 2
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 2
Injector: finished at 2017-08-17 11:41:13, elapsed: 00:00:02
{noformat}

=> filters are not applied to URLs from seeds1.txt; both URLs are injected.

Note that the log message only indicates that URLs are passed to the filterNormalize 
method; the filters themselves are not active (filters == null).
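
For illustration, the logic inside filterNormalize is roughly the following (a 
simplified sketch, not the literal code; field and method names may differ 
slightly):

{noformat}
// Simplified sketch of the Injector's filterNormalize method. With
// -noNormalize/-noFilter the fields urlNormalizers/filters are null, so the
// URL passes through unchanged, although the method is still called (and the
// debug line "normalize/filter: <url>" is still printed).
protected String filterNormalize(String url) {
  if (url != null) {
    try {
      if (urlNormalizers != null)
        url = urlNormalizers.normalize(url, scope);
      if (filters != null)
        url = filters.filter(url); // returns null if the URL is rejected
    } catch (Exception e) {
      url = null;
    }
  }
  return url;
}
{noformat}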

Now inject the second seed list with filters for seed URLs enabled (by default):

{noformat}
$ nutch inject -Dplugin.includes=urlfilter-regex \
    -Durlfilter.regex.file=test-filter.txt crawldb seeds2.txt
Injector: starting at 2017-08-17 11:41:41
Injector: crawlDb: crawldb
Injector: urlDir: seeds2.txt
Injector: Converting injected urls to crawl db entries.
normalize/filter: http://host2.example.com/allowed.html
normalize/filter: http://host2.example.com/forbidden.html
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 1
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2017-08-17 11:41:43, elapsed: 00:00:01

$ nutch readdb crawldb -stats
CrawlDb statistics start: crawldb
Statistics for CrawlDb: crawldb
TOTAL urls:     3
...
{noformat}

As expected, filters are only applied to URLs from seeds2.txt; the two items 
already in the CrawlDb are left untouched.
I've verified the content of the CrawlDb via readdb -dump:
-  http://host1.example.com/allowed.html
-  http://host1.example.com/forbidden.html   (still in CrawlDb although "forbidden" by the filters)
-  http://host2.example.com/allowed.html

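For the record, the dump was produced with something along these lines (the output 
directory name is arbitrary, and the name of the part file depends on the Hadoop 
version):

{noformat}
$ nutch readdb crawldb -dump crawldb-dump -format normal
$ cat crawldb-dump/part-*
{noformat}
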

Inject seeds2.txt again, but this time also filter the existing items in the CrawlDb:

{noformat}
$ nutch inject -Dplugin.includes=urlfilter-regex \
    -Durlfilter.regex.file=test-filter.txt crawldb seeds2.txt -filterNormalizeAll
Injector: starting at 2017-08-17 11:42:20
Injector: crawlDb: crawldb
Injector: urlDir: seeds2.txt
Injector: Converting injected urls to crawl db entries.
normalize/filter: http://host1.example.com/allowed.html
normalize/filter: http://host1.example.com/forbidden.html
normalize/filter: http://host2.example.com/allowed.html
normalize/filter: http://host2.example.com/allowed.html
normalize/filter: http://host2.example.com/forbidden.html
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 1
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 1
Injector: Total new urls injected: 0
Injector: finished at 2017-08-17 11:42:22, elapsed: 00:00:01
{noformat}

=> filters are applied to all 5 URLs (3 from the CrawlDb, 2 from seeds2.txt; note 
that http://host2.example.com/allowed.html is checked twice, once as a CrawlDb 
entry and once as a seed URL), and the CrawlDb now contains only two items:

{noformat}
$ nutch readdb crawldb -stats
CrawlDb statistics start: crawldb
Statistics for CrawlDb: crawldb
TOTAL urls:     2
...
{noformat}

I've verified that http://host1.example.com/forbidden.html was removed from the CrawlDb.

What I definitely agree with is that we should add a counter for "removed" items. 
Pull request in progress...
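
The counter itself should be a one-liner at the point where an existing CrawlDb 
entry is dropped, along these lines (group and counter names below are 
placeholders, not necessarily what the pull request will use):

{noformat}
// Sketch only: increment a job counter whenever an existing CrawlDb entry is
// dropped because filterNormalize() returned null for its URL.
context.getCounter("injector", "urls_purged_from_crawldb").increment(1);
{noformat}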


> Injector not to filter and normalize existing URLs in CrawlDb
> -------------------------------------------------------------
>
>                 Key: NUTCH-2335
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2335
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.12
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>             Fix For: 1.14
>
>         Attachments: Injector.java
>
>
> With NUTCH-1712 the behavior of the Injector changed when new URLs are added 
> to an existing CrawlDb:
> - before, only injected URLs were filtered and normalized
> - now, filters and normalizers are applied to all URLs, including those already 
> in the CrawlDb
> The default should be, as before, not to filter existing URLs. Filtering and 
> normalizing may take a long time for large CrawlDbs and/or complex URL filters. 
> If URL filter or normalizer rules have not changed, there is no need to apply 
> them anew every time new URLs are added. Of course, injected URLs should still 
> be filtered and normalized by default.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
