[ https://issues.apache.org/jira/browse/NUTCH-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613196#comment-16613196 ]
Sebastian Nagel commented on NUTCH-2644:
----------------------------------------

Hi [~yossi],

I've seen your deleted comment in my mailbox:
??Isn't this a much wider issue???

I was also worried about that. The error pattern points to NUTCH-2375, so it could recur in other job implementations as well. I searched for similar patterns with
{noformat}
git grep -B30 -Ei '(conf(ig(ugartion)?)?|job)\.set(Boolean|Long|Int|Float)?\(' src/java/
{noformat}
and found that at least the "core jobs" either access the job config (returned by {{job.getConfiguration()}}) or modify the config before creating a job. However, when repeating the search just now, I found further issues of the same type in the webgraph jobs (I'll update the PR).

Thanks for making me look at it again, and thanks for the careful review. Maybe you know a better way to search for similar error patterns?

> CrawlDbReader -dump ignores filter options
> ------------------------------------------
>
>                 Key: NUTCH-2644
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2644
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> The CrawlDbReader ignores the filter options -status and -expr when dumping a crawldb:
> {noformat}
> % bin/nutch readdb crawldb/ -dump cdb.dump -status 'db_fetched' -expr 'status == "db_fetched"'
> ...
> % grep '^Status:' cdb.dump/part-r-00000 | sort | uniq -c
>      10 Status: 1 (db_unfetched)
>      28 Status: 2 (db_fetched)
>       1 Status: 3 (db_gone)
>       1 Status: 4 (db_redir_temp)
>       3 Status: 7 (db_duplicate)
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
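
For readers unfamiliar with the error pattern discussed in the comment, here is a minimal sketch of how it can arise. It assumes that the job takes a copy of the configuration when it is created, so values set on the original object afterwards never reach the job. The classes {{Conf}} and {{Job}} and the property key {{example.status}} are hypothetical stand-ins used only for illustration; they are not the Hadoop or Nutch classes, and the sketch is not the code from the PR.

```java
import java.util.HashMap;
import java.util.Map;

public class JobConfigPitfall {

    /** Hypothetical stand-in for a configuration object: a plain key/value map. */
    static class Conf {
        private final Map<String, String> props = new HashMap<>();

        Conf() {}

        /** Copy constructor: the new object does not share state with the original. */
        Conf(Conf other) {
            props.putAll(other.props);
        }

        void set(String key, String value) {
            props.put(key, value);
        }

        String get(String key) {
            return props.get(key);
        }
    }

    /** Hypothetical stand-in for a job that copies its configuration on creation (assumption). */
    static class Job {
        private final Conf conf;

        private Job(Conf conf) {
            this.conf = new Conf(conf);
        }

        static Job getInstance(Conf conf) {
            return new Job(conf);
        }

        Conf getConfiguration() {
            return conf;
        }
    }

    /** Error pattern: the option is set on the original config after the job was created. */
    static String setAfterJobCreation(String status) {
        Conf conf = new Conf();
        Job job = Job.getInstance(conf);
        conf.set("example.status", status); // too late: the job holds its own copy
        return job.getConfiguration().get("example.status");
    }

    /** Variant 1: set the option on the config returned by the job itself. */
    static String setOnJobConfiguration(String status) {
        Conf conf = new Conf();
        Job job = Job.getInstance(conf);
        job.getConfiguration().set("example.status", status);
        return job.getConfiguration().get("example.status");
    }

    /** Variant 2: set the option before the job is created. */
    static String setBeforeJobCreation(String status) {
        Conf conf = new Conf();
        conf.set("example.status", status);
        Job job = Job.getInstance(conf);
        return job.getConfiguration().get("example.status");
    }

    public static void main(String[] args) {
        System.out.println("set after job creation:   " + setAfterJobCreation("db_fetched"));
        System.out.println("set on job configuration: " + setOnJobConfiguration("db_fetched"));
        System.out.println("set before job creation:  " + setBeforeJobCreation("db_fetched"));
    }
}
```

In this sketch only the first variant loses the value, which would be consistent with filter options being silently ignored. The other two variants correspond to the two usages the comment describes for the "core jobs".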