[ 
https://issues.apache.org/jira/browse/NUTCH-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613196#comment-16613196
 ] 

Sebastian Nagel commented on NUTCH-2644:
----------------------------------------

Hi [~yossi], I saw your deleted comment in my mailbox: ??Isn't this a much 
wider issue??? I was worried about that as well. The error pattern points to 
NUTCH-2375, so it could recur in other job implementations too. I searched 
for similar patterns with
{noformat}
git grep -B30 -Ei '(conf(ig(uration)?)?|job)\.set(Boolean|Long|Int|Float)?\(' src/java/
{noformat}
but found that at least the "core jobs" either access the job config 
(returned by {{job.getConfiguration()}}) or modify the config before creating 
the job. However, when repeating the search just now, I found further issues 
of the same type in the webgraph jobs (I'll update the PR). Thanks for making 
me look at it again, and thanks for the careful review. Maybe you know a 
better way to search for similar error patterns?
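
To make the error pattern concrete, here is a minimal, self-contained sketch. It does not use the Hadoop API: {{Conf}} and {{Job}} are hypothetical stand-ins that only mimic the documented behaviour that a job works on its own copy of the configuration passed at creation time, and the property name {{example.filter.status}} is invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfCopySketch {

    // Hypothetical stand-in for a configuration object (not Hadoop's Configuration).
    static class Conf {
        private final Map<String, String> props = new HashMap<>();

        Conf() {}

        // Copy constructor: later changes to 'other' are not seen by the copy.
        Conf(Conf other) {
            props.putAll(other.props);
        }

        void set(String key, String value) {
            props.put(key, value);
        }

        String get(String key) {
            return props.get(key);
        }
    }

    // Hypothetical stand-in for a job that copies the configuration on creation.
    static class Job {
        private final Conf conf;

        Job(Conf conf) {
            this.conf = new Conf(conf);
        }

        Conf getConfiguration() {
            return conf;
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        Job job = new Job(conf);

        // Error pattern: the original config is modified after the job was created,
        // so the job's own copy never sees the filter option.
        conf.set("example.filter.status", "db_fetched");
        System.out.println(job.getConfiguration().get("example.filter.status"));

        // Working pattern: set the option on the config returned by the job
        // (or set it on 'conf' before creating the job).
        job.getConfiguration().set("example.filter.status", "db_fetched");
        System.out.println(job.getConfiguration().get("example.filter.status"));
    }
}
```

Under these assumptions the first lookup finds nothing and the second returns the value, which is the same symptom as filter options being silently ignored.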

> CrawlDbReader -dump ignores filter options
> ------------------------------------------
>
>                 Key: NUTCH-2644
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2644
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> The CrawlDbReader ignores the filter options -status and -expr when dumping a 
> crawldb:
> {noformat}
> % bin/nutch readdb crawldb/ -dump cdb.dump -status 'db_fetched' -expr 'status == "db_fetched"'
> ...
> % grep '^Status:' cdb.dump/part-r-00000 | sort | uniq -c
>      10 Status: 1 (db_unfetched)
>      28 Status: 2 (db_fetched)
>       1 Status: 3 (db_gone)
>       1 Status: 4 (db_redir_temp)
>       3 Status: 7 (db_duplicate)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
