[ https://issues.apache.org/jira/browse/NUTCH-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996868#comment-15996868 ]
Sebastian Nagel commented on NUTCH-2379:
----------------------------------------

+1 to add $commonOptions where it is missing. That is better than relying on poor default values (e.g., mapreduce.job.reduces = 2).

{quote}
some of the options should be different for different steps
{quote}

That's correct. The steps differ regarding speculative execution, the number of output parts (reducers), and compression (including intermediate map output, mapreduce.map.output.compress). But the optimal settings depend on many parameters: not only the size and hardware of the cluster, but also the size of the CrawlDb and of the segments. (See also the discussion on [user@nutch|https://lists.apache.org/thread.html/03d30a42b8945c38da131ce553de73ce2f2a3628e88891e3d48b00ab@%3Cuser.nutch.apache.org%3E].)

> crawl script dedup's crawldb update is slow
> --------------------------------------------
>
>                 Key: NUTCH-2379
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2379
>             Project: Nutch
>          Issue Type: Bug
>          Components: bin
>    Affects Versions: 1.11
>        Environment: shell
>            Reporter: Michael Coffey
>            Priority: Minor
>
> In the standard crawl script, there is a __bin_nutch updatedb command and, soon after that, a __bin_nutch dedup command. Both of them launch Hadoop jobs with "crawldb /path/to/crawl/db" in their names (in addition to the actual deduplication job).
> In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched by updatedb.
> I notice that the script passes $commonOptions to updatedb but not to dedup. I suspect that the crawldb update launched by dedup may not be compressing its output.
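
A minimal sketch of the change being +1'd here, assuming the usual Nutch 1.x bin/crawl layout (the exact definition of commonOptions and the surrounding lines vary by version, so treat this as illustrative rather than a patch):

{code}
# bin/crawl (sketch): pass the shared Hadoop options to dedup as well, so the
# CrawlDb update it triggers gets the same reducer and compression settings
# as the updatedb step instead of the Hadoop defaults (mapreduce.job.reduces=2).

# shared options, roughly as defined near the top of bin/crawl
commonOptions="-D mapreduce.job.reduces=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true"

# before: __bin_nutch dedup "$CRAWL_PATH"/crawldb
# after:
__bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb
{code}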
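
If the options are later split per step, as suggested in the comment above, the same pattern could be extended with step-specific variables (crawldbOptions and fetchOptions below are hypothetical names; the right values depend on cluster size, CrawlDb size, and segment size):

{code}
# Hypothetical sketch: a smaller common set plus per-step overrides.
commonOptions="-D mapred.child.java.opts=-Xmx1000m -D mapreduce.map.output.compress=true"

# CrawlDb update and dedup: full reducer count, speculative execution off
crawldbOptions="$commonOptions -D mapreduce.job.reduces=$numTasks -D mapreduce.map.speculative=false -D mapreduce.reduce.speculative=false"

# fetching: speculative execution stays off for politeness; reducer count may differ
fetchOptions="$commonOptions -D mapreduce.map.speculative=false -D mapreduce.reduce.speculative=false"

__bin_nutch fetch $fetchOptions "$CRAWL_PATH"/segments/$SEGMENT -noParsing -threads $numThreads
__bin_nutch updatedb $crawldbOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT
__bin_nutch dedup $crawldbOptions "$CRAWL_PATH"/crawldb
{code}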