[
https://issues.apache.org/jira/browse/NUTCH-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996868#comment-15996868
]
Sebastian Nagel commented on NUTCH-2379:
----------------------------------------
+1 to add $commonOptions where it's missing. That's better than falling back to
poor default values (e.g., mapreduce.job.reduces = 2).
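A minimal sketch of the kind of change endorsed here, assuming the bin/crawl
convention of collecting shared Hadoop -D properties in $commonOptions; the
variable names, property values and exact invocations below are illustrative,
not the script's actual code:
{code}
# illustrative excerpt in the style of bin/crawl (names and values are examples)
commonOptions="-D mapreduce.job.reduces=$NUM_REDUCERS \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.speculative=false \
  -D mapreduce.reduce.speculative=false"

# updatedb already receives the shared options ...
__bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$SEGMENT"

# ... passing them to dedup as well keeps its CrawlDb update job from
# falling back to defaults such as mapreduce.job.reduces=2
__bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb
{code}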
{quote}
some of the options should be different for different steps
{quote}
That's correct. The steps differ with respect to speculative execution, the
number of output parts (reducers), and compression (including intermediate map
output, mapreduce.map.output.compress). But the optimal settings depend on many
parameters: not only the size and hardware of the cluster, but also the size of
the CrawlDb and of the segments.
(see also the discussion on
[user@nutch|https://lists.apache.org/thread.html/03d30a42b8945c38da131ce553de73ce2f2a3628e88891e3d48b00ab@%3Cuser.nutch.apache.org%3E])
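One way to handle the per-step differences (a sketch, not taken from the
script) is to keep $commonOptions for the shared settings and append
step-specific -D overrides; with Hadoop's generic option parsing, a later -D
for the same property overrides an earlier one, so step-specific values can
simply be appended:
{code}
# hypothetical per-step tuning layered on top of $commonOptions
updateOptions="$commonOptions -D mapreduce.job.reduces=$UPDATE_REDUCERS"
dedupOptions="$commonOptions -D mapreduce.map.output.compress=true"

__bin_nutch updatedb $updateOptions "$CRAWL_PATH"/crawldb "$SEGMENT"
__bin_nutch dedup $dedupOptions "$CRAWL_PATH"/crawldb
{code}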
> crawl script dedup's crawldb update is slow
> --------------------------------------------
>
> Key: NUTCH-2379
> URL: https://issues.apache.org/jira/browse/NUTCH-2379
> Project: Nutch
> Issue Type: Bug
> Components: bin
> Affects Versions: 1.11
> Environment: shell
> Reporter: Michael Coffey
> Priority: Minor
>
> In the standard crawl script, there is a _bin_nutch updatedb command and,
> soon after it, a _bin_nutch dedup command. Both launch Hadoop jobs with
> "crawldb /path/to/crawl/db" in their names (for dedup, in addition to the
> actual deduplication job).
> In my situation, the "crawldb" job launched by dedup takes twice as long as
> the one launched by updatedb.
> I notice that the script passes $commonOptions to updatedb but not to dedup.
> I suspect that the crawldb update launched by dedup may not be compressing
> its output.