[
https://issues.apache.org/jira/browse/NUTCH-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996868#comment-15996868
]
Sebastian Nagel commented on NUTCH-2379:
----------------------------------------
+1 to add $commonOptions where it's missing. That's better than falling back to
poor default values (e.g., mapreduce.job.reduces = 2).
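A minimal sketch of the kind of change endorsed here, assuming the bin/crawl
convention of collecting shared Hadoop -D properties in $commonOptions; the
variable names, property values and exact invocations below are illustrative,
not the script's actual code:
{code}
# illustrative excerpt in the style of bin/crawl (names and values are examples)
commonOptions="-D mapreduce.job.reduces=$NUM_REDUCERS \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.speculative=false \
  -D mapreduce.reduce.speculative=false"

# updatedb already receives the shared options ...
__bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$SEGMENT"

# ... passing them to dedup as well keeps its CrawlDb update job from
# falling back to defaults such as mapreduce.job.reduces=2
__bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb
{code}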
{quote}
some of the options should be different for different steps
{quote}
That's correct. The steps differ with respect to speculative execution, the
number of output parts (reducers), and compression (including intermediate map
output, mapreduce.map.output.compress). But the optimal settings depend on many
parameters: not only the size and hardware of the cluster, but also the size of
the CrawlDb and of the segments.
(see also the discussion on
[user@nutch|https://lists.apache.org/thread.html/03d30a42b8945c38da131ce553de73ce2f2a3628e88891e3d48b00ab@%3Cuser.nutch.apache.org%3E])
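One way to handle the per-step differences (a sketch, not taken from the
script) is to keep $commonOptions for the shared settings and append
step-specific -D overrides; with Hadoop's generic option parsing, a later -D
for the same property overrides an earlier one, so step-specific values can
simply be appended:
{code}
# hypothetical per-step tuning layered on top of $commonOptions
updateOptions="$commonOptions -D mapreduce.job.reduces=$UPDATE_REDUCERS"
dedupOptions="$commonOptions -D mapreduce.map.output.compress=true"

__bin_nutch updatedb $updateOptions "$CRAWL_PATH"/crawldb "$SEGMENT"
__bin_nutch dedup $dedupOptions "$CRAWL_PATH"/crawldb
{code}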
> crawl script dedup's crawldb update is slow
> --------------------------------------------
>
> Key: NUTCH-2379
> URL: https://issues.apache.org/jira/browse/NUTCH-2379
> Project: Nutch
> Issue Type: Bug
> Components: bin
> Affects Versions: 1.11
> Environment: shell
> Reporter: Michael Coffey
> Priority: Minor
>
> In the standard crawl script, there is a _bin_nutch updatedb command and,
> soon after it, a _bin_nutch dedup command. Both launch Hadoop jobs with
> "crawldb /path/to/crawl/db" in their names (for dedup, in addition to the
> actual deduplication job).
> In my situation, the "crawldb" job launched by dedup takes twice as long as
> the one launched by updatedb.
> I notice that the script passes $commonOptions to updatedb but not to dedup.
> I suspect that the crawldb update launched by dedup may not be compressing
> its output.