[jira] [Commented] (CRUNCH-602) Combiner initialization repeatedly retrieves RT nodes from DistCache, leading to high NN load

Dmitry Goldin (JIRA) Fri, 14 Oct 2016 10:58:19 -0700

    [ 
https://issues.apache.org/jira/browse/CRUNCH-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576009#comment-15576009
 ]


Dmitry Goldin commented on CRUNCH-602:
--------------------------------------

To illustrate the impact of this issue, this is some data from two runs of one
of our larger jobs (on minimally different data):

*Without fix*
{code}
Num Mappers: 12879
Num Reducers: 500

dmi@namenode$ grep '/tmp/crunch-1068439135/p10/COMBINE' hdfs-audit.log* | wc -l
12923584
{code}

*With fix*
{code}
Num Mappers: 12843
Num Reducers: 500

dmi@namenode$ grep '/tmp/crunch-336925217/p9/COMBINE' hdfs-audit.log* | wc -l
4912
{code}

> Combiner initialization repeatedly retrieves RT nodes from DistCache, leading 
> to high NN load
> ---------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-602
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-602
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.12.0, 0.13.0, 0.14.0
>         Environment: Crunch 0.14-SNAPSHOT, CDH5.6.0
>            Reporter: Michael Rose
>            Assignee: Josh Wills
>              Labels: performance
>         Attachments: crunch-602.patch, quickfix_crunch_distcache.2.patch, 
> quickfix_crunch_distcache.patch
>
>
> When running one of our Crunch pipelines, we noticed our NameNode under very 
> heavy load. We run our masters on pretty light hardware, so our NN was 
> sitting at 100% CPU.
> Crunch reads the RTNodes during creation of a CrunchTaskContext. These are 
> created when Mappers and Reducers are created. Importantly, a CrunchCombiner 
> is a subclass of a Reducer, so each mapper will create R combiners where R is 
> the number of reducers and thus R CrunchTaskContexts. Consequently in highly 
> parallel jobs, this means M*R semi-expensive calls to the NameNode.
> In the constructor for CrunchTaskContext, this is the read to the DistCache:
> this.nodes = (List<RTNode>) DistCache.read(conf, path);
> Which then leads to a read into the NN + deserialization.
> For now, we took the overly simplistic approach of caching the results of the 
> DistCache read in a Guava cache. The cache ensures combiners reuse RTNodes 
> with only the overhead of deserialization which is somewhat unavoidable as 
> RTNodes are stateful and not reusable. However, it's not configurable except 
> by modifying code.
> I'll attach the patch, but given that it's not yet configurable I wouldn't 
> call it a "fix available." There may be much better ways of fixing this issue 
> as well -- if you have some guidance I'd be happy to do the legwork on a 
> patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CRUNCH-602) Combiner initialization repeatedly retrieves RT nodes from DistCache, leading to high NN load

Reply via email to