Cyrille Chépélov created TEZ-3113:
-------------------------------------

             Summary: massive increase of run time using PipelinedSorter rather 
than DefaultSorter
                 Key: TEZ-3113
                 URL: https://issues.apache.org/jira/browse/TEZ-3113
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.8.2
         Environment: scalding 0.15-SNAPSHOT per 
https://github.com/twitter/scalding/pull/1446
cascading 3.1.0-wip-54
tez-0.8.2
OpenJDK 8 on AMD64
Hadoop 2.6.0 (YARN, HDFS); Apache distribution
Debian Linux 8
8 * Intel Core i7-3770K 

            Reporter: Cyrille Chépélov


A (fairly complex) scalding DAG that ran fine under tez-0.6.2 now takes an 
extremely long time to complete under tez-0.8.2.

Setting "tez.runtime.sorter.class" back to "LEGACY" (i.e. DefaultSorter) 
restored the expected run time.
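
For reference, a minimal sketch of applying that override programmatically. 
The property name and the "LEGACY" value are the ones quoted above; the 
SorterOverride object, its withLegacySorter helper and the "some.other.key" 
entry are hypothetical names used only for illustration, and the plain Map 
merely stands in for whatever job configuration the framework exposes (an 
assumption, not the actual scalding or Tez API):

{code:scala}
// Hypothetical helper, not part of scalding or Tez.
object SorterOverride {
  // Property name and value as quoted in this report.
  val SorterKey = "tez.runtime.sorter.class"
  val LegacyValue = "LEGACY"

  // Returns a copy of `base` with the sorter forced to LEGACY,
  // leaving every other setting untouched.
  def withLegacySorter(base: Map[String, String]): Map[String, String] =
    base + (SorterKey -> LegacyValue)
}

object SorterOverrideDemo {
  def main(args: Array[String]): Unit = {
    // "some.other.key" is a placeholder for pre-existing settings.
    val patched = SorterOverride.withLegacySorter(Map("some.other.key" -> "value"))
    println(patched(SorterOverride.SorterKey)) // prints LEGACY
  }
}
{code}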

The slowdown can be traced to code of this shape:

{code:scala}
val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset 

x
  .group
  .mapValues(_ => 1L) // one per record; the value itself is ignored
  .sum
  .write(TypedTsvHeader("foo.tsv", ('key, 'count)))
{code}

where the incoming data contains a very large number of distinct keys. With 
PipelinedSorter, several hundred thousand files end up flat in the same 
per-TezChild local temporary directories, and things become very slow (not 
alleging any causality between the two).
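
To put a number on the file count, a sketch along these lines could be run 
against one of those directories. FlatFileCount is a hypothetical helper that 
uses only java.nio.file; the directory to inspect depends on the local YARN 
configuration, so it is passed as an argument rather than assumed here:

{code:scala}
import java.nio.file.{Files, Path, Paths}

// Hypothetical diagnostic helper, not part of Tez.
object FlatFileCount {
  // Counts regular files directly inside `dir`, without descending
  // into subdirectories.
  def countFlatFiles(dir: Path): Long = {
    val entries = Files.newDirectoryStream(dir)
    try {
      var n = 0L
      val it = entries.iterator()
      while (it.hasNext) {
        if (Files.isRegularFile(it.next())) n += 1
      }
      n
    } finally entries.close()
  }

  // Usage: pass the directory to inspect as the first argument.
  def main(args: Array[String]): Unit =
    println(countFlatFiles(Paths.get(args(0))))
}
{code}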




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
