Cyrille Chépélov created TEZ-3113:
-------------------------------------
Summary: massive increase of run time using PipelinedSorter rather
than DefaultSorter
Key: TEZ-3113
URL: https://issues.apache.org/jira/browse/TEZ-3113
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.8.2
Environment: scalding 0.15-SNAPSHOT per
https://github.com/twitter/scalding/pull/1446
cascading 3.1.0-wip-54
tez-0.8.2
OpenJDK 8 on AMD64
Hadoop 2.6.0 (YARN, HDFS); Apache distribution
Debian Linux 8
8 * Intel Core i7-3770K
Reporter: Cyrille Chépélov
While running a (fairly complex) scalding DAG that worked fine under
tez-0.6.2, the run time suddenly became extremely large under tez-0.8.2.
Setting "tez.runtime.sorter.class" back to "LEGACY" restored the previous behaviour.
The difficulties can be traced to code of this shape:
{code:scala}
val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset
x
  .group
  .mapValues(_ => 1L) // replace every value with 1 so that .sum counts records per key
  .sum
  .write(TypedTsvHeader("foo.tsv", ('key, 'count)))
{code}
where the incoming data contains a very large number of distinct keys. The
observed behaviour of PipelinedSorter is that several hundred thousand files
are written flat into the same per-TezChild local temporary directories, and
things become very slow (not alleging any causality).
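
To help others check whether they see the same file-count symptom, here is a minimal sketch that counts the regular files sitting directly in each directory under a given root. It uses only the JDK and the Scala standard library. The names CountSpillFiles and filesPerDirectory, and the default path, are hypothetical and not part of Tez; the real location depends on the local-dirs setting of the YARN NodeManager on the machine. It makes no assumption about how the sorter names its files.
{code:scala}
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._ // Scala 2.11/2.12; deprecated but still available in 2.13

object CountSpillFiles {
  /** For every directory below `root`, count the regular files placed
    * directly in it, and return the directories with the largest count first. */
  def filesPerDirectory(root: Path): Seq[(Path, Long)] = {
    val walk = Files.walk(root) // lazy, depth-first traversal; must be closed
    try {
      walk.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .toVector                 // force the traversal before the stream is closed
        .groupBy(_.getParent)     // bucket files by their containing directory
        .map { case (dir, files) => dir -> files.size.toLong }
        .toSeq
        .sortBy { case (_, count) => -count }
    } finally walk.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical default: pass the actual YARN local directory as the first argument.
    val root = Paths.get(args.headOption.getOrElse("/tmp/hadoop/nm-local-dir"))
    filesPerDirectory(root).take(10).foreach { case (dir, count) =>
      println(s"$count\t$dir")
    }
  }
}
{code}
Running it while the slow vertex is executing, once with each sorter setting, would show whether the flat directories appear only with PipelinedSorter. Note that Files.walk can fail with an UncheckedIOException if a directory disappears or is unreadable during the traversal, which is likely on a live node.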