[ https://issues.apache.org/jira/browse/TEZ-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339044#comment-15339044 ]
Tsuyoshi Ozawa commented on TEZ-3113:
-------------------------------------

> Observed behaviour of PipelinedSorter is that several hundred thousand
> different files are put flat in the same per-TezChild local temporary
> directories, and things become very slow (not alleging any causality)

I think this behaviour can cause file-system-level lock contention when many threads access the same directory. One possible solution is partitioning, either by setting an upper limit on the number of files per directory or by changing the directory structure. [~rajesh.balamohan], what do you think?

> massive increase of run time using PipelinedSorter rather than DefaultSorter
> ----------------------------------------------------------------------------
>
>                 Key: TEZ-3113
>                 URL: https://issues.apache.org/jira/browse/TEZ-3113
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.2
>         Environment: scalding 0.15-SNAPSHOT per
> https://github.com/twitter/scalding/pull/1446
> cascading 3.1.0-wip-54
> tez-0.8.2
> OpenJDK 8 on AMD64
> Hadoop 2.6.0 (YARN, HDFS); Apache distribution
> Debian Linux 8
> 8 * Intel Core i7-3770K
>            Reporter: Cyrille Chépélov
>
> While running a (fairly complex) scalding DAG that was working fine using
> tez-0.6.2, now under tez-0.8.2, the run time suddenly became extremely large.
> Reverting "tez.runtime.sorter.class" -> "LEGACY" restored proper behaviour.
> Difficulties can be traced to this shape of code:
> {code:scala}
> val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset
> x
>   .group
>   .mapValues(x => 1L)
>   .sum
>   .write(TypedTsvHeader("foo.tsv", ('key, 'count)))
> {code}
> where the incoming data contains many, many different keys. Observed
> behaviour of PipelinedSorter is that several hundred thousand different files
> are put flat in the same per-TezChild local temporary directories, and things
> become very slow (not alleging any causality).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)