[ 
https://issues.apache.org/jira/browse/TEZ-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339044#comment-15339044
 ] 

Tsuyoshi Ozawa commented on TEZ-3113:
-------------------------------------

> Observed behaviour of PipelinedSorter is that several hundred thousand 
> different files are put flat in the same per-TezChild local temporary 
> directories, and things become very slow (not alleging any causality)

I think this behaviour can cause file-system-level lock contention when many 
threads access the same directory.

One possible solution is to partition the files, either by capping the number 
of files per directory or by changing the directory structure. 
[~rajesh.balamohan], what do you think?
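
To make the "upper limit per directory" idea concrete, here is a minimal 
sketch. It is not Tez code: the object name, the spill_<n>.out file naming, 
and the maxPerDir parameter are all hypothetical, and it assumes spill files 
carry a sequential index. Each file goes into a numbered subdirectory chosen 
by integer division, so no subdirectory ever holds more than maxPerDir files.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Tez: spreads spill files over numbered
// subdirectories so that each one holds at most `maxPerDir` files.
object PartitionedSpillDir {

  // Compute the target path for the spill file with the given index.
  // Indices 0 until maxPerDir land in subdirectory 00000, the next
  // maxPerDir indices in 00001, and so on.
  def pathFor(baseDir: Path, spillIndex: Int, maxPerDir: Int): Path = {
    require(spillIndex >= 0, "spillIndex must be non-negative")
    require(maxPerDir > 0, "maxPerDir must be positive")
    val bucket = spillIndex / maxPerDir
    baseDir.resolve(f"$bucket%05d").resolve(s"spill_$spillIndex.out")
  }

  // Create the subdirectory on demand, then the (empty) spill file itself.
  def create(baseDir: Path, spillIndex: Int, maxPerDir: Int): Path = {
    val target = pathFor(baseDir, spillIndex, maxPerDir)
    Files.createDirectories(target.getParent)
    Files.createFile(target)
  }
}
```

If the files had no sequential index, hashing the file name into a fixed 
number of subdirectories would be an alternative, at the cost of only 
bounding the per-directory count on average rather than strictly.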

> massive increase of run time using PipelinedSorter rather than DefaultSorter
> ----------------------------------------------------------------------------
>
>                 Key: TEZ-3113
>                 URL: https://issues.apache.org/jira/browse/TEZ-3113
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.2
>         Environment: scalding 0.15-SNAPSHOT per 
> https://github.com/twitter/scalding/pull/1446
> cascading 3.1.0-wip-54
> tez-0.8.2
> OpenJDK 8 on AMD64
> Hadoop 2.6.0 (YARN, HDFS); Apache distribution
> Debian Linux 8
> 8 * Intel Core i7-3770K 
>            Reporter: Cyrille Chépélov
>
> While running a (fairly complex) scalding DAG that was working fine using 
> tez-0.6.2, now under tez-0.8.2, the run time became suddenly extremely large.
> Reverting "tez.runtime.sorter.class" -> "LEGACY" restored proper behaviour.
> Difficulties can be traced to this shape of code:
> {code:scala}
> val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset 
> x
>   .group
>   .mapValues(x => 1L)
>   .sum
>   .write(TypedTsvHeader("foo.tsv", ('key, 'count)))
> {code}
> where the incoming data contains many, many different keys. Observed 
> behaviour of PipelinedSorter is that several hundred thousand different files 
> are put flat in the same per-TezChild local temporary directories, and 
> things become very slow (not alleging any causality).



