[ https://issues.apache.org/jira/browse/TAJO-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095450#comment-14095450 ]

ASF GitHub Bot commented on TAJO-992:
-------------------------------------

GitHub user babokim opened a pull request:

    https://github.com/apache/tajo/pull/115

    TAJO-992: Reduce number of hash shuffle output file.

    To do this I added the following features.
    - HashShuffleAppender, of which a single instance is created per
      ExecutionBlock and partition in a worker.
      All tasks of an execution block on a worker therefore share one
      HashShuffleAppender. Each task's HashShuffleWriteExec calls
      HashShuffleAppender.appends() once every
      'tajo.shuffle.hash.appender.buffer.size' tuples (default: 10,000), so
      that the lock is coarse-grained.
    - Splittable IntermediateEntry
      A large intermediate file is difficult to process with multiple tasks.
      The new IntermediateEntry class has page metadata that records a start
      position and length for every
      'tajo.shuffle.hash.appender.page.volumn-mb' of data (default: 30MB).
      The Repartitioner class uses that metadata to create a proper number of
      tasks.
    - Failure-aware IntermediateEntry
      If a task fails, its tuples should be removed from the intermediate
      file, but this is impossible because those tuples have already been
      written to the file. For this reason IntermediateEntry keeps tuple
      index metadata for each task, which RawFile's scanner can use. This
      patch does not use that metadata yet; I'll create another one for that.
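
The first two features can be illustrated with a small sketch. This is not the code in the patch: the class name SharedAppenderSketch, its append() and close() methods, and the in-memory buffer standing in for the shared intermediate file are all assumptions made for illustration. It shows one synchronized append per buffered batch of tuples, and page metadata (start position and length) recorded each time the written volume reaches a configured page size.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a shared appender; not Tajo's HashShuffleAppender. */
final class SharedAppenderSketch {

    /** Start position and length of one page in the shared output. */
    static final class Page {
        final long offset;
        final long length;

        Page(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Stands in for the single intermediate file shared by all tasks.
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private final List<Page> pages = new ArrayList<>();
    private final long pageVolumeBytes;
    private long pageStart = 0;

    SharedAppenderSketch(long pageVolumeBytes) {
        this.pageVolumeBytes = pageVolumeBytes;
    }

    /**
     * Appends one buffered batch of serialized tuples. Tasks call this once
     * per batch, so the lock is taken once per batch, not once per tuple.
     */
    synchronized void append(List<byte[]> batch) {
        for (byte[] tuple : batch) {
            out.write(tuple, 0, tuple.length);
        }
        // Close the current page at a batch boundary once it is large enough.
        if (out.size() - pageStart >= pageVolumeBytes) {
            pages.add(new Page(pageStart, out.size() - pageStart));
            pageStart = out.size();
        }
    }

    /** Closes the last partial page, if any, and returns the page metadata. */
    synchronized List<Page> close() {
        if (out.size() > pageStart) {
            pages.add(new Page(pageStart, out.size() - pageStart));
            pageStart = out.size();
        }
        return new ArrayList<>(pages);
    }
}
```

With page metadata of this kind, a scheduler could assign each page, or a group of pages, to its own task instead of treating the whole file as one unit.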

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/babokim/tajo TAJO-992

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/115.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #115
    
----
commit f020bdd0ead06de5903a251fe02a534880420e35
Author: 김형준 <[email protected]>
Date:   2014-08-05T11:26:38Z

    TAJO-992: Reduce number of hash shuffle output file.

commit 36c98e20d118c8f217d7c065b574136847174f8a
Author: 김형준 <[email protected]>
Date:   2014-08-05T12:15:41Z

    Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo
    
    Conflicts:
        tajo-core/src/main/java/org/apache/tajo/worker/Fetcher.java

commit 06045064ec32b6111ece0abf7343402e419ca608
Author: 김형준 <[email protected]>
Date:   2014-08-06T13:57:35Z

    TAJO-992: Reduce number of hash shuffle output file.

commit 028f498eb18c9094b8ac7641d628ec58e3ffb605
Author: HyoungJun Kim <[email protected]>
Date:   2014-08-11T21:37:52Z

    TAJO-992: Reduce number of hash shuffle output file.
    Splittable IntermediateEntry.

commit e02f0cdf14b502dd949cf9cc5e7c0893ec312e10
Author: HyoungJun Kim <[email protected]>
Date:   2014-08-12T05:56:36Z

    TAJO-992: Reduce number of hash shuffle output file.
    Add some debug logs

commit 2d49339111be67d058158431a94680ab2749000d
Author: HyoungJun Kim <[email protected]>
Date:   2014-08-12T06:09:51Z

    TAJO-992: Reduce number of hash shuffle output file.
    Remove unused log

commit 88775ef40071c8c40eec63371c8e3523658886f0
Author: HyoungJun Kim <[email protected]>
Date:   2014-08-13T11:34:07Z

    Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into TAJO-992
    
    Conflicts:
        tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java
        tajo-core/src/main/java/org/apache/tajo/master/GlobalEngine.java
        tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java
        tajo-core/src/main/java/org/apache/tajo/worker/Task.java
        tajo-core/src/main/java/org/apache/tajo/worker/TaskAttemptContext.java
        tajo-core/src/test/java/org/apache/tajo/engine/planner/physical/TestPhysicalPlanner.java
        tajo-core/src/test/java/org/apache/tajo/engine/query/TestTablePartitions.java
        tajo-core/src/test/java/org/apache/tajo/master/TestRepartitioner.java

commit 98e6314ab4974453035647f2bc78940fcb096d9e
Author: HyoungJun Kim <[email protected]>
Date:   2014-08-13T12:36:06Z

    TAJO-992: Reduce number of hash shuffle output file.
    Fix a wrong calculation of Bytes in StorageUnit

----


> Reduce number of hash shuffle output file.
> ------------------------------------------
>
>                 Key: TAJO-992
>                 URL: https://issues.apache.org/jira/browse/TAJO-992
>             Project: Tajo
>          Issue Type: Sub-task
>          Components: data shuffle
>            Reporter: Hyoungjun Kim
>            Assignee: Hyoungjun Kim
>
> Currently Tajo creates too many intermediate files for hash shuffle. An
> execution block (SubQuery) on a TajoWorker creates intermediate files
> according to the following rule:
>   # intermediate files in a worker = # tasks / # workers * # partitions
> This may cause 'too many open files' errors and makes it difficult to scale
> out. To solve this problem, we should reduce the number of hash shuffle
> output files.
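
As a rough worked example of the rule above, the sketch below compares the per-worker file count under that rule with the count when one appender per partition is shared by all tasks of an execution block. The class and method names are hypothetical, and the "after" figure assumes that each shared appender writes exactly one file; the issue text does not state this.

```java
/** Hypothetical helper for the file-count rule quoted in the issue. */
final class ShuffleFileCount {

    /**
     * Files per worker under the quoted rule:
     * (tasks / workers) * partitions, using integer division.
     */
    static long before(long tasks, long workers, long partitions) {
        return tasks / workers * partitions;
    }

    /**
     * Files per worker for one execution block, assuming one shared
     * appender (and therefore one file) per partition.
     */
    static long after(long partitions) {
        return partitions;
    }
}
```

For instance, 1,000 tasks on 10 workers with 32 partitions gives 100 * 32 = 3,200 files per worker under the rule, against 32 under the one-file-per-partition assumption.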



--
This message was sent by Atlassian JIRA
(v6.2#6252)
