[
https://issues.apache.org/jira/browse/TAJO-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095450#comment-14095450
]
ASF GitHub Bot commented on TAJO-992:
-------------------------------------
GitHub user babokim opened a pull request:
https://github.com/apache/tajo/pull/115
TAJO-992: Reduce number of hash shuffle output file.
To achieve this, I added the following features.
- HashShuffleAppender, of which a single instance is created per
ExecutionBlock and partition in a worker.
Therefore, all tasks of an execution block in a worker share a
HashShuffleAppender. Each task's HashShuffleWriteExec calls
HashShuffleAppender.appends() once every 'tajo.shuffle.hash.appender.buffer.size'
tuples (default: 10,000) so that locking stays coarse-grained.
- Splittable IntermediateEntry
If an intermediate file is large, it is difficult to process it with multiple
tasks. The new IntermediateEntry class has page metadata that records a start
position and length for every 'tajo.shuffle.hash.appender.page.volumn-mb'
(default: 30MB) of data. The Repartitioner class uses this metadata to create
an appropriate number of tasks.
- Failure-aware IntermediateEntry
If a task fails, its tuples in the intermediate file should be removed.
But this is impossible because those tuples have already been written to
the file. For this reason, IntermediateEntry holds per-task tuple index
metadata, which RawFile's scanner can use. This patch does not use that
metadata yet. I'll create another one for this.
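
The first two features can be illustrated with a minimal sketch. This is not
the code in the patch: SharedAppenderSketch, TaskWriter, Page, and every method
name below are hypothetical, a StringBuilder stands in for the partition file,
and sizes are counted in characters rather than megabytes. It only shows the
idea of each task buffering tuples locally, taking the shared lock once per
batch, and the appender recording page start positions and lengths as it goes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these are not Tajo's actual classes or signatures.
final class SharedAppenderSketch {

    /** Start position and length of one page of the shared output. */
    static final class Page {
        final long start;
        final long length;

        Page(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    private final StringBuilder output = new StringBuilder(); // stands in for the partition file
    private final List<Page> pages = new ArrayList<>();
    private final long pageVolume; // page size threshold, in characters here
    private long pageStart = 0;

    SharedAppenderSketch(long pageVolume) {
        this.pageVolume = pageVolume;
    }

    /** Appends a whole batch under one lock acquisition (coarse-grained locking). */
    synchronized void append(List<String> batch) {
        for (String tuple : batch) {
            output.append(tuple);
        }
        // The boundary is checked per batch, so a page may exceed the threshold.
        if (output.length() - pageStart >= pageVolume) {
            closePage();
        }
    }

    /** Closes the trailing partial page, if any, and returns a copy of the page metadata. */
    synchronized List<Page> finish() {
        if (output.length() > pageStart) {
            closePage();
        }
        return new ArrayList<>(pages);
    }

    private void closePage() {
        pages.add(new Page(pageStart, output.length() - pageStart));
        pageStart = output.length();
    }

    /** Per-task writer that buffers tuples locally before touching the shared appender. */
    static final class TaskWriter {
        private final SharedAppenderSketch appender;
        private final int bufferSize;
        private final List<String> buffer = new ArrayList<>();

        TaskWriter(SharedAppenderSketch appender, int bufferSize) {
            this.appender = appender;
            this.bufferSize = bufferSize;
        }

        void write(String tuple) {
            buffer.add(tuple);
            if (buffer.size() >= bufferSize) {
                flush();
            }
        }

        /** Hands any buffered tuples to the shared appender in a single call. */
        void flush() {
            if (!buffer.isEmpty()) {
                appender.append(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
    }
}
```

In this sketch a consumer could assign one task per Page, or group several
pages per task, instead of one task per file, which is the role the
description gives to the Repartitioner.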
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/babokim/tajo TAJO-992
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tajo/pull/115.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #115
----
commit f020bdd0ead06de5903a251fe02a534880420e35
Author: 김형준 <[email protected]>
Date: 2014-08-05T11:26:38Z
TAJO-992: Reduce number of hash shuffle output file.
commit 36c98e20d118c8f217d7c065b574136847174f8a
Author: 김형준 <[email protected]>
Date: 2014-08-05T12:15:41Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo
Conflicts:
tajo-core/src/main/java/org/apache/tajo/worker/Fetcher.java
commit 06045064ec32b6111ece0abf7343402e419ca608
Author: 김형준 <[email protected]>
Date: 2014-08-06T13:57:35Z
TAJO-992: Reduce number of hash shuffle output file.
commit 028f498eb18c9094b8ac7641d628ec58e3ffb605
Author: HyoungJun Kim <[email protected]>
Date: 2014-08-11T21:37:52Z
TAJO-992: Reduce number of hash shuffle output file.
Splittable IntermediateEntry.
commit e02f0cdf14b502dd949cf9cc5e7c0893ec312e10
Author: HyoungJun Kim <[email protected]>
Date: 2014-08-12T05:56:36Z
TAJO-992: Reduce number of hash shuffle output file.
Add some debug logs
commit 2d49339111be67d058158431a94680ab2749000d
Author: HyoungJun Kim <[email protected]>
Date: 2014-08-12T06:09:51Z
TAJO-992: Reduce number of hash shuffle output file.
Remove unused log
commit 88775ef40071c8c40eec63371c8e3523658886f0
Author: HyoungJun Kim <[email protected]>
Date: 2014-08-13T11:34:07Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into
TAJO-992
Conflicts:
tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java
tajo-core/src/main/java/org/apache/tajo/master/GlobalEngine.java
tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java
tajo-core/src/main/java/org/apache/tajo/worker/Task.java
tajo-core/src/main/java/org/apache/tajo/worker/TaskAttemptContext.java
tajo-core/src/test/java/org/apache/tajo/engine/planner/physical/TestPhysicalPlanner.java
tajo-core/src/test/java/org/apache/tajo/engine/query/TestTablePartitions.java
tajo-core/src/test/java/org/apache/tajo/master/TestRepartitioner.java
commit 98e6314ab4974453035647f2bc78940fcb096d9e
Author: HyoungJun Kim <[email protected]>
Date: 2014-08-13T12:36:06Z
TAJO-992: Reduce number of hash shuffle output file.
Fix a wrong calculation of Bytes in StorageUnit
----
> Reduce number of hash shuffle output file.
> ------------------------------------------
>
> Key: TAJO-992
> URL: https://issues.apache.org/jira/browse/TAJO-992
> Project: Tajo
> Issue Type: Sub-task
> Components: data shuffle
> Reporter: Hyoungjun Kim
> Assignee: Hyoungjun Kim
>
> Currently Tajo creates too many intermediate files in the case of hash
> shuffle. An execution block (SubQuery) on a TajoWorker creates intermediate
> files according to the following rule:
> # intermediate files in a worker = # tasks / # workers * # partitions
> This may cause a 'too many file opens' error and makes it difficult to scale
> out. To solve this problem, we should reduce the number of hash shuffle
> output files.
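
As a rough worked example of the rule quoted above, the following sketch
compares the per-worker file count before and after the change. The class and
method names are hypothetical and the input numbers are made up for
illustration. The "after" figure assumes exactly one shared output file per
partition for an execution block on a worker, as the pull request describes.

```java
// Hypothetical helper; it only restates the arithmetic from the issue description.
final class ShuffleFileCount {

    /** Files per worker under the quoted rule: tasks / workers * partitions. */
    static long before(long tasks, long workers, long partitions) {
        return tasks / workers * partitions;
    }

    /** Files per worker for one execution block when each partition has one shared file. */
    static long after(long partitions) {
        return partitions;
    }

    public static void main(String[] args) {
        // Made-up cluster: 1,000 tasks, 10 workers, 32 partitions.
        System.out.println(before(1000, 10, 32)); // 1000 / 10 * 32 = 3200
        System.out.println(after(32));            // 32
    }
}
```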
--
This message was sent by Atlassian JIRA
(v6.2#6252)