[
https://issues.apache.org/jira/browse/TAJO-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224491#comment-15224491
]
ASF GitHub Bot commented on TAJO-2111:
--------------------------------------
GitHub user blrunner opened a pull request:
https://github.com/apache/tajo/pull/994
TAJO-2111: Optimize Partition Table Split Computation for Amazon S3
It depends on https://github.com/apache/tajo/pull/846 and
https://github.com/apache/tajo/pull/953
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/blrunner/tajo s3-split-improvement
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tajo/pull/994.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #994
----
commit 0d2a634d2353efdeecced4729be9f585789acdb1
Author: JaeHwa Jung <[email protected]>
Date: 2015-10-28T08:22:40Z
Implement PartitionedFileFragment
commit 4d7e73b7b20d1e6721b0f6b2ee53c4d04eb278d4
Author: JaeHwa Jung <[email protected]>
Date: 2015-10-28T09:10:53Z
Add unit test cases for PartitionedFileFragment
commit 6fab5adadb303e690f7377547f842f84eb1f9286
Author: JaeHwa Jung <[email protected]>
Date: 2015-10-29T07:25:47Z
Add PartitionedTableUtil for finding filtered partition directories.
commit b3bbcd188b0afc3b977f85005c0dffa20a8312dc
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T06:57:39Z
Remove the array of partition directories of rewrite rule and apply
PartitionedFileFragment.
commit 25163d0cdade5f45e7e524db4ceac4250b7ea805
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T07:01:56Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into
TAJO-1952
Conflicts:
tajo-core-tests/src/test/java/org/apache/tajo/engine/planner/physical/TestPhysicalPlanner.java
commit 4f711fa2ff7a18979198d80a70f283f73b91edf9
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T07:14:49Z
Remove unnecessary method
commit 33dc1407a3d1417a81895e5e36d528f64c88bbbe
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T07:22:24Z
Update comments
commit dede3e2957a2cee7bccd235a3f873aac0ab40377
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T07:40:15Z
Remove unnecessary constructor parameter of PhysicalPlannerImpl
commit ccc4f6cb2e12bd642d00be08f393f6754e74db7f
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T07:55:16Z
Remove unnecessary parameter of
PartitionedTableUtil::buildTupleFromPartitionName
commit d5f563a1d6764f21f80e91a2540a9de5330a38cf
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T07:59:32Z
Update wrong indent
commit 086b02beb700e125a6ba37cbe275965150a89183
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T07:59:57Z
Remove unused package
commit 22731ec4a13f1ad0e75d7987966c17715afbeb52
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T08:20:01Z
Update wrong comparison operator
commit 437f5ecdc7fad8b056bb638ea0897cd6e455b9b8
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-02T08:24:05Z
Update log message
commit d76f41aac39e7536f4acac265559fb136aa05b71
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-03T00:33:16Z
When rewriting PartitionedTableScanNode, set partition paths and table
volume.
commit 126f5e06de3aa88563281fd0c382d03f4afab5bf
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-03T00:36:27Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into
TAJO-1952
commit 9112ceb61547667020423bf4fbe18f99c07c2539
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-03T01:47:32Z
Update the result message of partition pruning
commit 71d65a5dec1571852448f0b349e121a9a0268a5e
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-05T08:57:54Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into
TAJO-1952
commit c53dfab62079707d1e106e81b1361bc5bc21d0ad
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-06T07:19:33Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into
TAJO-1952
commit a3af9ad23f55f43e0af8e0f4f84159f8448c9795
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-10T00:43:47Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into
TAJO-1952
commit 548bd4293fb3de0ab725bc2372d514ffd5a70a96
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-19T02:54:44Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into
TAJO-1952
Conflicts:
tajo-plan/src/main/java/org/apache/tajo/plan/logical/PartitionedTableScanNode.java
commit 66c1c496b9fb55ee2d872e843ffe2c2481adbd60
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-20T02:01:54Z
Remove unused member variable.
commit 25e23666ce826c5e0f2a64726f8e4e73ab204c2e
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-20T02:06:54Z
Remove unused method
commit c7f89f7b90cc65b8bdd294ab40426cebb73c99d0
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-20T02:14:53Z
Separate partition processing logic from existing split method.
commit e670f25eab4965bd3d5bcfbaf0540194a7ed37d9
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-20T02:22:30Z
Rename partitionName to partitionKeys in PartitionedFileFragmentProto
commit 83065d626fb2cc58e64fa17c3280191a44ddd471
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-21T15:16:13Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into
TAJO-1952
commit c21d06538a7b99afd4d43ec73a8832d99754c598
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-21T15:26:50Z
Rename PartitionedFileFragment to PartitionFileFragment
commit 9d92e540c1d65b073508e5f783dd6d8358663408
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-21T15:33:00Z
Recover partition paths in LogicalNode
commit f9fcd273abb8960ff1663ffd181410d26a2a6681
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-21T15:35:43Z
Add PartitionedTableWriter::buildTupleFromPartitionName
commit 344384b5e7708be4125ba7a3bf578f13030516b4
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-23T03:19:16Z
PartitionedTableRewriter should set PartitionContent
commit b327993ba1c4749621f41fcdd67339551b64ef4e
Author: JaeHwa Jung <[email protected]>
Date: 2015-11-23T03:22:21Z
Remove unused packages
----
> Optimize Partition Table Split Computation for Amazon S3
> --------------------------------------------------------
>
> Key: TAJO-2111
> URL: https://issues.apache.org/jira/browse/TAJO-2111
> Project: Tajo
> Issue Type: Sub-task
> Components: S3, Storage
> Reporter: Jaehwa Jung
> Assignee: Jaehwa Jung
>
> Currently, split computation for a partitioned table proceeds as follows:
> * List all partition directories of the specified partitioned table.
> * List all files in each partition directory.
> For example, assume a table with 1000 partitions, each of which contains
> 10 files. In this case, the AWS S3 API is called at least 1000 times (once
> per partition directory), which becomes a huge bottleneck.
> To improve the current computation, we should use {{S3::listObjects}} and
> implement the following algorithm to efficiently list multiple input
> locations:
> * Given a list of S3 locations, apply prefix listing to their common prefix
> to get the metadata of up to 1000 objects at a time.
> * While applying prefix listing in the above step, skip objects that do
> not fall under the input list of S3 locations, to avoid listing a
> large number of irrelevant objects in pathological cases.
> This approach was inspired by the following Qubole blog post:
> https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/.
>
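The two-step algorithm in the issue description can be sketched roughly as follows. This is a minimal illustration, not the code in the pull request and not the AWS SDK: `FakeStore` is a hypothetical in-memory stand-in for S3 that returns sorted keys page by page, and `commonPrefix`, `listFiles`, and the `t/p=NNNN/fN` key layout are names invented for this example. It assumes that no input location is a prefix of another and, as a simplification, only skips ahead at page boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

public class PrefixListingSketch {

  /** Hypothetical stand-in for S3: keys are kept sorted, as in an S3 listing. */
  static final class FakeStore {
    private final TreeSet<String> keys = new TreeSet<>();
    int calls = 0; // number of simulated list requests

    void put(String key) { keys.add(key); }

    /** Returns up to maxKeys keys that start with prefix and sort after marker. */
    List<String> list(String prefix, String marker, int maxKeys) {
      calls++;
      List<String> page = new ArrayList<>();
      Iterable<String> tail =
          (marker == null) ? keys.tailSet(prefix, true) : keys.tailSet(marker, false);
      for (String key : tail) {
        if (!key.startsWith(prefix) || page.size() == maxKeys) break;
        page.add(key);
      }
      return page;
    }
  }

  /** Longest prefix shared by all locations ("" if there are none). */
  static String commonPrefix(Collection<String> locations) {
    String prefix = null;
    for (String loc : locations) {
      if (prefix == null) { prefix = loc; continue; }
      int i = 0;
      while (i < prefix.length() && i < loc.length() && prefix.charAt(i) == loc.charAt(i)) i++;
      prefix = prefix.substring(0, i);
    }
    return prefix == null ? "" : prefix;
  }

  /** Lists all keys under the given locations using paged listing of their common prefix. */
  static List<String> listFiles(FakeStore store, Collection<String> inputLocations, int pageSize) {
    TreeSet<String> locations = new TreeSet<>(inputLocations);
    List<String> result = new ArrayList<>();
    if (locations.isEmpty()) return result;
    String prefix = commonPrefix(locations);
    String marker = null;
    while (true) {
      List<String> page = store.list(prefix, marker, pageSize);
      boolean lastRelevant = false;
      for (String key : page) {
        // Assumes locations are not nested, so the closest location at or
        // below the key is the only one that can contain it.
        String floor = locations.floor(key);
        lastRelevant = floor != null && key.startsWith(floor);
        if (lastRelevant) result.add(key);
      }
      if (page.size() < pageSize) break; // a short page means no keys are left
      String lastKey = page.get(page.size() - 1);
      // If the page ended on an irrelevant key, jump to the next input location
      // instead of paging through everything in between.
      marker = lastRelevant ? lastKey : locations.ceiling(lastKey);
      if (marker == null) break; // no input location remains after this key
    }
    return result;
  }

  public static void main(String[] args) {
    FakeStore store = new FakeStore();
    for (int p = 0; p < 1000; p++) {
      for (int f = 0; f < 10; f++) store.put(String.format("t/p=%04d/f%d", p, f));
    }
    List<String> files = listFiles(store, Arrays.asList("t/p=0000/", "t/p=0999/"), 1000);
    System.out.println(files.size() + " files found with " + store.calls + " list calls");
  }
}
```

In this sketch the number of list requests depends on the number of pages actually read rather than on the number of partitions, which is the point of the proposal. A real implementation would issue paged S3 list requests in place of `FakeStore.list` and would need to handle nested locations and directory-marker objects, which this example ignores.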