[ 
https://issues.apache.org/jira/browse/TEZ-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001220#comment-16001220
 ] 

Siddharth Seth commented on TEZ-3708:
-------------------------------------

Started looking at the patch. Some initial comments / questions.

- tez.cartesian-product.max-parallelism - Max set to 1000. I believe the intent 
here is to set this based on cluster capacity? (This is what would have been 
auto determined by Tez if a consistent cluster size was available on the APIs?)
- CartesianProductCombination: Why rename getChunkId to getTaskId?
- Is there an example job (similar to the HashJoin example) which shows 
usage of this edge? One should exist, and would serve as documentation. (It 
would also make reviewing easier, since there would be sample usage to look 
at to understand the various parameters.)
- numItemPerTask = maxParallelism;  <- Something seems incorrect here. Why is 
numItemPerTask the same as maxParallelism?
- minNumRecordForEstimation = (long) Math.pow(config.minOpsPerWorker, 1.0 / 
config.getSourceVertices().size()); <- Here as well. Why is this linked to 
minOpsPerWorker? Won't this value end up being very small?
- TestCartesianProductConfig - has references to null partitions, comments 
related to auto-grouping. Are all the tests / comments still valid?
- Auto-grouping configuration removed. Is there some way to configure the 
system to not generate partitions?
- Naming: CartesianProductVertexManagerUnpartitioned. "Unpartitioned" does not 
really apply any longer.
- No Round-Robin partitioner introduced? Isn't that what will be used to 
generate data for the various partitions? This goes back to adding an example.
- Vertex-Group - "Chunk i of a source group contains chunk i of every vertex in 
this group." - Why the restriction, instead of treating a VertexGroup as a 
single source with its tasks grouped together, and effectively its partitions 
grouped together?
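
To make the minNumRecordForEstimation question concrete, here is a minimal 
sketch of how that expression behaves as the number of source vertices grows. 
Only the formula is taken from the patch; the class name, the method name and 
the minOpsPerWorker value of 1,000,000 are assumptions for illustration, not 
the patch's actual code or defaults.

```java
// Hypothetical stand-alone sketch; only the formula is quoted from the patch.
final class EstimationThresholdSketch {

    // Mirrors: (long) Math.pow(config.minOpsPerWorker,
    //                          1.0 / config.getSourceVertices().size())
    static long minNumRecordForEstimation(long minOpsPerWorker, int numSourceVertices) {
        // The cast truncates toward zero, so a result just under an integer rounds down.
        return (long) Math.pow(minOpsPerWorker, 1.0 / numSourceVertices);
    }

    public static void main(String[] args) {
        long minOpsPerWorker = 1_000_000L; // assumed value, for illustration only
        for (int n = 1; n <= 4; n++) {
            System.out.println(n + " source vertices -> "
                + minNumRecordForEstimation(minOpsPerWorker, n) + " records");
        }
    }
}
```

Under these assumed inputs the threshold falls from a million records with one 
source vertex to roughly a thousand with two and a few dozen with four, which 
is the "very small" case the question above is about.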



> Improve parallelism and auto grouping of unpartitioned cartesian product
> ------------------------------------------------------------------------
>
>                 Key: TEZ-3708
>                 URL: https://issues.apache.org/jira/browse/TEZ-3708
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Zhiyuan Yang
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3708.1.patch, TEZ-3708.2.patch
>
>
> Current unpartitioned cartesian product has a few limitations:
> 1. parallelism may be insufficient in the case of large splits and a small # 
> of src tasks
> 2. parallelism may be excessive in the case of a large # of src tasks
> 3. workload is not ideally distributed across the workers. Even with auto 
> grouping, grouping by size may not be accurate because the same size can mean 
> a different # of records and different cartesian product ops.



