[ 
https://issues.apache.org/jira/browse/TEZ-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003790#comment-16003790
 ] 

Zhiyuan Yang commented on TEZ-3708:
-----------------------------------

Thanks [~sseth] for review!
bq. tez.cartesian-product.max-parallelism - Max set to 1000. I believe the 
intent here is to set this based on cluster capacity? (This is what would have 
been auto determined by Tez if a consistent cluster size was available on the 
APIs?)
Yes, but unfortunately we don't have a reliable way to do that.

bq. CartesianProductCombination: Why rename getChunkId to getTaskId
Originally it was getTaskId but was changed to getChunkId by mistake.
  
bq. Is there an example job (similar to HashJoin as an example), which shows 
usage of this edge. That should exist, and serves as documentation. (Would also 
make reviewing easier since there's sample usage to look at, and understand 
various parameters)
I've added a CartesianProduct example job that uses this new edge on fake input.

bq. numItemPerTask = maxParallelism; <- Something seems incorrect here. Why is 
numItemsPerTask same as maxParallelism
I've renamed numItemPerTask to numPartition, which is what it should have been. 
It is now user-configurable, and its default value is the n-th root of 
maxParallelism.
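To make the default concrete, here is a minimal sketch of an n-th root computation. It assumes n is the number of source vertices and that the result is rounded down; neither is confirmed by the patch, and the class and method names are hypothetical. The integer correction is there because the floating-point root can land just below the exact value (for example for 1000 and n = 3).

```java
public final class DefaultPartitionSketch {

    /** Largest k >= 1 such that k^numSources <= maxParallelism (assumed rounding rule). */
    static int nthRootFloor(int maxParallelism, int numSources) {
        if (maxParallelism < 1 || numSources < 1) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        // Floating-point estimate; it may be off by one in either direction.
        int k = Math.max(1, (int) Math.round(Math.pow(maxParallelism, 1.0 / numSources)));
        // Correct the estimate with exact integer arithmetic.
        while (k > 1 && power(k, numSources) > maxParallelism) {
            k--;
        }
        while (power(k + 1L, numSources) <= maxParallelism) {
            k++;
        }
        return k;
    }

    /** base^exp for base >= 1, saturating at Long.MAX_VALUE instead of overflowing. */
    static long power(long base, int exp) {
        long result = 1;
        for (int i = 0; i < exp; i++) {
            if (result > Long.MAX_VALUE / base) {
                return Long.MAX_VALUE;
            }
            result *= base;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(nthRootFloor(1000, 2)); // 31
        System.out.println(nthRootFloor(1000, 3)); // 10
    }
}
```

With maxParallelism = 1000, two sources would each get 31 partitions (31 * 31 = 961 tasks) and three sources would each get 10 (1000 tasks), so the product stays within the cap.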

bq. minNumRecordForEstimation = (long) Math.pow(config.minOpsPerWorker, 1.0 / 
config.getSourceVertices().size()); <- Here as well. Why is this linked to 
minOpsPerWorker. Won't this value end up being very small?
Changed it to a larger value.

bq. TestCartesianProductConfig - has references to null partitions, comments 
related to auto-grouping. Are all the tests / comments still valid?
Test updated.

bq. Auto-grouping configuration removed. Is there some way to configure the 
system to not generate partitions?
I've added a knob to disable grouping, which makes it work like the previous 
unpartitioned cartesian product. In the case of a vertex group, it's still a 
little different.

bq. Naming: CartesianProductVertexManagerUnpartitioned. "Unpartitioned" does 
not really apply any longer.
New name "FairCartesianProductVertexManager" is used now.

bq. No Round-Robin partitioned introduced? Isn't that what will be used to 
generate data to the various partitions. Goes back to adding an example.
Added one.
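
For readers unfamiliar with the idea, a round-robin partitioner can be sketched as below. This is an illustration only, not the class from the patch: the class name is hypothetical, and the getPartition(key, value, numPartitions) shape is assumed to resemble a typical key/value partitioner interface. It ignores the record contents and cycles through the partitions so that each one receives roughly the same number of records.

```java
public final class RoundRobinPartitionerSketch {

    // Index of the partition that the next record will be sent to.
    private int next = 0;

    /** Ignores key and value; returns 0, 1, ..., numPartitions - 1, 0, ... in turn. */
    public int getPartition(Object key, Object value, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // The modulo keeps the index valid even if numPartitions changes between calls.
        int partition = next % numPartitions;
        next = (partition + 1) % numPartitions;
        return partition;
    }

    public static void main(String[] args) {
        RoundRobinPartitionerSketch p = new RoundRobinPartitionerSketch();
        for (int i = 0; i < 5; i++) {
            System.out.print(p.getPartition("k" + i, "v" + i, 3) + " "); // 0 1 2 0 1
        }
        System.out.println();
    }
}
```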

Also, CPVMConfig and CPEMConfig are removed in the new patch since they are 
internal only and are no different from the raw proto.

> Improve parallelism and auto grouping of unpartitioned cartesian product
> ------------------------------------------------------------------------
>
>                 Key: TEZ-3708
>                 URL: https://issues.apache.org/jira/browse/TEZ-3708
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Zhiyuan Yang
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3708.1.patch, TEZ-3708.2.patch, TEZ-3708.3.patch
>
>
> Current unpartitioned cartesian product has a few limitations:
> 1. parallelism may be insufficient in the case of large splits and a small # of src tasks
> 2. parallelism may be excessive in the case of a large # of src tasks
> 3. workload is not ideally distributed across the workers. Even with auto 
> grouping, grouping by size may not be accurate because the same size can mean 
> a different # of records and a different # of cartesian product ops.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
