Ming Ma created TEZ-3818:
----------------------------

             Summary: Support a new data routing policy for small partitions 
                 Key: TEZ-3818
                 URL: https://issues.apache.org/jira/browse/TEZ-3818
             Project: Apache Tez
          Issue Type: Sub-task
            Reporter: Ming Ma


Under the existing fair shuffle manager data routing policies of 
fair_parallelism and increase_parallelism, small partitions (total size up to 
the max desirable limit) are processed together by a single destination task.

We have the following use case that will prefer having one destination task 
process one small partition while still having multiple destination tasks 
process one large partition. When destination vertex is connected to 
MultiMROutput and the output format is parquet output format, each instance of 
parquet output stream consumes extra memory. So if a destination task ends up 
processing lots of small partitions, it ends up exceeding the task memory limit.

With the new data routing policy, here is the summary of what each data routing 
policy does.

* reduce_parallelism. The parallelism is decreased to a desired level by having 
one destination task process multiple consecutive partitions.
* fair_parallelism. The parallelism is adjusted to a desired level by having 
one destination task process multiple consecutive small partitions and multiple 
destination tasks process one large partition.
* The new increase_parallelism. The parallelism is increased to a desired level 
by having one destination task process each small partition and multiple 
destination tasks process one large partition.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to