Ming Ma created TEZ-3818:
----------------------------

             Summary: Support a new data routing policy for small partitions
                 Key: TEZ-3818
                 URL: https://issues.apache.org/jira/browse/TEZ-3818
             Project: Apache Tez
          Issue Type: Sub-task
            Reporter: Ming Ma

Under the existing fair shuffle manager data routing policies, reduce_parallelism and fair_parallelism, small partitions (total size up to the max desirable limit) are processed together by a single destination task.

We have a use case that prefers having one destination task process one small partition, while still having multiple destination tasks process one large partition. When the destination vertex is connected to MultiMROutput and the output format is Parquet, each Parquet output stream instance consumes extra memory. So if a destination task ends up processing many small partitions, it can exceed the task memory limit.

With the new data routing policy added, here is a summary of what each data routing policy does:

* reduce_parallelism. The parallelism is decreased to a desired level by having one destination task process multiple consecutive partitions.
* fair_parallelism. The parallelism is adjusted to a desired level by having one destination task process multiple consecutive small partitions and multiple destination tasks process one large partition.
* increase_parallelism (new). The parallelism is increased to a desired level by having one destination task process each small partition and multiple destination tasks process one large partition.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
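
To make the difference between the three routing policies concrete, here is a minimal sketch of how partitions could be assigned to destination tasks under each one. It is a simplified model and does not use any Tez API: the names `Task`, `route`, and `limit` are hypothetical, partition sizes are assumed to be known up front, and a "large" partition is assumed to be one whose size exceeds `limit`.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Tuple


@dataclass(frozen=True)
class Task:
    """One destination task in this toy model (hypothetical type)."""
    partitions: Tuple[int, ...]  # indices of the partitions it reads
    share: int                   # which slice of a split partition (0 if unsplit)
    num_shares: int              # number of tasks sharing that partition


def route(sizes: List[int], limit: int, policy: str) -> List[Task]:
    """Assign partitions to destination tasks under the given policy."""
    tasks: List[Task] = []
    group: List[int] = []
    group_size = 0

    def flush() -> None:
        # Emit the pending group of consecutive partitions as one task.
        nonlocal group, group_size
        if group:
            tasks.append(Task(tuple(group), 0, 1))
            group, group_size = [], 0

    for idx, size in enumerate(sizes):
        if size > limit and policy != "reduce_parallelism":
            # Large partition: split it across several tasks.
            flush()
            n = ceil(size / limit)
            tasks.extend(Task((idx,), s, n) for s in range(n))
        elif policy == "increase_parallelism":
            # Small partition: always gets its own task.
            tasks.append(Task((idx,), 0, 1))
        else:
            # Group consecutive partitions while the total stays within limit.
            if group and group_size + size > limit:
                flush()
            group.append(idx)
            group_size += size
    flush()
    return tasks


sizes = [10, 20, 300, 5, 5]
for policy in ("reduce_parallelism", "fair_parallelism", "increase_parallelism"):
    print(policy, len(route(sizes, 100, policy)))
```

In this model, only increase_parallelism guarantees that a task handling a small partition handles exactly one partition, which is the property the Parquet use case above relies on to bound the number of open output streams per task.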