[
https://issues.apache.org/jira/browse/TEZ-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ming Ma updated TEZ-3206:
-------------------------
Attachment: TEZ-3206.patch
Thanks [~sseth]. Yes, it is due to the fact that Tez uses fewer bits to encode
the size.
Here is the draft patch to have {{UnorderedPartitionedKVWriter}} send partition
stats in terms of compressed output size via VertexManagerEvent. Because the
compressed size is only available after a spill, and the stats update can run
on the spill-finish callback threads, the patch makes the global stats
thread-safe.
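To illustrate the thread-safety point, here is a minimal sketch of a per-partition
accumulator that several spill-finish callback threads could update concurrently.
It is not taken from the patch: the class {{PartitionStatsAccumulator}} and its
methods {{addSpill}} and {{snapshot}} are hypothetical names, and the use of
{{AtomicLongArray}} is only one possible way to do it.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Hypothetical helper (not part of Tez or of the attached patch): keeps a
 * running total of compressed bytes per partition across all spills.
 */
final class PartitionStatsAccumulator {
    // One atomically updated counter per partition.
    private final AtomicLongArray compressedBytes;

    PartitionStatsAccumulator(int numPartitions) {
        this.compressedBytes = new AtomicLongArray(numPartitions);
    }

    /**
     * Adds the compressed per-partition sizes of one finished spill.
     * Safe to call from several callback threads at once, because each
     * counter is updated with an atomic add.
     */
    void addSpill(long[] spillSizes) {
        if (spillSizes.length != compressedBytes.length()) {
            throw new IllegalArgumentException("unexpected number of partitions");
        }
        for (int p = 0; p < spillSizes.length; p++) {
            if (spillSizes[p] != 0L) {
                compressedBytes.addAndGet(p, spillSizes[p]);
            }
        }
    }

    /** Copies the current totals, e.g. when building the event payload. */
    long[] snapshot() {
        long[] totals = new long[compressedBytes.length()];
        for (int p = 0; p < totals.length; p++) {
            totals[p] = compressedBytes.get(p);
        }
        return totals;
    }
}
```

Each counter is updated atomically, but {{snapshot}} does not freeze all counters
at once, so it should be read after the last spill has finished if exact totals
are needed.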
Note that in the current protocol for the sorted partitioned case,
{{VertexManagerEventPayloadProto.Builder}}'s {{setOutputSize}} takes the
uncompressed size, but {{setPartitionStats}} takes the compressed size. Based on
how {{ShuffleVertexManager}} consumes partition stats, it doesn't matter whether
the size is compressed or not.
Maybe we should use the uncompressed size for partition stats? If so, the patch
will be simpler, and I can file a separate jira to switch the sorted partitioned
case to send the uncompressed size as well.
> Have unordered partitioned KV output send partition stats via
> VertexManagerEvent
> ---------------------------------------------------------------------------------
>
> Key: TEZ-3206
> URL: https://issues.apache.org/jira/browse/TEZ-3206
> Project: Apache Tez
> Issue Type: New Feature
> Reporter: Ming Ma
> Attachments: TEZ-3206.patch
>
>
> As part of the auto-parallelism feature, ordered partitioned KV output's
> partition stats are sent to ShuffleVertexManager via VertexManagerEvent. But
> this isn't available for unordered partitioned output. Having
> {{UnorderedPartitionedKVWriter}} send partition stats will enable the
> auto-parallelism support for unordered KV or other custom data routing
> mechanisms that depend on partition size.