[ 
https://issues.apache.org/jira/browse/TEZ-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated TEZ-3206:
-------------------------
    Attachment: TEZ-3206-2.patch

Here is the new patch that addresses [~sseth]'s points. Note that the patch 
also switches to uncompressed size, to be consistent with {{OutputSize}} in 
the {{VertexManagerEvent}} payload. This also simplifies the concurrency 
handling, since the partition stats are now updated on the application thread 
that calls into the write method.
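
To illustrate why updating the stats on the writer thread removes the need for synchronization, here is a minimal sketch. It is not the code in the patch; {{PartitionSizeTracker}} and its methods are hypothetical names, and it assumes the uncompressed size of a record is simply key length plus value length in bytes.

```java
import java.util.Arrays;

// Hypothetical helper, not a Tez class: accumulates uncompressed bytes per partition.
// It assumes all updates come from the single thread that calls write(), so the
// counters need neither locks nor atomics.
final class PartitionSizeTracker {
  private final long[] sizePerPartition;

  PartitionSizeTracker(int numPartitions) {
    if (numPartitions <= 0) {
      throw new IllegalArgumentException("numPartitions must be positive");
    }
    this.sizePerPartition = new long[numPartitions];
  }

  // Record one key/value pair routed to the given partition.
  // Sizes are the serialized, uncompressed byte lengths (an assumption of this sketch).
  void recordWritten(int partition, int keyLength, int valueLength) {
    sizePerPartition[partition] += (long) keyLength + valueLength;
  }

  // Copy of the per-partition sizes, e.g. to be encoded into an event payload at close.
  long[] snapshot() {
    return Arrays.copyOf(sizePerPartition, sizePerPartition.length);
  }

  // Sum over all partitions, comparable to an overall uncompressed output size.
  long totalSize() {
    long total = 0;
    for (long size : sizePerPartition) {
      total += size;
    }
    return total;
  }
}
```

Because the sizes are counted before compression, the per-partition values add up to the same kind of number as the overall uncompressed output size, which is the consistency mentioned above.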

The patch also improves the test coverage for scenarios such as large records 
and pipelined shuffle.

bq. When using this, one thing to note would be the possibility of repetition 
of data from the same task in case of retries.

ShuffleVertexManager will ignore subsequent data from the same task. It also 
means we might need to add better support for the pipelined shuffle case, 
although that is a separate issue to fix in ShuffleVertexManager, one that 
applies to both sorted and unsorted scenarios.

{noformat}
    // currently events from multiple attempts of the same task can be ignored because
    // their output will be the same. However, with pipelined events that may not hold.
    TaskIdentifier producerTask = vmEvent.getProducerAttemptIdentifier().getTaskIdentifier();
    if (!taskWithVmEvents.add(producerTask)) {
      LOG.info("Ignoring vertex manager event from: " + producerTask);
      return;
    }
{noformat}
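
The effect of that check can be sketched in isolation. The class below is a hypothetical stand-in rather than ShuffleVertexManager code: tasks are identified by a plain integer index and each event carries only an output size.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the handler state; names are illustrative, not Tez APIs.
final class TaskEventDeduplicator {
  private final Set<Integer> tasksWithEvents = new HashSet<>();
  private long totalOutputSize = 0;

  // Returns true if the event was counted, false if it was ignored because
  // an event from the same task (any attempt) has already been seen.
  boolean onEvent(int taskIndex, int attemptNumber, long outputSize) {
    // Set.add returns false when the element is already present.
    if (!tasksWithEvents.add(taskIndex)) {
      return false;
    }
    totalOutputSize += outputSize;
    return true;
  }

  long getTotalOutputSize() {
    return totalOutputSize;
  }
}
```

With this logic a retry of the same task is not double counted, which is the desired behavior. If a single attempt sends several events, as could happen with pipelined shuffle, every event after the first is dropped as well and the total is undercounted, which is the gap noted above.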


> Have unordered partitioned KV output send partition stats via 
> VertexManagerEvent 
> ---------------------------------------------------------------------------------
>
>                 Key: TEZ-3206
>                 URL: https://issues.apache.org/jira/browse/TEZ-3206
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: TEZ-3206-2.patch, TEZ-3206.patch
>
>
> As part of the auto-parallelism feature, ordered partitioned KV output's 
> partition stats are sent to ShuffleVertexManager via VertexManagerEvent. But 
> this isn't available for unordered partitioned output. Having 
> {{UnorderedPartitionedKVWriter}} send partition stats will enable the 
> auto-parallelism support for unordered KV or other custom data routing 
> mechanisms that depend on partition size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
