[
https://issues.apache.org/jira/browse/TEZ-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490282#comment-14490282
]
Bikas Saha commented on TEZ-2234:
---------------------------------
Will add annotations.
getDataSize() is the logical data size as written by the user. The closest
thing to that is OUTPUT_BYTES. For many jobs the difference between the two is
large enough that perhaps we should look at reducing the overhead.
Yes, plugins are not getting task level info for now. Not needed for PIG-4434.
The docs specify that the values are point-in-time snapshots and may change
with progress/failures/refreshes.
This cannot get rid of VM events. There is no way to correlate tasks with
output size, so extrapolating the current output size to the final output size
from the ratio of completed tasks to total tasks does not work. The VM events
are therefore still needed until (if ever) we start exposing task level sizes.
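
As a rough illustration of why that extrapolation can mislead, here is a
minimal sketch. It is not Tez code: the class, the extrapolate() helper and
the per-task byte counts are all made up for the example, and it shows only
one way the estimate can go wrong (skewed task outputs that a vertex-level
total cannot reveal).

```java
public class NaiveExtrapolation {

    // Hypothetical helper, not a Tez API: scales the vertex-level output size
    // seen so far by totalTasks / completedTasks.
    public static long extrapolate(long observedBytes, int completedTasks, int totalTasks) {
        if (completedTasks <= 0) {
            throw new IllegalArgumentException("need at least one completed task");
        }
        return observedBytes * totalTasks / completedTasks;
    }

    public static void main(String[] args) {
        // Made-up per-task output sizes; the last task is heavily skewed.
        long[] taskOutputBytes = {10, 10, 10, 970};

        long actualTotal = 0;
        for (long bytes : taskOutputBytes) {
            actualTotal += bytes;
        }

        // Only the sum is visible at the vertex level once two tasks are done.
        int completedTasks = 2;
        long observed = taskOutputBytes[0] + taskOutputBytes[1];

        long estimate = extrapolate(observed, completedTasks, taskOutputBytes.length);
        System.out.println("estimate=" + estimate + " actual=" + actualTotal);
    }
}
```

With these invented numbers the estimate is 40 bytes against an actual total
of 1000. Without per-task sizes, the caller cannot tell whether the completed
tasks are representative of the rest.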
Thanks for the reviews!
> Allow vertex managers to get output size per source vertex
> ----------------------------------------------------------
>
> Key: TEZ-2234
> URL: https://issues.apache.org/jira/browse/TEZ-2234
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-2234.1.patch, TEZ-2234.2.patch, TEZ-2234.3.patch
>
>
> Vertex managers may need per source vertex output stats to make
> reconfiguration decisions.