[ https://issues.apache.org/jira/browse/TEZ-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291574#comment-14291574 ]
Bikas Saha commented on TEZ-1993:
---------------------------------
Would it make sense to have a TezInputSplit that derives from the mapred
InputSplit and provides additional methods like expectedSize and other stats,
so that we don't have to create more pluggable items going forward? The grouping
code can check whether the split is an instance of TezInputSplit and use its
methods if so.
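
A rough sketch of that idea, for illustration only. TezInputSplit and
expectedSize are the names proposed in the comment and are not an existing Tez
API. The InputSplit interface below is a simplified stand-in for
org.apache.hadoop.mapred.InputSplit (the real getLength() declares
IOException), and SplitSizes and sizeForGrouping are hypothetical names.

```java
// Simplified stand-in for org.apache.hadoop.mapred.InputSplit so the sketch
// compiles without Hadoop on the classpath. The real getLength() declares
// IOException; that is omitted here for brevity.
interface InputSplit {
    long getLength();
}

// Hypothetical interface from the proposal; not part of Tez.
interface TezInputSplit extends InputSplit {
    // Estimated size of the data once it is read, e.g. after decompression.
    long expectedSize();
}

// Hypothetical helper showing the check the grouping code could perform.
final class SplitSizes {
    private SplitSizes() {}

    // Prefer the split's own estimate when it offers one; otherwise fall
    // back to the at-rest length, which is what grouping uses today.
    static long sizeForGrouping(InputSplit split) {
        if (split instanceof TezInputSplit) {
            return ((TezInputSplit) split).expectedSize();
        }
        return split.getLength();
    }
}
```

Under these assumptions, existing splits keep working unchanged, and only the
formats that can estimate their size need to implement the extra interface.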
> Implement a pluggable InputSizeEstimator for grouping fairly
> ------------------------------------------------------------
>
> Key: TEZ-1993
> URL: https://issues.apache.org/jira/browse/TEZ-1993
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Gopal V
> Assignee: Gopal V
> Attachments: TEZ-1993.1.patch
>
>
> Split grouping is currently done using a file size measurement which is the
> exact size of the split as it stays at rest on HDFS.
> This is not valid for columnar formats and especially suffers from highly
> compressible data skews.