[ 
https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013138#comment-16013138
 ] 

Rohini Palaniswamy commented on TEZ-3391:
-----------------------------------------

Doing these two things will save a couple of milliseconds in each map vertex.

1) Move the validation checks to the AM.
2) In the vertex, construct TaskSplitMetaInfo only for the split of that task 
instead of constructing it for all splits, i.e. change
public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, FileSystem fs)
to
public static TaskSplitMetaInfo getSplitMetaInfo(Configuration conf, FileSystem fs, int index)
and stop reading once the split at that index has been read. If there are 1000 
splits, the first task will read 1 split, the second task will read 2 splits, 
and so on, instead of each task reading all 1000 splits as happens now.

SplitMetaInfoReaderTez.java
{code}
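try {
  // One reusable object; each readFields() call overwrites it with the next entry.
  JobSplit.SplitMetaInfo splitMetaInfo = new JobSplit.SplitMetaInfo();
  for (int i = 0; i < numSplits; i++) {
    splitMetaInfo.readFields(in);
    if (i == index) {
      // splitIndex is assumed to be built for this entry the same way
      // readSplitMetaInfo builds it today (omitted here).
      return new JobSplit.TaskSplitMetaInfo(splitIndex,
          splitMetaInfo.getLocations(), splitMetaInfo.getInputDataLength());
    }
  }
} finally {
  in.close();
}
// Reached only if index >= numSplits.
throw new IOException("No split meta info found at index " + index);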
{code}
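
To illustrate the idea outside of Tez, here is a minimal, self-contained sketch of reading only up to a given index from a stream of variable-length records. It does not use the Hadoop classes: Entry, write, readOne and readEntryAt are hypothetical names, and the byte layout is made up for the example rather than being the real split meta info format. Because the entries have variable length, the reader still has to parse entries 0..index sequentially, but it never touches the ones after it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SplitMetaSketch {

    /** Hypothetical stand-in for one split's metadata entry. */
    static final class Entry {
        final long startOffset;
        final long length;
        final String[] locations;

        Entry(long startOffset, long length, String[] locations) {
            this.startOffset = startOffset;
            this.length = length;
            this.locations = locations;
        }
    }

    /** Writes an entry count followed by variable-length entries (made-up layout). */
    static byte[] write(Entry[] entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(entries.length);
            for (Entry e : entries) {
                out.writeLong(e.startOffset);
                out.writeLong(e.length);
                out.writeInt(e.locations.length);
                for (String location : e.locations) {
                    out.writeUTF(location);
                }
            }
        }
        return bytes.toByteArray();
    }

    /** Parses the next entry from the stream. */
    static Entry readOne(DataInputStream in) throws IOException {
        long startOffset = in.readLong();
        long length = in.readLong();
        String[] locations = new String[in.readInt()];
        for (int i = 0; i < locations.length; i++) {
            locations[i] = in.readUTF();
        }
        return new Entry(startOffset, length, locations);
    }

    /** Parses entries 0..index and returns entry `index`; later entries are never read. */
    static Entry readEntryAt(byte[] data, int index) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int numSplits = in.readInt();
            if (index < 0 || index >= numSplits) {
                throw new IOException(
                    "Index " + index + " out of range for " + numSplits + " entries");
            }
            Entry entry = null;
            for (int i = 0; i <= index; i++) {
                entry = readOne(in);
            }
            return entry;
        }
    }

    public static void main(String[] args) throws IOException {
        Entry[] entries = {
            new Entry(0L, 100L, new String[] {"host1"}),
            new Entry(100L, 250L, new String[] {"host2", "host3"}),
            new Entry(350L, 50L, new String[] {"host4"}),
        };
        Entry second = readEntryAt(write(entries), 1);
        System.out.println(second.startOffset + " " + second.length
            + " " + second.locations.length);
    }
}
```

In this sketch the range check happens before any entry is parsed, so an out-of-range index fails fast instead of falling through the loop.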


> MR split file validation should be done in the AM
> -------------------------------------------------
>
>                 Key: TEZ-3391
>                 URL: https://issues.apache.org/jira/browse/TEZ-3391
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>
>   We had a case where the split metadata size exceeded 10000000. Instead of 
> the job failing validation during AM initialization, as it does in MapReduce, 
> each of the tasks failed that validation during its own initialization.
>   



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
