[ 
https://issues.apache.org/jira/browse/TEZ-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348059#comment-14348059
 ] 

Jeff Zhang edited comment on TEZ-2162 at 3/5/15 3:26 AM:
---------------------------------------------------------

Uploaded a new patch (adds unit tests for MROutput & MROutputConfigBuilder).

bq. For the case of a map-only MR job, does this work? i.e. should the 
use-new-api check be based on whether it is a mapper/reducer? In any case, 
relying on one of the two is probably not useful for Pig/Hive, which don't use 
MR processors. Is there a way to explicitly configure which api via the config 
builder?
It doesn't matter whether MROutput is in a mapper or a reducer. useNewApi is 
determined by the OutputFormat the user specifies when creating MROutput, and 
MROutput ensures that the useNewApi field is consistent with the actual 
OutputFormat. The useNewApi field and the OutputFormat class are wrapped in the 
payload that is passed to the task. At runtime, MROutput uses useNewApi to 
initialize the correct OutputFormat class. Here is how MROutput creates the 
payload (it sets the property mapred.reducer.new-api in the payload, so we 
should also use this property at runtime):
{code}
    private UserPayload createUserPayload() {
      if (outputFormatProvided) {
        conf.setBoolean(MRJobConfig.NEW_API_REDUCER_CONFIG, useNewApi);
        if (useNewApi) {
          conf.set(MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR,
              outputFormat.getName());
        } else {
          conf.set("mapred.output.format.class", outputFormat.getName());
        }
      }
      MRHelpers.translateMRConfToTez(conf);
      try {
        return TezUtils.createUserPayloadFromConf(conf);
      } catch (IOException e) {
        throw new TezUncheckedException(e);
      }
    }
{code}

There is one exceptional case, though: if the user specifies the OutputFormat 
through conf, then even when MROutput is on the mapper side, the user still 
needs to set MRJobConfig.NEW_API_REDUCER_CONFIG rather than 
NEW_API_MAPPER_CONFIG.
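
To make the flag/key pairing concrete, here is a minimal, dependency-free 
sketch. It is not the Tez implementation: the stand-in OutputFormat types, the 
class OutputApiSketch, and the methods buildConf and resolveOutputFormat are 
hypothetical, a plain Map stands in for Configuration, and deriving useNewApi 
with isAssignableFrom is an assumption about how the consistency check could 
be done. The property names are the ones behind the MRJobConfig constants and 
the old-API key mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

final class OutputApiSketch {
  // Hypothetical stand-ins for the two Hadoop OutputFormat hierarchies.
  abstract static class NewApiOutputFormat {}  // org.apache.hadoop.mapreduce.OutputFormat
  interface OldApiOutputFormat {}              // org.apache.hadoop.mapred.OutputFormat
  static final class NewSequenceFileOutputFormat extends NewApiOutputFormat {}
  static final class OldSequenceFileOutputFormat implements OldApiOutputFormat {}

  static final String NEW_API_REDUCER_CONFIG = "mapred.reducer.new-api";
  static final String NEW_API_MAPPER_CONFIG = "mapred.mapper.new-api";
  static final String NEW_OUTPUT_FORMAT_KEY = "mapreduce.job.outputformat.class";
  static final String OLD_OUTPUT_FORMAT_KEY = "mapred.output.format.class";

  // Builder side: derive useNewApi from the class itself (assumed approach)
  // and store the flag together with the matching output-format key.
  static Map<String, String> buildConf(Class<?> outputFormat) {
    boolean useNewApi;
    if (NewApiOutputFormat.class.isAssignableFrom(outputFormat)) {
      useNewApi = true;
    } else if (OldApiOutputFormat.class.isAssignableFrom(outputFormat)) {
      useNewApi = false;
    } else {
      throw new IllegalArgumentException(
          "Not an OutputFormat: " + outputFormat.getName());
    }
    Map<String, String> conf = new HashMap<>();
    conf.put(NEW_API_REDUCER_CONFIG, Boolean.toString(useNewApi));
    conf.put(useNewApi ? NEW_OUTPUT_FORMAT_KEY : OLD_OUTPUT_FORMAT_KEY,
        outputFormat.getName());
    return conf;
  }

  // Runtime side: only the reducer flag decides which key is read.
  // Returns null when that key is unset.
  static String resolveOutputFormat(Map<String, String> conf) {
    boolean useNewApi = Boolean.parseBoolean(conf.get(NEW_API_REDUCER_CONFIG));
    return conf.get(useNewApi ? NEW_OUTPUT_FORMAT_KEY : OLD_OUTPUT_FORMAT_KEY);
  }

  public static void main(String[] args) {
    // Builder path: flag and key are always consistent.
    Map<String, String> built = buildConf(NewSequenceFileOutputFormat.class);
    System.out.println(resolveOutputFormat(built));

    // Conf path with only the mapper flag set: the new-API key is ignored.
    Map<String, String> byHand = new HashMap<>();
    byHand.put(NEW_API_MAPPER_CONFIG, "true");
    byHand.put(NEW_OUTPUT_FORMAT_KEY, NewSequenceFileOutputFormat.class.getName());
    System.out.println(resolveOutputFormat(byHand));
  }
}
```

In this sketch the second lookup finds nothing because the reducer flag was 
never set, which is the situation the exceptional case above warns about.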









> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat is not recognized
> ---------------------------------------------------------------------------------
>
>                 Key: TEZ-2162
>                 URL: https://issues.apache.org/jira/browse/TEZ-2162
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.3
>            Reporter: Oleg Zhurakousky
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: TEZ-2162-1.patch, TEZ-2162-2.patch
>
>
> {code}
> DataSinkDescriptor dataSink = MROutput.createConfigBuilder(dsConfig,
>     outputFormatClass, outputPath).build();
> {code}
> If the output format class is 
> _org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat_, I end up 
> with _TextOutputFormat_; however, if it is 
> _org.apache.hadoop.mapred.SequenceFileOutputFormat_, then all is good.
> For now that can be a workaround for dealing with SequenceFiles, but I think 
> that is the old API, and it seems like Tez is having some issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
