[
https://issues.apache.org/jira/browse/TEZ-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376138#comment-15376138
]
ASF GitHub Bot commented on TEZ-3348:
-------------------------------------
GitHub user piyushnarang opened a pull request:
https://github.com/apache/tez/pull/11
TEZ-3348: NullPointerException in Tez MROutput while trying to write using
Parquet's DeprecatedParquetOutputFormat
Proposed fix for the reported JIRA. Added a couple of unit tests as well.
It seems this isn't an issue if you use the new APIs, as they tend to use
`FileOutputFormat.getDefaultWorkFile`, which doesn't check the workOutputPath.
With the old APIs, though, the unit test fails without this fix.
I added a unit test for the new API for completeness.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/piyushnarang/tez master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tez/pull/11.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11
----
commit 6f3e0f4f5718c01f247915f1b84e28c75b2dc83b
Author: Piyush Narang <[email protected]>
Date: 2016-07-14T01:13:45Z
Move initCommitter call up in MROutput
----
> NullPointerException in Tez MROutput while trying to write using Parquet's
> DeprecatedParquetOutputFormat
> --------------------------------------------------------------------------------------------------------
>
> Key: TEZ-3348
> URL: https://issues.apache.org/jira/browse/TEZ-3348
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Piyush Narang
>
> I'm trying to run some Tez MR jobs that write data to HDFS using Parquet.
> When I do so, I end up seeing an NPE in the Parquet code:
> {code}
> java.lang.NullPointerException
> at org.apache.hadoop.fs.Path.<init>(Path.java:105)
> at org.apache.hadoop.fs.Path.<init>(Path.java:94)
> at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
> at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.access$100(DeprecatedParquetOutputFormat.java:36)
> at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat$RecordWriterWrapper.<init>(DeprecatedParquetOutputFormat.java:89)
> at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getRecordWriter(DeprecatedParquetOutputFormat.java:77)
> at org.apache.tez.mapreduce.output.MROutput.initialize(MROutput.java:416)
> {code}
> The flow seems to be:
> 1) The Parquet deprecated output format class tries to read the
> workOutputPath -
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/mapred/DeprecatedParquetOutputFormat.java#L69
> 2) This calls FileOutputFormat.getWorkOutputPath(...) -
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileOutputFormat.java#L229
> 3) That in turn tries to read the property named by the
> JobContext.TASK_OUTPUT_DIR constant ("mapreduce.task.output.dir").
> 4) This ends up being null, and in the Parquet code we end up with an NPE in
> the Path class.
> Looking at the Tez code, we are setting the workOutputPath in the
> MROutput.initCommitter method -
> https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/output/MROutput.java#L445.
>
> This call, however, is made after the workOutputPath is accessed as part of
> outputFormat.getRecordWriter().
> I tried out a run where I moved this initCommitter call up:
> {code}
> else {
>   oldApiTaskAttemptContext =
>       new org.apache.tez.mapreduce.hadoop.mapred.TaskAttemptContextImpl(
>           jobConf, taskAttemptId,
>           new MRTaskReporter(getContext()));
>   initCommitter(jobConf, useNewApi); // before the getRecordWriter call
>   oldOutputFormat = jobConf.getOutputFormat();
>   outputFormatClassName = oldOutputFormat.getClass().getName();
>   FileSystem fs = FileSystem.get(jobConf);
>   String finalName = getOutputName();
>   oldRecordWriter =
>       oldOutputFormat.getRecordWriter(
>           fs, jobConf, finalName, new MRReporter(getContext().getCounters()));
> }
> {code}
> I tried out a run with this and it seems to succeed. If this sounds
> reasonable, I can cut a PR.
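
The ordering problem described in the issue can be illustrated with a small Hadoop-free model. This is only a sketch under stated assumptions: the class `WorkOutputPathOrdering` and its methods are hypothetical stand-ins named after the Hadoop and Tez methods they imitate, a plain `Map` replaces `JobConf`, a `String` replaces `Path`, and the work directory value is made up. It is not the code from the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hadoop-free model of the call ordering described in TEZ-3348 (hypothetical). */
public class WorkOutputPathOrdering {
    // Property name cited in the issue for JobContext.TASK_OUTPUT_DIR.
    public static final String TASK_OUTPUT_DIR = "mapreduce.task.output.dir";

    /** Stand-in for FileOutputFormat.getWorkOutputPath: null when the key is unset. */
    public static String getWorkOutputPath(Map<String, String> conf) {
        return conf.get(TASK_OUTPUT_DIR);
    }

    /** Stand-in for building a child path under the work path; a null parent throws. */
    public static String getDefaultWorkFile(Map<String, String> conf, String name) {
        String parent = getWorkOutputPath(conf);
        if (parent == null) {
            throw new NullPointerException("work output path is not set");
        }
        return parent + "/" + name;
    }

    /** Stand-in for initCommitter: the step that sets the work output path. */
    public static void initCommitter(Map<String, String> conf, String workDir) {
        conf.put(TASK_OUTPUT_DIR, workDir);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Original order: the record writer asks for the work file first.
        try {
            getDefaultWorkFile(conf, "part-00000");
        } catch (NullPointerException e) {
            System.out.println("record writer before initCommitter: NPE");
        }

        // Proposed order: set the work output path, then create the writer.
        initCommitter(conf, "/tmp/out/_temporary/attempt_0"); // made-up directory
        System.out.println(getDefaultWorkFile(conf, "part-00000"));
    }
}
```

In this model the only difference between the failing and the succeeding call is whether `initCommitter` has already run, which mirrors the reordering proposed above.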
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)