[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

liyunzhang (JIRA) Wed, 24 Jan 2018 17:53:09 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338586#comment-16338586
 ]


liyunzhang commented on HIVE-18301:
-----------------------------------

[~lirui]:  {quote}
does the proposed solution mean we need to cache the input path for each record 
of the table?

{quote}
yes.

{quote}
I wonder whether we can reuse the Text for same input paths. Besides, it's not 
efficient to check the job conf each time we process a row. You can just check 
it once and remember the value.
{quote}
OK, I will reuse to reduce some computation.
[~lirui], [~xuefuz],[~csun]:

The concern to this solution is 
1. it may make the RDD bigger( I don't compare the original tuple._1() and 
inputPath, will do it later)
2. I modify the {{HadoopRDD-CopyRDD-MT-11}}, here CopyRDD does not store the 
original data anymore, CopyRDD store the inputPath and orignal data.  It maybe 
confusing.

Do you think idea is ok to commit, if not ok, maybe I need to try other 
solution or can you provide other better solution? Thanks!

> Investigate to enable MapInput cache in Hive on Spark
> -----------------------------------------------------
>
>                 Key: HIVE-18301
>                 URL: https://issues.apache.org/jira/browse/HIVE-18301
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>            Priority: Major
>         Attachments: HIVE-18301.patch
>
>
> Before IOContext problem is found in MapTran when spark rdd cache is enabled 
> in HIVE-8920.
> so we disabled rdd cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
>   The problem is IOContext seems not initialized correctly in the spark yarn 
> client/cluster mode and caused the exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>       at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>       at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>       at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>       at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>       at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>       at org.apache.spark.scheduler.Task.run(Task.scala:85)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>       at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>       at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>       at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>       at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>       at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>       ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when rdd cach is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

Reply via email to