[
https://issues.apache.org/jira/browse/HIVE-24706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389816#comment-17389816
]
Paul Lysak commented on HIVE-24706:
-----------------------------------
The problem is that `HiveHBaseTableInputFormat` doesn't properly implement
`org.apache.hadoop.mapreduce.InputFormat`.
We are seeing this exception as well, and it appears that this bug makes it
impossible to read any HBase-backed Hive table from Spark 3.x.
The issue was originally described here:
https://issues.apache.org/jira/browse/SPARK-26630.
A bit of analysis: `org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat`
implements `org.apache.hadoop.mapreduce.InputFormat`
but it doesn't override `getSplits(JobContext context)`, unlike
`getSplits(final JobConf jobConf, final int numSplits)` from the old interface
`org.apache.hadoop.mapred.InputFormat`, which it does override.
The new-API call is therefore delegated to the superclass
(`TableInputFormatBase` in the stack trace below), which doesn't initialize
the table properly.
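To make the delegation problem concrete, here is a minimal sketch. All class
names in it (`NewApiBase`, `OldApiOnlyFormat`, `BothApisFormat`) are simplified
stand-ins invented for illustration; they are not the real Hadoop, HBase, or
Hive classes, and the "fix" shown is only one possible approach, not
necessarily what a Hive patch would do.

```java
import java.util.Collections;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT the real
// Hadoop, HBase, or Hive classes.
final class SplitDelegationSketch {

    // Plays the role of the new-API base class (cf. TableInputFormatBase).
    static class NewApiBase {
        private String table; // null until initializeTable is called

        protected void initializeTable(String name) {
            this.table = name;
        }

        // New-API entry point: fails if the table was never initialized.
        public List<String> getSplits(String jobContext) {
            if (table == null) {
                throw new IllegalStateException(
                    "The input format instance has not been properly initialized.");
            }
            return Collections.singletonList(table + "-split-0");
        }
    }

    // Mirrors the reported situation: only the old-API method sets up the table.
    static class OldApiOnlyFormat extends NewApiBase {
        // Old-API entry point (cf. getSplits(JobConf, int)).
        public List<String> getSplits(String jobConf, int numSplits) {
            initializeTable("t");
            return super.getSplits(jobConf);
        }
    }

    // One possible fix (an assumption): also override the new-API method
    // and initialize the table there.
    static class BothApisFormat extends OldApiOnlyFormat {
        @Override
        public List<String> getSplits(String jobContext) {
            initializeTable("t");
            return super.getSplits(jobContext);
        }
    }

    // Returns true if the new-API call fails with IllegalStateException.
    static boolean newApiFails(NewApiBase format) {
        try {
            format.getSplits("ctx");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("old API only, new-API call fails: "
            + newApiFails(new OldApiOnlyFormat()));
        System.out.println("both APIs, new-API call fails: "
            + newApiFails(new BothApisFormat()));
    }
}
```

In this sketch a caller that uses the two-argument method never notices the
problem, while a caller that uses the one-argument method hits the
uninitialized state directly.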
Prior to version 3.0, Spark read these tables through `HadoopRDD`, which uses
the old interface `org.apache.hadoop.mapred.InputFormat`; that interface is
implemented correctly in `HiveHBaseTableInputFormat`.
Since Spark 3.0, the read goes through `NewHadoopRDD` (see the stack trace),
which relies on the new interface `org.apache.hadoop.mapreduce.InputFormat`
for getting the splits, and that implementation in
`HiveHBaseTableInputFormat` is broken: it doesn't initialize the table
properly.
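Because the class belongs to both type hierarchies, a type check alone cannot
tell a caller which `getSplits` is actually usable. A small reflection sketch
of that situation follows; `OldInputFormat`, `NewInputFormat`, and
`DualFormat` are hypothetical stand-ins (an interface for the old API and an
abstract class for the new one), and this is not Spark's actual selection
code.

```java
// Hypothetical stand-ins for the two Hadoop InputFormat APIs.
final class DualApiCheckSketch {

    // cf. org.apache.hadoop.mapred.InputFormat (an interface)
    interface OldInputFormat {}

    // cf. org.apache.hadoop.mapreduce.InputFormat (an abstract class)
    static abstract class NewInputFormat {}

    // Extends the new-API base and implements the old-API interface,
    // which is the shape described for HiveHBaseTableInputFormat.
    static class DualFormat extends NewInputFormat implements OldInputFormat {}

    static boolean isOldApi(Class<?> c) {
        return OldInputFormat.class.isAssignableFrom(c);
    }

    static boolean isNewApi(Class<?> c) {
        return NewInputFormat.class.isAssignableFrom(c);
    }

    public static void main(String[] args) {
        // Both checks succeed for the same class.
        System.out.println("old API: " + isOldApi(DualFormat.class));
        System.out.println("new API: " + isNewApi(DualFormat.class));
    }
}
```

Whichever check a caller evaluates first decides which code path is taken for
such a class, so the order of the checks matters.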
Here's the excerpt of the exception stacktrace we're getting:
{code:java}
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2621)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2610)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:557)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:248)
... 37 more
21/07/28 10:04:16 ERROR ApplicationMaster: User class threw exception: java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:253)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296){code}
> Spark SQL access hive on HBase table access exception
> -----------------------------------------------------
>
> Key: HIVE-24706
> URL: https://issues.apache.org/jira/browse/HIVE-24706
> Project: Hive
> Issue Type: Bug
> Components: HBase Handler
> Reporter: zhangzhanchang
> Priority: Major
> Attachments: image-2021-01-30-15-51-58-665.png
>
>
> HiveHBaseTableInputFormat relies on two versions of InputFormat: one is
> org.apache.hadoop.mapred.InputFormat, the other is
> org.apache.hadoop.mapreduce.InputFormat. This causes both conditions in
> Spark 3.0 (https://github.com/apache/spark/pull/31302) to be true:
> # classOf[oldInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true
> # classOf[newInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true
> !image-2021-01-30-15-51-58-665.png|width=430,height=137!
> Should the InputFormat that HiveHBaseTableInputFormat relies on be changed
> to org.apache.hadoop.mapreduce or to org.apache.hadoop.mapred?
>