[ 
https://issues.apache.org/jira/browse/HIVE-24706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389816#comment-17389816
 ] 

Paul Lysak commented on HIVE-24706:
-----------------------------------

The problem is that `HiveHBaseTableInputFormat` doesn't properly implement 
`org.apache.hadoop.mapreduce.InputFormat`.
We are seeing the same exception, and because of this bug it appears to be 
impossible to read any HBase-backed Hive table in Spark 3.x.
The issue was originally described here: 
https://issues.apache.org/jira/browse/SPARK-26630.

A bit of analysis: `org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat` 
implements `org.apache.hadoop.mapreduce.InputFormat` but doesn't override 
`getSplits(JobContext context)`, unlike 
`getSplits(final JobConf jobConf, final int numSplits)` from the old interface 
`org.apache.hadoop.mapred.InputFormat`. The new-API call is therefore 
delegated to the superclass, which doesn't initialize the table properly.
Prior to version 3.0, Spark read these tables through `HadoopRDD`, which uses 
the old interface `org.apache.hadoop.mapred.InputFormat`, and that interface 
is implemented correctly in `HiveHBaseTableInputFormat`.
Spark 3.0 uses `NewHadoopRDD` here instead, which relies on the new interface 
`org.apache.hadoop.mapreduce.InputFormat` for getting the splits, and that 
implementation in `HiveHBaseTableInputFormat` is broken: it never initializes 
the table.
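
To make the delegation problem concrete, here is a minimal, self-contained 
sketch of the pattern. Every class in it (`DualApiSketch`, `OldInputFormat`, 
`NewInputFormatBase`, `OldOnlyFormat`, `BothApisFormat`, and the simplified 
`JobConf`/`JobContext`) is a hypothetical stand-in written for illustration, 
not the real Hadoop, HBase, or Hive code. `BothApisFormat` shows only one 
possible direction for a fix, assuming the table can be initialized from the 
job context; it is not necessarily what a Hive patch would do.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins; none of these are real Hadoop, HBase, or Hive classes.
class DualApiSketch {

    static final class JobConf {      // stand-in for the old-API job configuration
        final String tableName;
        JobConf(String tableName) { this.tableName = tableName; }
    }

    static final class JobContext {   // stand-in for the new-API job context
        final String tableName;
        JobContext(String tableName) { this.tableName = tableName; }
    }

    // Models the old API (org.apache.hadoop.mapred.InputFormat).
    interface OldInputFormat {
        List<String> getSplits(JobConf conf, int numSplits);
    }

    // Models the new-API superclass: getSplits fails unless the table was initialized.
    abstract static class NewInputFormatBase {
        private String table;

        protected void initializeTable(String name) { this.table = name; }

        protected String getTable() {
            if (table == null) {
                throw new IllegalStateException("input format has not been properly initialized");
            }
            return table;
        }

        public List<String> getSplits(JobContext context) {
            return Collections.singletonList(getTable() + "#split-0");
        }
    }

    // Mirrors the reported situation: only the old-API method sets up the table.
    static class OldOnlyFormat extends NewInputFormatBase implements OldInputFormat {
        @Override
        public List<String> getSplits(JobConf conf, int numSplits) {
            initializeTable(conf.tableName);
            return Collections.singletonList(getTable() + "#split-0");
        }
    }

    // One possible direction: also override the new-API method and initialize first.
    static class BothApisFormat extends OldOnlyFormat {
        @Override
        public List<String> getSplits(JobContext context) {
            initializeTable(context.tableName);
            return super.getSplits(context);
        }
    }

    public static void main(String[] args) {
        // Old API path: works, because the override initializes the table.
        System.out.println(new OldOnlyFormat().getSplits(new JobConf("t"), 1));
        try {
            // New API path: falls through to the base class and fails.
            new OldOnlyFormat().getSplits(new JobContext("t"));
        } catch (IllegalStateException e) {
            System.out.println("new API failed: " + e.getMessage());
        }
        // New API path with the extra override: works.
        System.out.println(new BothApisFormat().getSplits(new JobContext("t")));
    }
}
```

In this model the old-API caller plays the role of `HadoopRDD` and the 
new-API caller plays the role of `NewHadoopRDD`: the same input format object 
works through one entry point and throws through the other.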

Here's an excerpt of the exception stack trace we're getting:
{code:java}
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2621)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2610)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:557)
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:248)
    ... 37 more
21/07/28 10:04:16 ERROR ApplicationMaster: User class threw exception: java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:253)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
{code}
 

> Spark SQL access hive on HBase table access exception
> -----------------------------------------------------
>
>                 Key: HIVE-24706
>                 URL: https://issues.apache.org/jira/browse/HIVE-24706
>             Project: Hive
>          Issue Type: Bug
>          Components: HBase Handler
>            Reporter: zhangzhanchang
>            Priority: Major
>         Attachments: image-2021-01-30-15-51-58-665.png
>
>
> HiveHBaseTableInputFormat relies on two versions of InputFormat: one is 
> org.apache.hadoop.mapred.InputFormat, the other is 
> org.apache.hadoop.mapreduce.InputFormat. This causes both conditions in 
> Spark 3.0 (https://github.com/apache/spark/pull/31302) to be true:
>  # classOf[oldInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true
>  # classOf[newInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true
> !image-2021-01-30-15-51-58-665.png|width=430,height=137!
> Should HiveHBaseTableInputFormat be changed to rely on the InputFormat from 
> org.apache.hadoop.mapreduce or from org.apache.hadoop.mapred?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
