GitHub user weiqingy opened a pull request:
https://github.com/apache/spark/pull/17989
[SPARK-6628][SQL] Fix ClassCastException when executing sql statement
'insert into' on hbase table
## What changes were proposed in this pull request?
The major issue of SPARK-6628 is:
```
org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat
```
cannot be cast to
```
org.apache.hadoop.hive.ql.io.HiveOutputFormat
```
The reason is:
```
public interface HiveOutputFormat<K, V> extends OutputFormat<K, V> {...}
public class HiveHBaseTableOutputFormat extends
    TableOutputFormat<ImmutableBytesWritable> implements
    OutputFormat<ImmutableBytesWritable, Object> {...}
```
From the two snippets above, we can see that `HiveHBaseTableOutputFormat`
and `HiveOutputFormat` both extend or implement `OutputFormat`, but neither is
a subtype of the other, so an instance of one cannot be cast to the other.
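The failing cast can be reproduced without Hive or HBase on the classpath. The following is a minimal sketch, assuming simplified stand-in types: the traits and classes inside `SiblingCastDemo` are hypothetical and only mirror the shape of the hierarchy quoted above, they are not the real Hadoop/Hive classes.

```scala
object SiblingCastDemo {
  // Hypothetical stand-ins that mirror only the shape of the real hierarchy.
  trait OutputFormat[K, V]
  trait HiveOutputFormat[K, V] extends OutputFormat[K, V]
  class TableOutputFormat[K]
  class HiveHBaseTableOutputFormat
    extends TableOutputFormat[String] with OutputFormat[String, AnyRef]

  // True only when the runtime object actually implements HiveOutputFormat.
  def isHiveFormat(format: OutputFormat[_, _]): Boolean =
    format.isInstanceOf[HiveOutputFormat[_, _]]

  // Performs the unconditional cast and reports whether it threw.
  def unconditionalCastFails(format: OutputFormat[_, _]): Boolean =
    try {
      val hive: HiveOutputFormat[AnyRef, AnyRef] =
        format.asInstanceOf[HiveOutputFormat[AnyRef, AnyRef]]
      false
    } catch {
      case _: ClassCastException => true
    }

  def main(args: Array[String]): Unit = {
    val format = new HiveHBaseTableOutputFormat
    println(s"is a HiveOutputFormat: ${isHiveFormat(format)}")
    println(s"unconditional cast fails: ${unconditionalCastFails(format)}")
  }
}
```

Both types share `OutputFormat` as a parent, but the stand-in `HiveHBaseTableOutputFormat` never mixes in `HiveOutputFormat`, so the runtime check is negative and the cast throws.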
Spark initializes `outputFormat` in `SparkHiveWriterContainer` in Spark
1.6, 2.0, and 2.1 (or in `HiveFileFormat` in Spark 2.2 / master):
```
@transient private lazy val outputFormat =
  jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]
```
Note that this cast assumes the output format is a `HiveOutputFormat`. However,
when users write data to HBase, the output format is
`HiveHBaseTableOutputFormat`, which is not an instance of `HiveOutputFormat`.
This PR sets `outputFormat` to `null` when the `OutputFormat` is
not an instance of `HiveOutputFormat`. `outputFormat` is only used to get the
file extension in the function `getFileExtension()`.
Spark 2.x also has this issue, so this PR can also be submitted against the master branch.
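The guard described above can be sketched as a pattern match that falls back to `null`. This is an illustration under stated assumptions, not the patch itself: the types inside `GuardedCastSketch` and the helper name `hiveFormatOrNull` are hypothetical stand-ins so that the example runs without Hive on the classpath.

```scala
object GuardedCastSketch {
  // Hypothetical stand-ins for the Hadoop/Hive types.
  trait OutputFormat[K, V]
  trait HiveOutputFormat[K, V] extends OutputFormat[K, V]
  class PlainHiveFormat extends HiveOutputFormat[AnyRef, AnyRef]
  class HBaseLikeFormat extends OutputFormat[AnyRef, AnyRef]

  // Returns the format typed as HiveOutputFormat when it is one, otherwise null,
  // instead of casting unconditionally and risking a ClassCastException.
  def hiveFormatOrNull(format: OutputFormat[_, _]): HiveOutputFormat[AnyRef, AnyRef] =
    format match {
      case hive: HiveOutputFormat[_, _] =>
        hive.asInstanceOf[HiveOutputFormat[AnyRef, AnyRef]]
      case _ => null
    }

  def main(args: Array[String]): Unit = {
    println(hiveFormatOrNull(new PlainHiveFormat) != null)
    println(hiveFormatOrNull(new HBaseLikeFormat) == null)
  }
}
```

With this shape, any code that later reads the value has to tolerate `null`; wrapping the result in `Option(...)` at the call site is one way to make that explicit.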
## How was this patch tested?
Manually tested.
**Before:**
A user was trying to write to a Hive-HBase table from Spark SQL using
`hiveContext`, and the write failed with the error below:
```
17/03/30 20:26:08 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/03/30 20:26:08 INFO ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25acf50c46d05ce
17/03/30 20:26:08 INFO ZooKeeper: Session: 0x25acf50c46d05ce closed
17/03/30 20:26:08 INFO ClientCnxn: EventThread shut down
17/03/30 20:26:08 INFO ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x35acf50c63305c7
17/03/30 20:26:08 INFO ZooKeeper: Session: 0x35acf50c63305c7 closed
17/03/30 20:26:08 INFO ClientCnxn: EventThread shut down
17/03/30 20:26:08 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
    at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:74)
    at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:73)
    at org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:93)
    at org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:119)
    at org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:86)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:102)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/03/30 20:26:08 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, localhost): java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
```
Below is the CREATE TABLE script:
```
CREATE TABLE `0bq_cntl.spark_load_cntl_stats`(
  `row_key` string COMMENT 'from deserializer',
  `application` string COMMENT 'from deserializer',
  `starttime` timestamp COMMENT 'from deserializer',
  `endtime` timestamp COMMENT 'from deserializer',
  `status` string COMMENT 'from deserializer',
  `statusid` smallint COMMENT 'from deserializer',
  `insertdate` timestamp COMMENT 'from deserializer',
  `count` int COMMENT 'from deserializer',
  `errordesc` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'='cf1:application,cf1:starttime,cf1:endtime,cf1:Status,cf1:StatusId,cf1:InsertDate,cf1:count,cf1:ErrorDesc',
  'line.delim'='\n',
  'mapkey.delim'='\u0003',
  'serialization.format'='\u0001')
TBLPROPERTIES (
  'transient_lastDdlTime'='1489696241')
```
Below is the query run through Spark SQL:
```
val df = sqlContext.sql(
  """Insert into table db1.spark_load_cntl_stats select
    |'AAM-846d55f6-0ffe-4694-b37a-1637a58f34f2','AAM','2017-03-21 04:03:01',
    |'2017-03-21 04:03:01','Started',45,'2017-03-21 04:03:01',1,'ad'""".stripMargin)
```
**After:**
The ClassCastException is gone and the insert succeeds.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/weiqingy/spark SPARK-6628
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17989.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17989
----
commit 0fa2bb791d1fa9c37fe89c1942ce0ed950a9ee59
Author: Weiqing Yang <[email protected]>
Date: 2017-05-16T00:12:16Z
[SPARK-6628][SQL] Fix ClassCastException when executing sql statement
'insert into' on hbase table
----