[GitHub] spark pull request #17989: [SPARK-6628][SQL] Fix ClassCastException when exe...

weiqingy Mon, 15 May 2017 17:29:32 -0700

GitHub user weiqingy opened a pull request:

    https://github.com/apache/spark/pull/17989


    [SPARK-6628][SQL] Fix ClassCastException when executing sql statement 
'insert into' on hbase table

    ## What changes were proposed in this pull request?
    
    The major issue of SPARK-6628 is:
    ```
    org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat 
    ```
    cannot be cast to
    ```
    org.apache.hadoop.hive.ql.io.HiveOutputFormat
    ```
    The reason is:
    ```
    public interface HiveOutputFormat<K, V> extends OutputFormat<K, V> {...}
    
    public class HiveHBaseTableOutputFormat extends
        TableOutputFormat<ImmutableBytesWritable> implements
        OutputFormat<ImmutableBytesWritable, Object> {...}
    ```
    From the two snippets above, we can see that `HiveOutputFormat` extends 
`OutputFormat` and `HiveHBaseTableOutputFormat` implements it, but neither is a 
subtype of the other, so one cannot be cast to the other.
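
    The sibling relationship can be reproduced without Hadoop or Hive on the 
classpath. The sketch below is only an illustration: `BaseFormat`, 
`HiveStyleFormat`, `HBaseStyleFormat` and `SiblingCastDemo` are hypothetical 
stand-ins, not the real classes.

```java
// Hypothetical stand-ins for the real Hadoop/Hive types (illustration only).

// Plays the role of OutputFormat<K, V>.
interface BaseFormat<K, V> {}

// Plays the role of HiveOutputFormat: a sub-interface of the base type.
interface HiveStyleFormat<K, V> extends BaseFormat<K, V> {}

// Plays the role of HiveHBaseTableOutputFormat: it implements the base type
// directly, so it is a sibling of HiveStyleFormat, not a subtype of it.
class HBaseStyleFormat implements BaseFormat<Object, Object> {}

class SiblingCastDemo {
    // Returns true when the downcast throws ClassCastException at runtime.
    static boolean castFails(BaseFormat<?, ?> format) {
        try {
            // Compiles, because a downcast from the shared supertype is legal,
            // but the runtime check fails for objects of the sibling type.
            HiveStyleFormat<?, ?> hive = (HiveStyleFormat<?, ?>) format;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

    The compiler accepts the cast because both types share a supertype; the 
failure only shows up at runtime, which matches the executor-side exception 
reported in SPARK-6628.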
    
    Spark initializes `outputFormat` in `SparkHiveWriterContainer` (Spark 1.6, 
2.0, 2.1) or in `HiveFileFormat` (Spark 2.2 / master):
    ```
    @transient private lazy val outputFormat =
      jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]
    ```
    Note that this cast assumes the output format is a `HiveOutputFormat`. 
However, when users write data to HBase, the output format is 
`HiveHBaseTableOutputFormat`, which is not an instance of `HiveOutputFormat`.
    
    This PR sets `outputFormat` to `null` when the `OutputFormat` is not an 
instance of `HiveOutputFormat`. `outputFormat` is only used to get the file 
extension in the function `getFileExtension()`.
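
    The guard described above can be sketched as a type test before the cast. 
This is a minimal illustration, not the code in this PR: `SinkFormat`, 
`FileSinkFormat`, `TableSinkFormat` and `GuardedCastDemo` are hypothetical 
names, and falling back to an empty extension is an assumption made for the 
example.

```java
// Hypothetical stand-ins; these are not the real Hadoop/Hive classes.

// Plays the role of OutputFormat.
interface SinkFormat {}

// Plays the role of HiveOutputFormat: file-based, so it can name an extension.
interface FileSinkFormat extends SinkFormat {
    String extension();
}

// Plays the role of HiveHBaseTableOutputFormat: writes to a table, not files.
class TableSinkFormat implements SinkFormat {}

class GuardedCastDemo {
    // Test the type before casting; null means "not a file-based format".
    static FileSinkFormat asFileFormat(SinkFormat format) {
        return (format instanceof FileSinkFormat) ? (FileSinkFormat) format : null;
    }

    // Assumption for this sketch: use an empty extension when there is no
    // file-based format to ask.
    static String fileExtension(SinkFormat format) {
        FileSinkFormat fileFormat = asFileFormat(format);
        return (fileFormat == null) ? "" : fileFormat.extension();
    }
}
```

    Because the only consumer of the value is the file-extension lookup, a 
null result only has to be handled in that one place.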
    
    Spark 2.x also has this issue, so this PR can also be submitted against the master branch.
    
    ## How was this patch tested?
    Manually tested.
    
    **Before:**
    
    A user was trying to write to a Hive-HBase table from Spark SQL using 
`hiveContext`, and the write failed with the error below:
    ```
    17/03/30 20:26:08 INFO FileOutputCommitter: FileOutputCommitter skip 
cleanup _temporary folders under output directory:false, ignore cleanup 
failures: false
    17/03/30 20:26:08 INFO ConnectionManager$HConnectionImplementation: Closing 
zookeeper sessionid=0x25acf50c46d05ce
    17/03/30 20:26:08 INFO ZooKeeper: Session: 0x25acf50c46d05ce closed
    17/03/30 20:26:08 INFO ClientCnxn: EventThread shut down
    17/03/30 20:26:08 INFO ConnectionManager$HConnectionImplementation: Closing 
zookeeper sessionid=0x35acf50c63305c7
    17/03/30 20:26:08 INFO ZooKeeper: Session: 0x35acf50c63305c7 closed
    17/03/30 20:26:08 INFO ClientCnxn: EventThread shut down
    17/03/30 20:26:08 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
    java.lang.ClassCastException: 
org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
org.apache.hadoop.hive.ql.io.HiveOutputFormat
        at 
org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:74)
        at 
org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:73)
        at 
org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:93)
        at 
org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:119)
        at 
org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:86)
        at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:102)
        at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
        at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    17/03/30 20:26:08 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, 
localhost): java.lang.ClassCastException: 
org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
org.apache.hadoop.hive.ql.io.HiveOutputFormat
    ```
    
    Below is the CREATE TABLE script:
    ```
    CREATE      TABLE `0bq_cntl.spark_load_cntl_stats`(  `row_key` string 
COMMENT 'from deserializer',
    `application` string COMMENT 'from deserializer',   `starttime` timestamp 
COMMENT 'from deserializer',
    `endtime` timestamp COMMENT 'from deserializer',   `status` string COMMENT 
'from deserializer',
    `statusid` smallint COMMENT 'from deserializer',   `insertdate` timestamp 
COMMENT 'from deserializer',
    `count` int COMMENT 'from deserializer',   `errordesc` string COMMENT 'from 
deserializer')
    ROW FORMAT SERDE  
     'org.apache.hadoop.hive.hbase.HBaseSerDe' 
    STORED BY
      'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 
  
'hbase.columns.mapping'='cf1:application,cf1:starttime,cf1:endtime,cf1:Status,cf1:StatusId,cf1:InsertDate,cf1:count,cf1:ErrorDesc',
    'line.delim'='\n',   'mapkey.delim'='\u0003',   
'serialization.format'='\u0001')TBLPROPERTIES (  
'transient_lastDdlTime'='1489696241')
    ```
    Below is the query that was run using Spark SQL:
    ```
    val df=sqlContext.sql("Insert into table db1.spark_load_cntl_stats select 
'AAM-846d55f6-0ffe-4694-b37a-1637a58f34f2','AAM','2017-03-21 
04:03:01','2017-03-21 04:03:01','Started',45,'2017-03-21 04:03:01',1,'ad'")
    ```
    **After:**
    The ClassCastException is gone and the insert succeeds.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/weiqingy/spark SPARK-6628

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17989.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17989
    
----
commit 0fa2bb791d1fa9c37fe89c1942ce0ed950a9ee59
Author: Weiqing Yang <[email protected]>
Date:   2017-05-16T00:12:16Z

    [SPARK-6628][SQL] Fix ClassCastException when executing sql statement 
'insert into' on hbase table

----


