lihuahui5683 opened a new issue, #5382:
URL: https://github.com/apache/hudi/issues/5382
**Describe the problem you faced**

The following exception occurs when running a Hive incremental query against the Hudi `xxx_rt` table:
```
22/04/21 10:43:58 INFO scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 8) on Hadoop02, executor 7: java.lang.ClassCastException (org.apache.hudi.hadoop.hive.HoodieCombineRealtimeFileSplit cannot be cast to org.apache.hadoop.hive.shims.HadoopShimsSecure$InputSplitShim) [duplicate 3]
22/04/21 10:43:58 INFO cluster.YarnClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/04/21 10:43:58 ERROR client.RemoteDriver: Failed to run client job 60806c1e-f2b0-4ee5-bbab-46f8238f3493
java.util.concurrent.ExecutionException: Exception thrown by job
    at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:337)
    at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:342)
    at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:404)
    at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:365)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 1.0 because task 2 (partition 2) cannot run anywhere due to node and executor blacklist. Most recent failure: Lost task 2.0 in stage 1.0 (TID 10, Hadoop02, executor 7): java.lang.ClassCastException: org.apache.hudi.hadoop.hive.HoodieCombineRealtimeFileSplit cannot be cast to org.apache.hadoop.hive.shims.HadoopShimsSecure$InputSplitShim
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:205)
    at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:979)
    at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:272)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:271)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:225)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
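For context, the failing frame is `HadoopShimsSecure$CombineFileRecordReader.<init>`, which casts the incoming split to `InputSplitShim`, while Hudi hands it a `HoodieCombineRealtimeFileSplit`. Since the two are sibling subtypes rather than one extending the other, the cast cannot succeed. The sketch below uses simplified stand-in class names (not the real Hadoop/Hudi types) just to illustrate the failure mode:

```java
// Simplified stand-ins: RealtimeSplit mirrors HoodieCombineRealtimeFileSplit,
// SplitShim mirrors HadoopShimsSecure$InputSplitShim. Both extend a common
// parent but are unrelated to each other, so casting between them fails.
class CombineSplit {}
class RealtimeSplit extends CombineSplit {}
class SplitShim extends CombineSplit {}

public class CastDemo {
    public static void main(String[] args) {
        CombineSplit split = new RealtimeSplit();
        try {
            // Mirrors the downcast performed in the record-reader constructor.
            SplitShim shim = (SplitShim) split;
            System.out.println("cast succeeded");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException");  // prints this branch
        }
    }
}
```

This suggests the record-reader path taken here was never given a split of the type it expects for realtime (`_rt`) tables.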
I also set the following parameters in the Hive session:
```
add jar hdfs://mycluster/hudi/jars/hudi-hadoop-mr-bundle-0.10.0.jar;
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hoodie.role_sync_hive.consume.mode=INCREMENTAL;
set hoodie.role_sync_hive.consume.max.commits=3;
set mapreduce.input.fileinputformat.split.maxsize=128;
set hive.fetch.task.conversion=none;
set hoodie.role_sync_hive.consume.start.timestamp=20220420143200507;
```
The query statement is as follows:
```
select * from role_sync_hive_rt where `_hoodie_commit_time` > '20220420143200507';
```
**Environment Description**
* Hudi version : 0.10.0
* Spark version : 2.4.0_cdh6.3.2
* Hive version : 2.1.1_cdh6.3.2
* Hadoop version : 3.0.0_cdh6.3.2
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no