Armelabdelkbir opened a new issue, #9918:
URL: https://github.com/apache/hudi/issues/9918

   
   
   **Describe the problem you faced**
   
   Hello community,
   
   I'm using Hudi for change data capture with Spark Structured Streaming + Kafka + Debezium. My jobs generally work well, but occasionally a few of them fail with errors related to Parquet file size or format.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start long-running replication streams.
   
   **Expected behavior**
   
   Parquet files are written with the correct size and format.
   
   **Environment Description**

   * Hudi version : 0.10.0

   * Spark version : 3.1.4

   * Hive version : 1.2.1000

   * Storage (HDFS) : 2.7.3

   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   This problem occasionally occurs on certain tables.
   This is my config:
   
   ```
   hudi {
     options{
       upsert_parallelisme_value = "1500"
       insert_parallelisme_value = "1500"
       bulk_insert_parallelisme_value = "1500"
       bulk_insert_sort_mode = "NONE"
       parquet_small_file_limit = "104857600"
       streaming_retry_count = "3"
       streaming_retry_interval_ms ="2000"
       parquet_max_file_size = "134217728"
       parquet_block_size = "134217728"
       parquet_page_size = "1048576"
       index_type = "SIMPLE"
       simple.index_use_caching = "true"
       simple.index_input_storage_level = "MEMORY_AND_DISK_SER"
       partition.fields = ""
       generator = "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
       key_generator.hive = "org.apache.hudi.hive.NonPartitionedExtractor"
     }
     compaction {
       inline_compact = "true"
       inline_compact_num_delta_commits = "10"
       cleaner_commits_retained = "4"
       cleaner_policy = "KEEP_LATEST_COMMITS"
       cleaner_fileversions_retained = "3"
       async_clean = "true"
     }
   }
   ```
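   For anyone decoding the raw byte values above: they are the usual power-of-two sizes, i.e. the small-file limit is 100 MiB and the max file / block size is 128 MiB. A throwaway arithmetic check (plain Python, not Hudi code):

   ```python
   # Sanity-check the byte values used in the Hudi config above.
   MiB = 1024 * 1024

   assert 104857600 == 100 * MiB   # parquet_small_file_limit  -> 100 MiB
   assert 134217728 == 128 * MiB   # parquet_max_file_size / parquet_block_size -> 128 MiB
   assert 1048576 == 1 * MiB       # parquet_page_size -> 1 MiB
   ```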
   MVCC (multi-writer) lock config:
     
     ```
   HoodieLockConfig.LOCK_PROVIDER_CLASS_NAME.key -> "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
   HoodieLockConfig.ZK_CONNECT_URL.key -> "zookeper-poll:2181",
   HoodieLockConfig.ZK_PORT.key -> "2181",
   HoodieLockConfig.ZK_LOCK_KEY.key -> table.table_name,
   HoodieLockConfig.ZK_BASE_PATH.key -> ("/" + table.db_name),
   HoodieLockConfig.LOCK_ACQUIRE_NUM_RETRIES.key -> "15",
   HoodieLockConfig.LOCK_ACQUIRE_CLIENT_NUM_RETRIES.key -> "15",
   HoodieLockConfig.LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS.key -> "60000",
   HoodieLockConfig.LOCK_ACQUIRE_RETRY_MAX_WAIT_TIME_IN_MILLIS.key -> "60000",
   HoodieLockConfig.LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS.key -> "20000",
   ```
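   For reference, my understanding is that these `HoodieLockConfig` constants correspond roughly to the following string options in 0.10.x (treat the exact key names as an assumption and double-check them against the Hudi configuration docs; `hoodie.cleaner.policy.failed.writes=LAZY` is what the multi-writer docs recommend alongside optimistic concurrency):

   ```
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.cleaner.policy.failed.writes=LAZY
   hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
   hoodie.write.lock.zookeeper.url=zookeper-poll:2181
   hoodie.write.lock.zookeeper.port=2181
   hoodie.write.lock.zookeeper.lock_key=<table_name>
   hoodie.write.lock.zookeeper.base_path=/<db_name>
   ```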
   
   **Stacktrace**
   
   * For the "too small length" Parquet error:
   
   ```
   Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in stage 9.0 failed 4 times, most recent failure: Lost task 20.3 in stage 9.0 (TID 6057) (ocnode46 executor 2): org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
   at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:53)
   at org.apache.hudi.common.util.ParquetUtils$HoodieKeyIterator.hasNext(ParquetUtils.java:485)
   at java.util.Iterator.forEachRemaining(Iterator.java:115)
   at org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:197)
   at org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:147)
   at org.apache.hudi.io.HoodieKeyLocationFetchHandle.locations(HoodieKeyLocationFetchHandle.java:62)
   at org.apache.hudi.index.simple.HoodieSimpleIndex.lambda$fetchRecordLocations$33972fb4$1(HoodieSimpleIndex.java:155)
   at org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:117)
   at org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
   at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
   at org.apache.spark.scheduler.Task.run(Task.scala:131)
   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
   Caused by: java.lang.RuntimeException: hdfs://prod/cdc.db/database/table/723c5d09-573b-4df6-ad41-76ae19ec976f-0_2-16682-7063518_20231024224507047.parquet is not a Parquet file (too small length: 0)
   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514)
   at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:689)
   at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:595)
   at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152)
   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
   at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:48)
   ... 22 more
   ```
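   The immediate trigger here is a zero-length `.parquet` file left under the table path, which the SIMPLE index then tries to read. To look for such leftovers I use something like the sketch below (plain Python over a locally mounted copy; on HDFS you would use `hdfs dfs -ls -R` or a client library instead — the helper name and path handling are mine, not Hudi's):

   ```python
   import os

   def find_zero_length_parquet(root):
       """Walk a directory tree and return paths of empty .parquet files.

       Illustrative helper only: assumes the table path is reachable via the
       local filesystem; adapt to an HDFS client for a real cluster.
       """
       empties = []
       for dirpath, _dirs, files in os.walk(root):
           for name in files:
               if name.endswith(".parquet"):
                   path = os.path.join(dirpath, name)
                   if os.path.getsize(path) == 0:
                       empties.append(path)
       return empties
   ```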
   On another occasion I also hit this error, related to the Parquet format:
   
   ```
   expected magic number at tail [80, 65, 82, 49] but found [2, -70, -67, -119]
   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
   ```
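   For context, `[80, 65, 82, 49]` is just the ASCII bytes of `PAR1`, the 4-byte magic that every Parquet file carries at both its head and its tail; the file the reader hit had other bytes at the tail, i.e. a corrupt or partially written footer. A minimal check (plain Python sketch, not Hudi code; assumes the file is at least 4 bytes long):

   ```python
   def has_parquet_magic(path):
       """Return True if the file starts AND ends with the 'PAR1' magic."""
       MAGIC = b"PAR1"  # == bytes([80, 65, 82, 49])
       with open(path, "rb") as f:
           head = f.read(4)
           f.seek(-4, 2)  # 2 == os.SEEK_END; fails on files shorter than 4 bytes
           tail = f.read(4)
       return head == MAGIC and tail == MAGIC
   ```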
   

