TarunMootala opened a new issue, #11122:
URL: https://github.com/apache/hudi/issues/11122

   **Describe the problem you faced**
   We have a Spark Structured Streaming job that reads data from an input stream 
and appends it to a COW table partitioned on subject area. The streaming job has 
a batch interval of 120 seconds.
   
   Intermittently the job is failing with error 
   
   ```
   java.lang.OutOfMemoryError: Requested array size exceeds VM limit
   ```
   
   We debugged multiple failures; the job always fails at the stage `collect at 
HoodieSparkEngineContext.java:118 (CleanPlanActionExecutor)`.
   
   
   **To Reproduce**
   
   No specific steps. 
   
   **Expected behavior**
   
   The job should commit the data successfully and continue with next micro 
batch. 
   
   **Environment Description**
   
   * Hudi version : 0.12.1 (Glue 4.0)
   
   * Spark version : Spark 3.3.0
   
   * Hive version : N/A
   
   * Hadoop version : N/A
   
   * Storage (HDFS/S3/GCS..) : S3 
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   We are not sure of the exact root cause or fix. However, a workaround 
(not ideal) is to manually delete (archive) a few of the oldest Hudi metadata 
files from the active timeline (`.hoodie` folder) and reduce 
`hoodie.keep.max.commits`. This only works when we reduce max commits, and each 
time the max commits are reduced the job runs fine for a few months before 
failing again.
   
   Our requirement is to retain 1500 commits to enable incremental queries over 
the last 2 days of changes. We initially started with a max commits of 1500 and 
have gradually come down to 400.
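   For context, a quick back-of-the-envelope check (my own arithmetic, not from 
the issue) of how the 1500-commit target relates to the 120-second trigger:

   ```python
   # Hypothetical sanity check: with a 120-second micro-batch trigger,
   # estimate how many commits span 2 days of history.
   SECONDS_PER_DAY = 24 * 60 * 60
   batch_interval_s = 120
   retention_days = 2

   commits_needed = retention_days * SECONDS_PER_DAY // batch_interval_s
   print(commits_needed)  # 1440, so ~1500 commits covers 2 days
   ```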
   
   **Hudi Config**
   
   ```
               "hoodie.table.name": "table_name",
               "hoodie.datasource.write.keygenerator.type": "COMPLEX",
               "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.ComplexKeyGenerator",
               "hoodie.datasource.write.partitionpath.field": "entity_name",
               "hoodie.datasource.write.recordkey.field": 
"partition_key,sequence_number",
               "hoodie.datasource.write.precombine.field": 
"approximate_arrival_timestamp",
               "hoodie.datasource.write.operation": "insert",
               "hoodie.insert.shuffle.parallelism": 10,
               "hoodie.bulkinsert.shuffle.parallelism": 10,
               "hoodie.upsert.shuffle.parallelism": 10,
               "hoodie.delete.shuffle.parallelism": 10,
               "hoodie.metadata.enable": "false",
               "hoodie.datasource.hive_sync.use_jdbc": "false",
               "hoodie.datasource.hive_sync.enable": "false",
               "hoodie.datasource.hive_sync.database": "database_name",
               "hoodie.datasource.hive_sync.table": "table_name",
               "hoodie.datasource.hive_sync.partition_fields": "entity_name",
               "hoodie.datasource.hive_sync.partition_extractor_class": 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
               "hoodie.datasource.hive_sync.support_timestamp": "true",
               "hoodie.keep.min.commits": 450,  # to preserve commits for at 
least 2 days with processingTime="120 seconds"
               "hoodie.keep.max.commits": 480,  # to preserve commits for at 
least 2 days with processingTime="120 seconds"
               "hoodie.cleaner.commits.retained": 449,
   ```
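   As an aside, my understanding (an assumption on my part, worth verifying 
against the Hudi docs) is that the cleaner and archival settings must satisfy 
`hoodie.cleaner.commits.retained < hoodie.keep.min.commits < hoodie.keep.max.commits`. 
A trivial check of the values above:

   ```python
   # Hedged sketch: verify the assumed ordering constraint between the
   # cleaner and archival (timeline keep) settings used in this issue.
   def timeline_config_ok(retained: int, keep_min: int, keep_max: int) -> bool:
       """True if cleaner.commits.retained < keep.min.commits < keep.max.commits."""
       return retained < keep_min < keep_max

   print(timeline_config_ok(449, 450, 480))  # True for the config above
   ```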
   
   **Stacktrace**
   
   
   ```
   java.lang.OutOfMemoryError: Requested array size exceeds VM limit
   ```
   
   
   
   

