phani482 opened a new issue, #7800:
URL: https://github.com/apache/hudi/issues/7800

   Hello Team,
   
   We are running a Glue streaming job which reads from Kinesis and writes to a
Hudi COW table (S3) registered in the Glue catalog.
   The job has been running for ~1 year without issues. However, we recently
started seeing OOM errors as below, with little insight from the logging.
   
   a. I tried moving [.commits_.archive] files out of the .hoodie folder to reduce
its size. This helped for a while, but the issue started to surface again.
   (s3://<bucket>/prefix/.hoodie/.commits_.archive.1763_1-0-1)
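   For reference, a minimal sketch (our own helper, not a Hudi or AWS utility) of how we pick out the archived-commit files from an S3 listing before moving them; `archive_keys` and the sample key list are illustrative, with paths modeled on the one above:

```python
# Hypothetical helper: filter archived-commit files out of an S3 key listing.
# The ".commits_.archive" marker matches the file name shown in this issue.

ARCHIVE_MARKER = ".commits_.archive"

def archive_keys(keys):
    """Return only the keys under .hoodie/ that are archived-commit files."""
    return [k for k in keys if "/.hoodie/" in k and ARCHIVE_MARKER in k]

# Example listing (paths modeled on the one in this issue):
keys = [
    "prefix/.hoodie/.commits_.archive.1763_1-0-1",
    "prefix/.hoodie/hoodie.properties",
    "prefix/part-0001.parquet",
]
print(archive_keys(keys))  # → ['prefix/.hoodie/.commits_.archive.1763_1-0-1']
```

   In practice we feed this the result of an S3 listing and move the matching keys out of the table prefix.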
   
   b. Here are the write options we are using with Apache Hudi Connector 0.9.0:

             "hoodie.datasource.write.operation": "insert",
             "hoodie.insert.shuffle.parallelism": 10,
             "hoodie.bulkinsert.shuffle.parallelism": 10,
             "hoodie.upsert.shuffle.parallelism": 10,
             "hoodie.delete.shuffle.parallelism": 10,
             "hoodie.parquet.small.file.limit": 8 * 1000 * 1000,  # 8 MB
             "hoodie.parquet.max.file.size": 10 * 1000 * 1000,  # 10 MB
             "hoodie.datasource.hive_sync.use_jdbc": "false",
             "hoodie.datasource.hive_sync.enable": "false",
             "hoodie.datasource.hive_sync.database": "database_name",
             "hoodie.datasource.hive_sync.table": "raw_table_name",
             "hoodie.datasource.hive_sync.partition_fields": "entity_name",
             "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
             "hoodie.datasource.hive_sync.support_timestamp": "true",
             "hoodie.keep.min.commits": 1450,
             "hoodie.keep.max.commits": 1500,
             "hoodie.cleaner.commits.retained": 1449,
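   As a sanity check on the archival/cleaner settings above, here is a small sketch (our own check, not a Hudi API): to our understanding Hudi expects `cleaner.commits.retained < keep.min.commits <= keep.max.commits`, and the values we use (1449 / 1450 / 1500) satisfy that:

```python
# Sanity check (our own sketch, not a Hudi API) of the retention settings
# above. To our understanding, Hudi expects
#     cleaner.commits.retained < keep.min.commits <= keep.max.commits
# so that archival never removes commits the cleaner still needs.
def retention_config_ok(retained, keep_min, keep_max):
    return retained < keep_min <= keep_max

# Our values from the options block above:
print(retention_config_ok(1449, 1450, 1500))  # True
```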
   
   Error:
   ###########
   INFO:py4j.java_gateway:Received command  on object id
   INFO:py4j.java_gateway:Closing down callback connection
   --
   INFO:py4j.java_gateway:Callback Connection ready to receive messages
   INFO:py4j.java_gateway:Received command c on object id p0
   INFO:root:Batch ID: 160325 has 110 records
   # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
   # -XX:OnOutOfMemoryError="kill -9 %p"
   #   Executing /bin/sh -c "kill -9 7"...
   ###########
   
   Q: We noticed that ".commits_.archive" files are not cleaned up by Hudi by
default. Are there any settings we need to enable for this to happen?
   
   Q: Our .hoodie folder was ~1.5 GB in size before we started moving archive
files out of it. Is this a huge size for a .hoodie folder? What are the best
practices for maintaining the .hoodie folder in terms of size and object count?
   
   Q: The error logs don't provide further detail, and even using 20 G.1X-type
DPUs on Glue has not helped (executor memory: 10 GB, driver memory: 10 GB,
executor cores: 8). Our workload is not huge: we get a few thousand events every
hour, and on average our job processes about 1 million records a day. The
payload size is no more than ~300 kB.
   
   Please let me know if you need any further details
   
   Thanks
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
