phani482 opened a new issue, #7800:
URL: https://github.com/apache/hudi/issues/7800
Hello Team,
We are running a Glue streaming job which reads from Kinesis and writes to a
Hudi COW table (S3) registered in the Glue catalog.
The job has been running for ~1 year without issues. However, lately we started
seeing OOM errors as below, without much insight from the logging.
a. I tried moving the [.commits_.archive] files out of the .hoodie folder to
reduce the size of the .hoodie folder. This helped for a while, but the issue
started to surface again.
(s3://<bucket>/prefix/.hoodie/.commits_.archive.1763_1-0-1)
b. Here are the write options we are using for the Apache Hudi Connector 0.9.0:
"hoodie.datasource.write.operation": "insert",
"hoodie.insert.shuffle.parallelism": 10,
"hoodie.bulkinsert.shuffle.parallelism": 10,
"hoodie.upsert.shuffle.parallelism": 10,
"hoodie.delete.shuffle.parallelism": 10,
"hoodie.parquet.small.file.limit": 8 * 1000 * 1000, # 8MB
"hoodie.parquet.max.file.size": 10 * 1000 * 1000, # 10 MB
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.enable": "false",
"hoodie.datasource.hive_sync.database": "database_name",
"hoodie.datasource.hive_sync.table": "raw_table_name",
"hoodie.datasource.hive_sync.partition_fields": "entity_name",
"hoodie.datasource.hive_sync.partition_extractor_class":
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.keep.min.commits": 1450,
"hoodie.keep.max.commits": 1500,
"hoodie.cleaner.commits.retained": 1449,
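The retention settings above keep roughly 1,450-1,500 commits on the active timeline, which forces Hudi to track a very large timeline in memory and on S3. As a minimal sketch (the specific numbers below are illustrative assumptions, not official guidance for this workload), a much tighter window can be configured while preserving the ordering Hudi requires between the cleaner and archival settings:

```python
# Hedged sketch: tighter commit retention so the active timeline stays small.
# The values 20/30/40 are illustrative assumptions, not a tuned recommendation.
hudi_retention_options = {
    # Commits beyond this window are archived off the active timeline.
    "hoodie.keep.min.commits": 30,
    "hoodie.keep.max.commits": 40,
    # Must stay strictly below hoodie.keep.min.commits.
    "hoodie.cleaner.commits.retained": 20,
}

# Hudi expects: cleaner.commits.retained < keep.min.commits < keep.max.commits
retained = hudi_retention_options["hoodie.cleaner.commits.retained"]
keep_min = hudi_retention_options["hoodie.keep.min.commits"]
keep_max = hudi_retention_options["hoodie.keep.max.commits"]
assert retained < keep_min < keep_max, "invalid Hudi retention configuration"
```

Note the original configuration (1449 < 1450 < 1500) satisfies the same invariant, just with a window roughly 40x larger, which is why the timeline and archive grow so big.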
Error:
###########
INFO:py4j.java_gateway:Received command on object id
INFO:py4j.java_gateway:Closing down callback connection
--
INFO:py4j.java_gateway:Callback Connection ready to receive messages
INFO:py4j.java_gateway:Received command c on object id p0
INFO:root:Batch ID: 160325 has 110 records
## java.lang.OutOfMemoryError: Requested array size exceeds VM limit#
-XX:OnOutOfMemoryError="kill -9 %p"# Executing /bin/sh -c "kill -9 7"...
###########
Q: We noticed that the ".commits_.archive" files are not cleaned up by
Hudi by default. Are there any settings we need to enable for this to happen?
Q: Our .hoodie folder was ~1.5 GB in size before we started moving the archive
files out of the folder. Is this a huge size for a .hoodie folder? What are the
best practices for maintaining the .hoodie folder in terms of size and object count?
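For tracking the folder's growth over time, a small helper can sum the size and object count under the .hoodie/ prefix. This is a sketch assuming a boto3 S3 client; the bucket and prefix names in the usage comment are placeholders:

```python
def hoodie_folder_stats(s3_client, bucket, hoodie_prefix):
    """Return (total_bytes, object_count) for all objects under a prefix.

    s3_client is assumed to be a boto3 S3 client (or anything exposing the
    same list_objects_v2 paginator interface).
    """
    total_bytes = 0
    total_objects = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=hoodie_prefix):
        for obj in page.get("Contents", []):
            total_bytes += obj["Size"]
            total_objects += 1
    return total_bytes, total_objects

# Usage (placeholders, run where boto3 is available):
# import boto3
# size, count = hoodie_folder_stats(boto3.client("s3"),
#                                   "my-bucket", "prefix/.hoodie/")
```

Watching these two numbers per day makes it clear whether archival/cleaning is keeping up or the metadata folder is growing without bound.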
Q: The error logs don't indicate more details, but even using 20 G.1X
DPUs on Glue does not seem to help (executor memory: 10 GB, driver
memory: 10 GB, executor cores: 8). Our workload is not huge: we get a few
thousand events every hour, and on average our job processes ~1 million
records a day. The payload size is no more than ~300 KB.
Please let me know if you need any further details.
Thanks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]