rafaelhbarros opened a new issue #2083:
URL: https://github.com/apache/hudi/issues/2083
**Describe the problem you faced**
I have a Kafka topic that produces 1-2 million records per minute, and I'm
trying to write these records to S3 in the Hudi format. I can't get the job
to keep up with the input. I'm running on EMR with an m5.xlarge driver and
3x c5.xlarge core instances. The data is serialized in Avro and
deserialized against Schema Registry (using ABRiS).
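For reference, the overall pipeline shape is roughly the following sketch (broker address, topic name, paths, and trigger interval are hypothetical placeholders, and the ABRiS decoding call is only indicated as a comment since its API differs between ABRiS versions):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("hudi-consumer").getOrCreate()

// Read the Avro-encoded records from Kafka as a stream.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "events")                    // hypothetical topic
  .load()

// `value` holds Confluent-framed Avro; it is decoded via ABRiS against
// Schema Registry (call omitted here because the API varies by version).
val decoded: DataFrame = kafkaDf // placeholder for the decoded frame

// Upsert each micro-batch into the copy-on-write Hudi table on S3.
decoded.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .format("org.apache.hudi")
      .mode(SaveMode.Append)
      .save("s3://bucket/hudi/table") // hypothetical path
  }
  .option("checkpointLocation", "s3://bucket/checkpoints") // hypothetical
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()
  .awaitTermination()
```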
**Spark submit command**
```
spark-submit \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--master yarn \
--name hudi-consumer \
--deploy-mode cluster \
--conf spark.yarn.submit.waitAppCompletion=false \
--conf spark.scheduler.mode=FAIR \
--conf spark.task.maxFailures=10 \
--conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true \
--conf spark.kryoserializer.buffer.max=512m \
--conf spark.memory.storageFraction=0.1 \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.driver.maxResultSize=3g \
--conf spark.yarn.max.executor.failures=10 \
--conf spark.file.partitions=10 \
--conf spark.sql.shuffle.partitions=80 \
--conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-XX:+CMSClassUnloadingEnabled -XX:+ExitOnOutOfMemoryError" \
--conf spark.driver.extraJavaOptions="-XX:+PrintTenuringDistribution
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" \
--driver-memory 4G \
--executor-memory 5G \
--executor-cores 4 \
--num-executors 6 \
--class <class> <jar>
```
Hudi confs:
```
hoodie.combine.before.upsert=false
hoodie.bulkinsert.shuffle.parallelism=10
hoodie.insert.shuffle.parallelism=10
hoodie.upsert.shuffle.parallelism=10
hoodie.delete.shuffle.parallelism=1
TABLE_TYPE_OPT_KEY()=COW_TABLE_TYPE_OPT_VAL()
```
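In Scala, those properties would be set on the writer roughly as follows, a sketch against the 0.5.2 `DataSourceWriteOptions` accessors (the record-key field, precombine field, table name, and target path are hypothetical placeholders):

```scala
import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig

batch.write
  .format("org.apache.hudi")
  .option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL)
  .option("hoodie.combine.before.upsert", "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", "10")
  .option("hoodie.insert.shuffle.parallelism", "10")
  .option("hoodie.upsert.shuffle.parallelism", "10")
  .option("hoodie.delete.shuffle.parallelism", "1")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")   // hypothetical key field
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")  // hypothetical precombine field
  .option(HoodieWriteConfig.TABLE_NAME, "my_table") // hypothetical
  .mode(SaveMode.Append)
  .save("s3://bucket/hudi/table") // hypothetical path
```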
* Hudi version : 0.5.2-incubating
* Spark version : 2.4.4 (scala 2.12, emr 6.0.0)
* Hive version : N/A
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]