rafaelhbarros opened a new issue #2083:
URL: https://github.com/apache/hudi/issues/2083
**Describe the problem you faced**
I have a Kafka topic that produces 1-2 million records per minute, and I'm
trying to write these records to S3 in the Hudi format. I can't get the job
to keep up with the input. I'm running on EMR with an m5.xlarge driver and
3x c5.xlarge core instances. The data is serialized in Avro and
deserialized against Schema Registry (using ABRiS).
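For reference, the overall pipeline shape is roughly the following sketch (broker address, topic name, paths, and trigger interval are hypothetical placeholders, and the ABRiS decoding call is only indicated as a comment since its API differs between ABRiS versions):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("hudi-consumer").getOrCreate()

// Read the Avro-encoded records from Kafka as a stream.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "events")                    // hypothetical topic
  .load()

// `value` holds Confluent-framed Avro; it is decoded via ABRiS against
// Schema Registry (call omitted here because the API varies by version).
val decoded: DataFrame = kafkaDf // placeholder for the decoded frame

// Upsert each micro-batch into the copy-on-write Hudi table on S3.
decoded.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .format("org.apache.hudi")
      .mode(SaveMode.Append)
      .save("s3://bucket/hudi/table") // hypothetical path
  }
  .option("checkpointLocation", "s3://bucket/checkpoints") // hypothetical
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()
  .awaitTermination()
```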
**Spark submit command**
```
spark-submit \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--master yarn \
--name hudi-consumer \
--deploy-mode cluster \
--conf spark.yarn.submit.waitAppCompletion=false \
--conf spark.scheduler.mode=FAIR \
--conf spark.task.maxFailures=10 \
--conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true \
--conf spark.kryoserializer.buffer.max=512m \
--conf spark.memory.storageFraction=0.1 \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.driver.maxResultSize=3g \
--conf spark.yarn.max.executor.failures=10 \
--conf spark.file.partitions=10 \
--conf spark.sql.shuffle.partitions=80 \
--conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-XX:+CMSClassUnloadingEnabled -XX:+ExitOnOutOfMemoryError" \
--conf spark.driver.extraJavaOptions="-XX:+PrintTenuringDistribution
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" \
--driver-memory 4G \
--executor-memory 5G \
--executor-cores 4 \
--num-executors 6 \
--class <class> <jar>
```
Hudi confs:
```
hoodie.combine.before.upsert=false
hoodie.bulkinsert.shuffle.parallelism=10
hoodie.insert.shuffle.parallelism=10
hoodie.upsert.shuffle.parallelism=10
hoodie.delete.shuffle.parallelism=1
TABLE_TYPE_OPT_KEY()=COW_TABLE_TYPE_OPT_VAL()
```
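In Scala, those properties would be set on the writer roughly as follows, a sketch against the 0.5.2 `DataSourceWriteOptions` accessors (the record-key field, precombine field, table name, and target path are hypothetical placeholders):

```scala
import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig

batch.write
  .format("org.apache.hudi")
  .option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL)
  .option("hoodie.combine.before.upsert", "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", "10")
  .option("hoodie.insert.shuffle.parallelism", "10")
  .option("hoodie.upsert.shuffle.parallelism", "10")
  .option("hoodie.delete.shuffle.parallelism", "1")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")   // hypothetical key field
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")  // hypothetical precombine field
  .option(HoodieWriteConfig.TABLE_NAME, "my_table") // hypothetical
  .mode(SaveMode.Append)
  .save("s3://bucket/hudi/table") // hypothetical path
```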
* Hudi version : 0.5.2-incubating
* Spark version : 2.4.4 (scala 2.12, emr 6.0.0)
* Hive version : N/A
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]