Raghvendradubey opened a new issue #1694:
URL: https://github.com/apache/hudi/issues/1694
Hi Team,
I am reading data from Kafka and ingesting it into a Hudi dataset (MOR)
using the Hudi DataSource API through Spark Structured Streaming.
Pipeline structure:
Kafka (source) > Spark Structured Streaming (EMR) > MOR Hudi table (S3)
Spark: 2.4.5
Hudi: 0.5.2
I am seeing performance issues while writing data into the Hudi dataset.
The following Hudi jobs are taking most of the time:
- countByKey at HoodieBloomIndex.java
- countByKey at WorkloadProfile.java
- count at HoodieSparkSqlWriter.scala
Configuration used to write the Hudi dataset:

new_df.write.format("org.apache.hudi") \
    .option("hoodie.table.name", tableName) \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .option("hoodie.datasource.write.recordkey.field", "wbn") \
    .option("hoodie.datasource.write.partitionpath.field", "ad") \
    .option("hoodie.datasource.write.precombine.field", "action_date") \
    .option("hoodie.compact.inline", "true") \
    .option("hoodie.compact.inline.max.delta.commits", "300") \
    .option("hoodie.datasource.hive_sync.enable", "true") \
    .option("hoodie.upsert.shuffle.parallelism", "5") \
    .option("hoodie.insert.shuffle.parallelism", "5") \
    .option("hoodie.bulkinsert.shuffle.parallelism", "5") \
    .option("hoodie.datasource.hive_sync.table", tableName) \
    .option("hoodie.datasource.hive_sync.partition_fields", "ad") \
    .option("hoodie.index.type", "GLOBAL_BLOOM") \
    .option("hoodie.bloom.index.update.partition.path", "true") \
    .option("hoodie.datasource.hive_sync.assume_date_partitioning", "false") \
    .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
    .mode("append") \
    .save(tablePath)
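For readability, the same options can also be collected into a Python dict and passed in one go via DataFrameWriter.options(**...). This is only a sketch of the equivalent configuration: hudi_options is an illustrative variable name, and "my_table" stands in for the tableName variable used above.

```python
# The Hudi write configuration from the snippet above, as a dict.
# "hudi_options" and "my_table" are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "wbn",
    "hoodie.datasource.write.partitionpath.field": "ad",
    "hoodie.datasource.write.precombine.field": "action_date",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "300",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.upsert.shuffle.parallelism": "5",
    "hoodie.insert.shuffle.parallelism": "5",
    "hoodie.bulkinsert.shuffle.parallelism": "5",
    "hoodie.datasource.hive_sync.table": "my_table",
    "hoodie.datasource.hive_sync.partition_fields": "ad",
    "hoodie.index.type": "GLOBAL_BLOOM",
    "hoodie.bloom.index.update.partition.path": "true",
    "hoodie.datasource.hive_sync.assume_date_partitioning": "false",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}

# Usage (inside the streaming batch writer):
# new_df.write.format("org.apache.hudi").options(**hudi_options) \
#     .mode("append").save(tablePath)
```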
Spark submit command:

spark-submit --deploy-mode client --master yarn \
    --executor-memory 6g --executor-cores 1 \
    --driver-memory 4g \
    --conf spark.driver.maxResultSize=2g \
    --conf spark.executor.id=driver \
    --conf spark.executor.instances=300 \
    --conf spark.kryoserializer.buffer.max=512m \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.sql.hive.convertMetastoreParquet=false \
    --conf spark.task.cpus=1 \
    --conf spark.yarn.driver.memoryOverhead=1024 \
    --conf spark.yarn.executor.memoryOverhead=3072 \
    --conf spark.yarn.max.executor.failures=100 \
    --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 \
    --py-files s3://spark-test/hudi_job.py
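As a back-of-the-envelope check, the numbers above imply a large gap between cluster capacity and the configured shuffle parallelism. This is pure arithmetic on the values already shown (not measured behavior); whether and how to raise the parallelism is exactly what I am asking about.

```python
# Cluster capacity vs configured Hudi shuffle parallelism, using the
# values from the spark-submit command and Hudi options above.
executor_instances = 300   # --conf spark.executor.instances=300
executor_cores = 1         # --executor-cores 1
total_task_slots = executor_instances * executor_cores

upsert_parallelism = 5     # hoodie.upsert.shuffle.parallelism

# With only 5 shuffle partitions, at most 5 of the 300 task slots can be
# busy during the upsert shuffle stages; the rest sit idle.
idle_slots = total_task_slots - upsert_parallelism
print(total_task_slots, idle_slots)  # 300 295
```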
Attaching screenshots of the job details for the three stages listed above
(countByKey at HoodieBloomIndex.java, countByKey at WorkloadProfile.java,
count at HoodieSparkSqlWriter.scala).
Please suggest how I can tune this.
Thanks,
Raghvendra
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]