Raghvendradubey opened a new issue #1694:
URL: https://github.com/apache/hudi/issues/1694


   Hi Team,
   
   I am reading data from Kafka and ingesting it into a Hudi dataset (MOR) using the Hudi DataSource API through Spark Structured Streaming.
   Pipeline structure:
   
   Kafka (source) > Spark Structured Streaming (EMR) > MOR Hudi table (S3)
   
   Spark - 2.4.5
   Hudi - 0.5.2
   
   I am seeing performance issues while writing data into the Hudi dataset.
   The following Hudi jobs are taking the most time:
   countByKey at HoodieBloomIndex.java
   countByKey at WorkloadProfile.java
   count at HoodieSparkSqlWriter.scala
   Configuration used to write the Hudi dataset is as follows:
   new_df.write.format("org.apache.hudi") \
       .option("hoodie.table.name", tableName) \
       .option("hoodie.datasource.write.operation", "upsert") \
       .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
       .option("hoodie.datasource.write.recordkey.field", "wbn") \
       .option("hoodie.datasource.write.partitionpath.field", "ad") \
       .option("hoodie.datasource.write.precombine.field", "action_date") \
       .option("hoodie.compact.inline", "true") \
       .option("hoodie.compact.inline.max.delta.commits", "300") \
       .option("hoodie.datasource.hive_sync.enable", "true") \
       .option("hoodie.upsert.shuffle.parallelism", "5") \
       .option("hoodie.insert.shuffle.parallelism", "5") \
       .option("hoodie.bulkinsert.shuffle.parallelism", "5") \
       .option("hoodie.datasource.hive_sync.table", tableName) \
       .option("hoodie.datasource.hive_sync.partition_fields", "ad") \
       .option("hoodie.index.type", "GLOBAL_BLOOM") \
       .option("hoodie.bloom.index.update.partition.path", "true") \
       .option("hoodie.datasource.hive_sync.assume_date_partitioning", "false") \
       .option("hoodie.datasource.hive_sync.partition_extractor_class",
               "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
       .mode("append").save(tablePath)
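   For reference, the same writer options can be collected into a single dict and passed via `.options(**hudi_options)`, which keeps the write call short. This is only a restructuring sketch of the configuration above, not a behavior change; `"my_table"` is a placeholder for the actual `tableName`:

```python
# Sketch: the writer options from the snippet above gathered into one dict.
# "my_table" is a placeholder; in the real job this is tableName.
table_name = "my_table"

hudi_options = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "wbn",
    "hoodie.datasource.write.partitionpath.field": "ad",
    "hoodie.datasource.write.precombine.field": "action_date",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "300",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.upsert.shuffle.parallelism": "5",
    "hoodie.insert.shuffle.parallelism": "5",
    "hoodie.bulkinsert.shuffle.parallelism": "5",
    "hoodie.datasource.hive_sync.table": table_name,
    "hoodie.datasource.hive_sync.partition_fields": "ad",
    "hoodie.index.type": "GLOBAL_BLOOM",
    "hoodie.bloom.index.update.partition.path": "true",
    "hoodie.datasource.hive_sync.assume_date_partitioning": "false",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}

# With a DataFrame `new_df` and target path `tablePath`, the write becomes:
# new_df.write.format("org.apache.hudi").options(**hudi_options) \
#     .mode("append").save(tablePath)
```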
   
   Spark submit command:
   spark-submit --deploy-mode client --master yarn \
       --executor-memory 6g --executor-cores 1 \
       --driver-memory 4g \
       --conf spark.driver.maxResultSize=2g \
       --conf spark.executor.id=driver \
       --conf spark.executor.instances=300 \
       --conf spark.kryoserializer.buffer.max=512m \
       --conf spark.shuffle.service.enabled=true \
       --conf spark.sql.hive.convertMetastoreParquet=false \
       --conf spark.task.cpus=1 \
       --conf spark.yarn.driver.memoryOverhead=1024 \
       --conf spark.yarn.executor.memoryOverhead=3072 \
       --conf spark.yarn.max.executor.failures=100 \
       --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
       --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 \
       --py-files s3://spark-test/hudi_job.py
   Attaching screenshots of the job details.
   
![hudi-job](https://user-images.githubusercontent.com/16387812/83402829-1cabfc80-a425-11ea-868b-e7c66204ac1e.png)
   
   countByKey at HoodieBloomIndex.java
   
![countbykey-hoodiebloomindx](https://user-images.githubusercontent.com/16387812/83402914-4cf39b00-a425-11ea-8bf3-6a21643d1480.png)
   
![countbykeyhoodiebloomindextask](https://user-images.githubusercontent.com/16387812/83402931-554bd600-a425-11ea-8047-183be072346c.png)
   
   countByKey at WorkloadProfile.java
   
![workloadprofile](https://user-images.githubusercontent.com/16387812/83402997-71e80e00-a425-11ea-9d2d-52e8765b20fc.png)
   
![workloadprofiletask](https://user-images.githubusercontent.com/16387812/83403022-7f9d9380-a425-11ea-9175-f0e43763f4a9.png)
   
   count at HoodieSparkSqlWriter.scala
   
![hoodiesparksqlwriter](https://user-images.githubusercontent.com/16387812/83403066-9348fa00-a425-11ea-9643-26cda91e854f.png)
   
![sparksqlwritertask](https://user-images.githubusercontent.com/16387812/83403078-993edb00-a425-11ea-9a20-1268dd92f813.png)
   
   Please suggest how I can tune this.
   
   Thanks
   Raghvendra

