Rap70r edited a comment on issue #3697:
URL: https://github.com/apache/hudi/issues/3697#issuecomment-933893616


   Hi @xushiyan,
   
   Here is an update on our latest tests. I switched to the d3.xlarge 
instance type and used the following configs:
   `spark-submit --deploy-mode cluster --conf spark.scheduler.mode=FAIR --conf 
spark.shuffle.service.enabled=true --conf 
spark.sql.hive.convertMetastoreParquet=false --conf 
spark.driver.maxResultSize=6g --conf spark.driver.memory=17g --conf 
spark.executor.cores=2 --conf 
spark.hadoop.parquet.enable.summary-metadata=false --conf 
spark.driver.memoryOverhead=6g --conf spark.network.timeout=600s --conf 
spark.executor.instances=50 --conf spark.executor.memoryOverhead=4g --conf 
spark.driver.cores=2 --conf spark.executor.memory=8g --conf 
spark.memory.storageFraction=0.1 --conf spark.executor.heartbeatInterval=120s 
--conf spark.memory.fraction=0.4 --conf spark.rdd.compress=true --conf 
spark.kryoserializer.buffer.max=200m --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.sql.shuffle.partitions=200 --conf spark.default.parallelism=200 --conf 
spark.task.cpus=2`
   
   I also removed "spark.sql.parquet.mergeSchema".
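
   As a rough sketch of what these settings imply for parallelism: Spark runs 
`spark.executor.cores / spark.task.cpus` tasks at a time on each executor, so 
the values above give one task slot per executor. The numbers are copied from 
the command and from the topic description in this comment; the variable names 
are illustrative only, and the total assumes the cluster actually grants all 
50 requested executors.

```python
# Values copied from the spark-submit command above.
executor_instances = 50   # spark.executor.instances
executor_cores = 2        # spark.executor.cores
task_cpus = 2             # spark.task.cpus
kafka_partitions = 50     # partition count of the topic described in this comment

# Each executor can run executor_cores // task_cpus tasks concurrently.
slots_per_executor = executor_cores // task_cpus

# Assumption: all requested executors are actually allocated.
total_slots = executor_instances * slots_per_executor

print(f"task slots per executor: {slots_per_executor}")
print(f"total concurrent task slots: {total_slots}")
print(f"task slots per Kafka partition: {total_slots / kafka_partitions:.1f}")
```

With `spark.task.cpus=2` and `spark.executor.cores=2`, each executor runs a 
single task at a time, which under these assumptions works out to one task 
slot per Kafka partition.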
   
   I have noticed a significant increase in speed for all the steps except the 
one that extracts events from Kafka, which I can't seem to improve. We are 
using an st1 high-throughput EBS volume attached to the EMR master node. The 
topic is compacted and contains ~50 million records across 50 partitions. 
Even with the powerful instance type above, it takes 40 minutes to extract all 
records.
   Basically, the slow part is partition seeking: it takes 
a couple of minutes to seek from offset 50000 to 100000.
   Do you have any suggestions on how to improve data ingestion from Kafka 
using Spark Structured Streaming?
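
   To put the figures above in perspective, here is a back-of-the-envelope 
throughput calculation. It uses only the numbers quoted in this comment; the 
record count is approximate, and "a couple of minutes" is assumed to mean 2 
minutes, so the results are estimates rather than measurements.

```python
# Figures quoted in this comment (the record count is approximate).
total_records = 50_000_000
partitions = 50
extract_minutes = 40

# Overall and per-partition extraction rate.
records_per_sec = total_records / (extract_minutes * 60)
per_partition_per_sec = records_per_sec / partitions

# Assumption: "a couple of minutes" is taken to be 2 minutes.
seek_offsets = 100_000 - 50_000
seek_minutes = 2
seek_offsets_per_sec = seek_offsets / (seek_minutes * 60)

print(f"overall: {records_per_sec:.0f} records/s")
print(f"per partition: {per_partition_per_sec:.0f} records/s")
print(f"observed seek rate: {seek_offsets_per_sec:.0f} offsets/s")
```

Under these assumptions the per-partition rate and the observed seek rate both 
come out to roughly 417 per second, which is consistent with the seek step 
accounting for most of the 40 minutes.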
   
   Thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

