NetsanetGeb commented on issue #714: Performance Comparison of 
HoodieDeltaStreamer and DataSourceAPI
URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-511349113
 
 
   I changed the driver memory, executor memory, and number of executors to:
   spark.driver.memory=7168m
   spark.executor.memory=1024m
   spark.executor.instances=20
   
   In addition, I added the following settings:
   ```
   spark.yarn.driver.memoryOverhead=1024
   spark.yarn.executor.memoryOverhead=3072
   spark.kryoserializer.buffer.max=512m
   spark.serializer=org.apache.spark.serializer.KryoSerializer
   spark.shuffle.memoryFraction=0.2
   spark.shuffle.service.enabled=true
   spark.sql.hive.convertMetastoreParquet=false
   spark.storage.memoryFraction=0.6
   spark.rdd.compress=true
   ```
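
   As a rough sanity check on the settings above, the memory that YARN is asked for per container is the JVM heap plus the configured overhead. The sketch below only does that arithmetic with the values from this comment. It assumes the overhead values are in MB and ignores any rounding YARN applies to allocation sizes; the helper name `container_mb` is made up for illustration.

   ```python
   # Values copied from the settings in this comment, all in MB.
   conf = {
       "spark.driver.memory": 7168,
       "spark.yarn.driver.memoryOverhead": 1024,
       "spark.executor.memory": 1024,
       "spark.yarn.executor.memoryOverhead": 3072,
       "spark.executor.instances": 20,
   }


   def container_mb(heap_mb, overhead_mb):
       """Hypothetical helper: memory requested per container (heap + overhead)."""
       return heap_mb + overhead_mb


   driver_mb = container_mb(
       conf["spark.driver.memory"], conf["spark.yarn.driver.memoryOverhead"]
   )
   executor_mb = container_mb(
       conf["spark.executor.memory"], conf["spark.yarn.executor.memoryOverhead"]
   )
   # Total requested across the driver and all executors.
   total_mb = driver_mb + executor_mb * conf["spark.executor.instances"]

   print(f"driver container:   {driver_mb} MB")
   print(f"executor container: {executor_mb} MB")
   print(f"total requested:    {total_mb} MB")
   ```

   With these numbers each executor container is mostly overhead (3072 MB) rather than heap (1024 MB), which may be worth revisiting when tuning further.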
   
   With these changes, the run time improved from 38 minutes to 19 minutes. That
is still slower than it should be, because it is taking too much time for 1888 MB
of data. For further follow-up, I am attaching the Spark UI screenshots of the job
with the changed configuration.
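
   To put the two run times in perspective, the effective throughput can be estimated from the figures quoted above (1888 MB, 38 minutes before, 19 minutes after). This is only back-of-the-envelope arithmetic, assuming the whole run time is spent on that data; the function name is illustrative.

   ```python
   def throughput_mb_per_s(size_mb, minutes):
       """Average throughput in MB/s for a job of the given size and duration."""
       return size_mb / (minutes * 60)


   size_mb = 1888  # input size quoted in this comment

   before = throughput_mb_per_s(size_mb, 38)  # original configuration
   after = throughput_mb_per_s(size_mb, 19)   # changed configuration

   print(f"before: {before:.2f} MB/s")
   print(f"after:  {after:.2f} MB/s")
   ```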
   
   
![spark1](https://user-images.githubusercontent.com/25975892/61210045-98149a80-a6fb-11e9-994a-c52645e117ee.png)
   
   
![spark2](https://user-images.githubusercontent.com/25975892/61210053-9e0a7b80-a6fb-11e9-8bd6-5841e7aa92ca.png)
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
