NetsanetGeb edited a comment on issue #714: Performance Comparison of 
HoodieDeltaStreamer and DataSourceAPI
URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-516753477
 
 
   After I switched to hoodie version 0.4.6, the performance improved, and the job 
now takes 4 minutes. 
   
   
![per2](https://user-images.githubusercontent.com/25975892/62195353-1c158600-b37c-11e9-905f-0b04213e614f.png)
   
    I also added code similar to the countByKey call to count the records in 
the HoodieDeltaStreamer class, to check why it takes so long in 
HoodieBloomIndex. That count took about 9 seconds, while the countByKey in 
HoodieBloomIndex still takes 39 seconds. The difference seems to be due to 
parallelism: the first count has a parallelism of 22, while the one in 
HoodieBloomIndex has 2, as observed from the Spark UI below.  
   
   
![per1](https://user-images.githubusercontent.com/25975892/62196336-1caf1c00-b37e-11e9-89f1-894387485ec7.png)
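
   To illustrate why the partition count matters here, below is a minimal sketch in plain Python. It is not Spark or Hudi code; `split_into_partitions` and `count_by_key` are hypothetical helpers, and the sample keys are made up. It only mimics the general shape of a countByKey-style job: one counting task per partition, followed by a merge of the partial counts. The result is the same for 2 or 22 partitions, but the number of tasks that can run concurrently differs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Round-robin records into num_partitions lists (a stand-in for RDD partitions)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


def count_by_key(partitions):
    """Count keys in each partition as a separate task, then merge the partial counts.

    Returns the merged counts and the number of tasks that were run.
    """
    def count_partition(partition):
        return Counter(key for key, _ in partition)

    # One task per partition: the partition count caps the achievable parallelism.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(count_partition, partitions))

    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total), len(partials)


# Made-up (partition path, record key) pairs, used only for illustration.
records = [("2019/07/30", f"key-{i}") for i in range(6)]
records += [("2019/07/31", f"key-{i}") for i in range(4)]

counts_2, tasks_2 = count_by_key(split_into_partitions(records, 2))
counts_22, tasks_22 = count_by_key(split_into_partitions(records, 22))

print(counts_2 == counts_22)  # same answer either way
print(tasks_2, tasks_22)      # number of tasks follows the partition count
```

   In Spark, the partition count of an RDD can be raised with `repartition` before counting; whether that is the right fix inside HoodieBloomIndex is an assumption I have not verified.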
   
   The effect becomes clearly visible as the size of the input data increases 
from 2 GB to 27 GB.
   
![per3](https://user-images.githubusercontent.com/25975892/62214909-3f552b00-b3a6-11e9-92b5-df197378795d.png)
   
   
   How can we increase the parallelism of the bloom index, given that hoodie 
calculates it internally rather than taking it from a configuration 
setting?
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
