NetsanetGeb commented on issue #714: Performance Comparison of 
HoodieDeltaStreamer and DataSourceAPI
URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-516753477
 
 
   After I switched to Hoodie version 0.4.6, the performance improved, and the 
job now takes 4 minutes. 
   
![per2](https://user-images.githubusercontent.com/25975892/62195353-1c158600-b37c-11e9-905f-0b04213e614f.png)
   
   I also added a countByKey, similar to the one in HoodieBloomIndex, to the 
HoodieDeltaStreamer class to count the records and check why the one in 
HoodieBloomIndex takes so long. The added count took about 9 seconds, while 
the countByKey in HoodieBloomIndex still takes 39 seconds. This seems to be due 
to parallelism: the first count runs with a parallelism of 22, while the one in 
HoodieBloomIndex runs with 2, as observed from the Spark UI below. How can we 
increase the parallelism of the bloom index, given that Hoodie calculates it 
internally instead of reading it from a configuration setting?
   
   
![per1](https://user-images.githubusercontent.com/25975892/62196336-1caf1c00-b37e-11e9-89f1-894387485ec7.png)
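
To illustrate why I suspect parallelism, here is a minimal sketch in plain Python. It is not Spark or Hoodie code: `split_into_partitions` and `count_by_key` are hypothetical helpers, and the thread pool only stands in for executor slots. It shows that a per-key count creates one task per partition, so the same count over 2 partitions can keep at most 2 workers busy, even when 22 are available.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Round-robin the records into num_partitions lists (hypothetical helper)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_by_key(partitions, max_workers=22):
    """Count keys per partition in parallel, then merge the partial counts.

    One task is submitted per partition, so the number of partitions caps
    how many workers can be busy at once, regardless of max_workers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(
            pool.map(lambda part: Counter(key for key, _ in part), partitions)
        )
    total = Counter()
    for partial in partials:
        total.update(partial)
    # Return the merged counts and the number of tasks that were run.
    return dict(total), len(partitions)


# Made-up (key, value) records; the keys are illustrative only.
records = [(f"partition/{i % 3}", i) for i in range(12)]

counts_22, tasks_22 = count_by_key(split_into_partitions(records, 22))
counts_2, tasks_2 = count_by_key(split_into_partitions(records, 2))
```

Both calls return the same counts; only the number of tasks differs (22 versus 2), which matches the difference I see between the two countByKey stages.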
   

