NetsanetGeb edited a comment on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI
URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-516753477

After I switched to Hoodie version 0.4.6, the performance improved, and the job now takes 4 minutes.

![per2](https://user-images.githubusercontent.com/25975892/62195353-1c158600-b37c-11e9-905f-0b04213e614f.png)

I also added code similar to the countByKey call to count the records in the HoodieDeltaStreamer class, to check why it takes so long in HoodieBloomIndex. That count took about 9 seconds, while the countByKey in HoodieBloomIndex still takes 39 seconds. This seems to be due to parallelism: the first count has a parallelism of 22, while the one in HoodieBloomIndex has 2, as observed in the Spark UI below.

How can we increase the parallelism of the bloom index, since Hoodie calculates it internally without it having to be set in the configuration?

![per1](https://user-images.githubusercontent.com/25975892/62196336-1caf1c00-b37e-11e9-89f1-894387485ec7.png)