[GitHub] [hudi] cb149 edited a comment on issue #3984: [SUPPORT] Upgrade from 0.8.0 to 0.9.0 removes functionality and decreases performance

GitBox Tue, 21 Dec 2021 01:28:03 -0800


cb149 edited a comment on issue #3984:
URL: https://github.com/apache/hudi/issues/3984#issuecomment-998610463



   @nsivabalan I am not specifically setting `hoodie.index.type`, so it should be using Bloom by default, right?
   
   I have changed `hoodie.parquet.small.file.limit` to 0 and now use clustering, which also improved the performance a bit, but it's still worse than in 0.8.0.
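
As a rough sketch of the settings described above (not the exact job configuration), the write options might look like the following. The inline-clustering key and the table type are assumptions added for illustration; the comment only states that the small file limit is 0, that clustering is used, and that `hoodie.index.type` is left unset.

```python
# Sketch of Hudi write options matching the settings described above.
# Only the small-file limit value comes from the comment; the rest is assumed.
hudi_options = {
    # COW table, as stated at the end of the comment.
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # 0 is meant to turn off small-file handling on write.
    "hoodie.parquet.small.file.limit": "0",
    # Assumption: clustering is run inline; it could also be scheduled async.
    "hoodie.clustering.inline": "true",
}

# hoodie.index.type is deliberately not set, so the engine default applies.

# Hypothetical usage with an existing Spark DataFrame `df` and path `base_path`:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```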
   
   Most of the time is spent during file creation, e.g. for this 113MB file:
   ```
   21/12/21 09:03:01 INFO io.HoodieCreateHandle: New CreateHandle for partition :year=2021/month=12/day=21 with fileId 5ec7339d-b2a3-4cee-97bb-ea60b637e411-0
   21/12/21 09:04:02 INFO io.HoodieCreateHandle: Closing the file 5ec7339d-b2a3-4cee-97bb-ea60b637e411-0 as we are done with all the records 3402219
   21/12/21 09:04:03 INFO io.HoodieCreateHandle: CreateHandle for partitionPath year=2021/month=12/day=21 fileID 5ec7339d-b2a3-4cee-97bb-ea60b637e411-0, took 62292 ms.
   ```
   Afterwards, it takes another 35 seconds to create a second ~56MB file with roughly half the records.
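
To put the log figures in perspective, the write rate can be worked out from the record count, file size, and elapsed time reported above. This is only back-of-the-envelope arithmetic: the helper name is made up for illustration, and the sizes are the on-disk file sizes, so the MB/s figure is the output rate rather than the input rate. It comes out to roughly 55 thousand records per second and about 1.8 MB/s for the first file.

```python
def throughput(records, size_mb, elapsed_ms):
    """Return (records per second, MB per second) for one file write."""
    seconds = elapsed_ms / 1000.0
    return records / seconds, size_mb / seconds


# Figures taken from the HoodieCreateHandle log lines above.
rec_per_s, mb_per_s = throughput(records=3402219, size_mb=113, elapsed_ms=62292)
print(f"first file: {rec_per_s:,.0f} records/s, {mb_per_s:.2f} MB/s")

# Second file: ~56MB in ~35 seconds (record count only known as "about half").
second_mb_per_s = 56 / 35
print(f"second file: {second_mb_per_s:.2f} MB/s")
```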
   
   Every hour, 2 files in the same partition are generated by 2 sequential `Getting small files from partitions` stages. Is it possible to write them in parallel instead of sequentially?
   
   I am using Impala as the query engine, so COW still seems the better choice over MOR.


