cb149 edited a comment on issue #3984: URL: https://github.com/apache/hudi/issues/3984#issuecomment-998610463
@nsivabalan I am not explicitly setting `hoodie.index.type`, so it should be using Bloom by default, right? I have changed `hoodie.parquet.small.file.limit` to 0 and use clustering, which also improved the performance a bit, but it's still worse than in 0.8.0.

Most of the time is spent during file creation, e.g. for this 113MB file:

```
21/12/21 09:03:01 INFO io.HoodieCreateHandle: New CreateHandle for partition :year=2021/month=12/day=21 with fileId 5ec7339d-b2a3-4cee-97bb-ea60b637e411-0
21/12/21 09:04:02 INFO io.HoodieCreateHandle: Closing the file 5ec7339d-b2a3-4cee-97bb-ea60b637e411-0 as we are done with all the records 3402219
21/12/21 09:04:03 INFO io.HoodieCreateHandle: CreateHandle for partitionPath year=2021/month=12/day=21 fileID 5ec7339d-b2a3-4cee-97bb-ea60b637e411-0, took 62292 ms.
```

Afterwards it takes another 35 seconds to create a second ~56MB file with roughly half the records.

Every hour I can see that 2 files in the same partition are generated by 2 sequential `Getting small files from partitions` stages. Is it possible to write them in parallel instead of sequentially?

I am using Impala as the query engine, so COW still seems the better choice over MOR.
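To put the quoted timing in perspective, here is a rough Python sketch that turns the final `HoodieCreateHandle` log line into a write throughput. It is only a back-of-the-envelope check: the helper name `write_throughput` is hypothetical, the regex assumes the exact `took N ms` wording shown in the log, and 113 MB is the approximate file size quoted in the comment.

```python
import re

# Final log line from the excerpt in the comment.
LOG_LINE = (
    "21/12/21 09:04:03 INFO io.HoodieCreateHandle: CreateHandle for partitionPath "
    "year=2021/month=12/day=21 fileID 5ec7339d-b2a3-4cee-97bb-ea60b637e411-0, "
    "took 62292 ms."
)


def write_throughput(log_line, records, size_mb):
    """Return (records per second, MB per second) for one create-handle log line.

    Hypothetical helper: it only extracts the 'took N ms' duration; the record
    count and file size are supplied by the caller.
    """
    match = re.search(r"took (\d+) ms", log_line)
    if match is None:
        raise ValueError("no 'took N ms' duration found in log line")
    seconds = int(match.group(1)) / 1000.0
    return records / seconds, size_mb / seconds


# Record count taken from the 'Closing the file' log line; size is approximate.
records_per_s, mb_per_s = write_throughput(LOG_LINE, records=3402219, size_mb=113)
print(f"{records_per_s:,.0f} records/s, {mb_per_s:.2f} MB/s")
```

For the file above this works out to roughly 55 thousand records per second, or about 1.8 MB/s of Parquet output for that single write handle.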
