[GitHub] [hudi] JoshuaZhuCN commented on issue #7602: [SUPPORT] When does the Spark engine's bulk insert mode support bucket index

via GitHub Sat, 28 Jan 2023 23:47:57 -0800


JoshuaZhuCN commented on issue #7602:
URL: https://github.com/apache/hudi/issues/7602#issuecomment-1407590148


   > w/ bucket index, what perf issue are you seeing? From what I know, there
   > may not be any small file handling even w/ "insert" as the operation type
   > if you are using bucket index. So, it should be pretty close to
   > bulk_insert. I mean, even if we add bucket index support to bulk_insert,
   > it will perform similar to how insert works as of today w/ bulk_insert.
   >
   > Essentially, we take the hash of the record key and find the file group to
   > insert into, and this goes into the merge handle, where we merge incoming
   > records w/ the existing file group.
   
   @nsivabalan When a table with the bucket index is written in insert mode,
log files are generated first, and the parquet files can only be regenerated
after compaction is triggered. This differs from the other index types, where
inserts generate parquet files directly and only updates and deletes generate
log files.
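
   For illustration, the routing described in the quoted reply (hash the
record key, then pick the file group) could be sketched roughly as below. This
is only a minimal sketch, not Hudi's code: `bucket_id`, `route`, the use of
`zlib.crc32` as the hash, and the bucket count are all assumptions made for the
example.

```python
import zlib


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket number in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # crc32 is a stand-in hash (an assumption), chosen because it is
    # stable across runs; it is not the hash Hudi uses.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def route(record_keys, num_buckets):
    """Group record keys by the bucket (file group) they hash to."""
    buckets = {}
    for key in record_keys:
        buckets.setdefault(bucket_id(key, num_buckets), []).append(key)
    return buckets


if __name__ == "__main__":
    # Hypothetical keys; the same key always lands in the same bucket.
    print(route(["id-1", "id-2", "id-3", "id-1"], num_buckets=4))
```

   Because the bucket is a pure function of the key, a later write of the same
key always targets the same file group, which is why no index lookup is needed
to locate it.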


