Vsevolod3 commented on issue #8071: URL: https://github.com/apache/hudi/issues/8071#issuecomment-1591123193
Thank you, Danny. Yes, I got the best performance by using BUCKET index with non-nested, numeric, non-hive partitions. We were actually able to get it to perform in under 3 minutes for the stream_write task when we picked a partitioning field that resulted in fewer partitions (14 partitions) compared to our previous tests (> 90 partitions). It seems the number of partitions Hudi has to manage has a _very_ large impact on performance. Would you be able to share any documents or blog posts that explain partitioning for performance further? I read through most of the documents under Concepts on the Hudi website (e.g. https://hudi.apache.org/docs/next/indexing), but didn't find a lot dealing in depth with partitioning strategies. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
