VitoMakarevich commented on PR #10460: URL: https://github.com/apache/hudi/pull/10460#issuecomment-1997839983
@yihua I found that the functionality I'm trying to bring in heavily collides with https://github.com/apache/hudi/pull/10572. For loading a list of partitions there are now three mechanisms:

1. **Get all** - has always been present; it simply fetches everything. This does not scale well with many partitions - in our case each commit can spend 10+ minutes just listing partitions.
2. **Pushdown filter** (a fairly recent addition, with further improvements lately) - if the filter fits within roughly 2048 characters, it enumerates all changed partitions; otherwise it falls back to the min/max of the changed partitions and reads within that range. The 2048 limit is imprecise - it depends on partition depth, value lengths, and so on - and the min/max fallback is better than nothing but suffers from entropy (a wide range can still match many unrelated partitions).
3. **The mechanism I use** - simply call `batchGetPartition` with all changed partitions. This scales almost indefinitely.

Mechanisms 1 and 2 are what is in `master`, but with my approach we don't need them at all: `get all` is only needed initially (when creating the Glue database), and after that we can sync incrementally. **But since it's in master, we may need to preserve backward compatibility - can you suggest a course forward?** I can put the new behavior behind a feature flag, but I would prefer to keep the old behavior only for `HiveSyncTool` - where it makes more sense - and use my mechanism for the AWS implementation. The rest - the improved listing of all partitions and the parallelized create/update/delete - does not collide with anything. Waiting for your advice; I'm currently stuck on failing ITs, and I need to know which direction we go.
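To illustrate mechanism 3, here is a minimal sketch of the batched lookup using a boto3-style Glue client. The `BatchGetPartition` API accepts at most 1000 partitions per request, so the changed-partition list is chunked; partitions that no longer exist are simply absent from the response. The function name `get_changed_partitions_batched` is illustrative and not Hudi's actual implementation.

```python
MAX_BATCH_SIZE = 1000  # Glue BatchGetPartition accepts up to 1000 partitions per call


def get_changed_partitions_batched(glue, database, table, partition_values):
    """Fetch exactly the changed partitions, in chunks of up to 1000.

    partition_values: list of partition value lists, e.g. [["2024", "01"], ...]
    Returns the Glue partition objects that currently exist; values not
    found in the catalog are simply omitted from the response.
    """
    found = []
    for start in range(0, len(partition_values), MAX_BATCH_SIZE):
        chunk = partition_values[start:start + MAX_BATCH_SIZE]
        resp = glue.batch_get_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToGet=[{"Values": values} for values in chunk],
        )
        found.extend(resp.get("Partitions", []))
    return found
```

Because each request names the exact partitions it wants, the cost grows linearly with the number of *changed* partitions rather than with the total partition count, which is why this avoids both the full listing and the 2048-character filter limit.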
