Re: [PR] [HUDI-7466] Add parallel listing of existing partitions in Glue Catalog sync [hudi]

via GitHub Thu, 14 Mar 2024 09:21:51 -0700


VitoMakarevich commented on PR #10460:
URL: https://github.com/apache/hudi/pull/10460#issuecomment-1997839983


   @yihua I found that the functionality I'm trying to bring in collides 
heavily with https://github.com/apache/hudi/pull/10572.
   So for loading the list of partitions, there are now the following 
mechanisms:
   1. Get all - has always been present and simply fetches everything. It does 
not scale well with a lot of partitions - in our case in particular, each commit 
can spend 10+ minutes just getting partitions.
   2. Generate a pushdown filter (a fairly recent addition, with further 
improvements since) - if the filter fits in approximately 2048 characters, 
generate a list of all partitions; otherwise take the min/max of the changed 
partitions and read within that range. The 2048 limit is only approximate, 
since whether the filter fits depends on partition depth, name length and so on, 
and min/max is better than nothing but suffers from entropy: scattered changes 
make the range cover many unchanged partitions.
   3. The mechanism I use - simply call batchGetPartition with all changed 
partitions - scales almost indefinitely.
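
   Mechanisms 2 and 3 above can be sketched roughly as follows. This is an 
illustration in Python, not the Hudi (Java) code: `build_pushdown_filter`, 
`find_existing_partitions`, `split_changed` and the `batch_get` callable are 
hypothetical names, the exact filter is assumed to enumerate the changed 
partitions, the min/max fallback is simplified to one range per partition field, 
and the chunk size of 1000 is an assumed per-request limit for batchGetPartition.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple

PartitionValues = Sequence[str]

# Assumption: one batchGetPartition request accepts a bounded number of
# partitions; 1000 is used here as the chunk size.
MAX_PARTITIONS_PER_REQUEST = 1000


def build_pushdown_filter(fields: Sequence[str],
                          changed: List[PartitionValues],
                          max_len: int = 2048) -> str:
    """Mechanism 2 (simplified): exact OR-list if it fits, else a min/max range."""
    exact = " OR ".join(
        "(" + " AND ".join(f"{f} = '{v}'" for f, v in zip(fields, values)) + ")"
        for values in changed
    )
    if len(exact) <= max_len:
        return exact
    # Fallback: one range per field. Scattered changes widen the range, so the
    # catalog may return many partitions that did not change.
    return " AND ".join(
        f"{f} >= '{min(col)}' AND {f} <= '{max(col)}'"
        for f, col in zip(fields, zip(*changed))
    )


def chunked(items: List[PartitionValues], size: int) -> Iterable[List[PartitionValues]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def find_existing_partitions(
    batch_get: Callable[[List[PartitionValues]], List[PartitionValues]],
    changed: List[PartitionValues],
    chunk_size: int = MAX_PARTITIONS_PER_REQUEST,
) -> Set[Tuple[str, ...]]:
    """Mechanism 3: look up only the changed partitions, one chunk per request.

    `batch_get` is a hypothetical wrapper around a single batchGetPartition
    call: it receives partition values and returns the ones that exist.
    """
    existing: Set[Tuple[str, ...]] = set()
    for chunk in chunked(changed, chunk_size):
        existing.update(tuple(values) for values in batch_get(chunk))
    return existing


def split_changed(batch_get, changed, chunk_size=MAX_PARTITIONS_PER_REQUEST):
    """Split changed partitions into those to create and those to update."""
    existing = find_existing_partitions(batch_get, changed, chunk_size)
    to_create = [p for p in changed if tuple(p) not in existing]
    to_update = [p for p in changed if tuple(p) in existing]
    return to_create, to_update
```

   In this sketch the number of catalog requests grows with the number of 
changed partitions rather than with the total number of partitions in the table, 
and the chunks are independent, so they could also be issued in parallel.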
   
   Basically, the 1st and 2nd are what is in the `master` branch - but with my 
approach we don't need them at all, since `get all` is needed only initially 
(when creating the Glue database); after that we can use incremental only. 
**But since they are in master, we may need to keep backward compatibility - can 
you suggest a course forward?** I can add a feature flag that keeps the existing 
behavior - but I would like to use it only for HiveSyncTool, where it makes more 
sense, and use my mechanism for the AWS implementation.
   
   As for the rest - such as the improved listing of all partitions and the 
parallelized update/create/delete - it does not collide with anything.
   Waiting for your advice: I'm now stuck because of failing ITs, but I need to 
know which direction we are going.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
