[
https://issues.apache.org/jira/browse/HUDI-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Danny Chen closed HUDI-7466.
----------------------------
Fix Version/s: 0.15.0
1.0.0
Resolution: Fixed
Fixed via master branch: e726306cf0905512e3afe4b875cb15dfbef40e52
> AWS Glue sync
> -------------
>
> Key: HUDI-7466
> URL: https://issues.apache.org/jira/browse/HUDI-7466
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Vitali Makarevich
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Ticket to track
> [https://github.com/apache/hudi/pull/10460]
> Currently, AWS Glue sync provides two interfaces: {{HoodieHiveSyncClient}},
> which goes through Hive and then AWS's Hive-to-Glue implementation (hidden
> by AWS), and {{AWSGlueCatalogSyncClient}}.
> Both have limitations: although the Hive path has improved somewhat with
> pushdown filters, it still fails at large scale, and the fallback may not
> work for certain partitioning schemes.
> For syncing to Glue using {{HoodieHiveSyncClient}}, there is a set of
> limitations:
> # Create/update calls are not parallelized under the hood, so large
> partition sets sync very slowly - empirically at most about 40
> partitions/sec, which translates to minutes at larger scale.
> # The pushdown filter is not really effective. In the first case (specifying
> exact partitions) it behaves unpredictably: the longer the partition values,
> the fewer partitions can be specified - in our case no more than 100 - so
> sync falls back to the min-max predicate.
> # The min-max predicate does not help when the number of partitions grows
> with nesting, e.g. 10 partitions on level 1, 100 on level 2, 1000 on
> level 3. Min-max prunes the high-level values but still loads everything
> on the levels below them, so it is not really an optimization.
> # When there is e.g. a schema change, the Hive-to-Glue path issues a
> {{cascade}} update, which makes big tables impossible to sync in any
> meaningful time - and since Hudi does not specify a schema at the partition
> level in Glue, this is wasted effort.
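> To make the min-max limitation above concrete, here is a minimal,
> self-contained simulation (not Hudi or Glue code; all names are
> illustrative). With nested year/month/day partitioning, a fallback
> predicate built from the min and max top-level values of the changed
> partitions can end up pruning nothing:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation: why a min-max fallback predicate over-fetches
// when partition counts grow with nesting depth.
public class MinMaxPruningDemo {
    public static void main(String[] args) {
        // Partitions laid out as year/month/day: 5 years x 12 months x 30 days.
        List<String> all = new ArrayList<>();
        for (int y = 2020; y <= 2024; y++)
            for (int m = 1; m <= 12; m++)
                for (int d = 1; d <= 30; d++)
                    all.add(y + "/" + m + "/" + d);

        // Suppose only two partitions changed since the last sync.
        List<String> changed = List.of("2020/1/1", "2024/12/30");

        // The fallback filters only on the top-level value (year here),
        // degenerating to: year >= 2020 AND year <= 2024.
        int minYear = 2020, maxYear = 2024;
        long fetched = all.stream()
                .filter(p -> {
                    int year = Integer.parseInt(p.split("/")[0]);
                    return year >= minYear && year <= maxYear;
                })
                .count();

        // Every one of the 1800 partitions matches, to reconcile 2 changes.
        System.out.println("changed=" + changed.size() + " fetched=" + fetched);
    }
}
```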
> This is why {{AWSGlueCatalogSyncClient}} is preferable. But it has its own
> problems:
> # Create/update/delete calls were not optimized before - they are now
> async, but without a meaningful upper bound they simply reach the Glue
> request limit and stay there. *This solution adds a parameter for such
> parallelism and creates the parallelization logic.*
> # {{AWSGlueCatalogSyncClient}} always lists all partitions - this is far
> from optimal, since the goal is only to distinguish which of the partitions
> changed since the last sync were created and which were deleted, so a more
> targeted API can be used:
> [{{BatchGetPartition}}|https://docs.aws.amazon.com/glue/latest/webapi/API_BatchGetPartition.html].
> It also parallelizes easily. *I added a new method to the sync client
> classes, moved the Hive pushdown into a Hive-specific class, implemented
> the method for the AWS Glue client class, and added a parameter controlling
> parallelism.*
> # Listing all partitions is itself suboptimal - it is still needed for the
> initial sync/resync, but the straightforward implementation uses a single
> {{nextToken}}, which makes it sequential and slow on heavily partitioned
> tables. AWS offers an improvement for this particular
> [method|https://docs.aws.amazon.com/glue/latest/webapi/API_GetPartitions.html],
> called {{segment}}, which lets us create 1 to 10 start positions and use
> the standard ({{nextToken}}) API within each. Also, the
> [last public version of the Hive-to-Glue interface implementation uses
> it|https://docs.aws.amazon.com/glue/latest/webapi/API_GetPartitions.html].
> When we switched from the Hive sync class to the AWS Glue specific one, the
> first thing we faced was performance degradation in listing. *I added
> {{segment}} API parameter usage and a parameter controlling parallelism.*
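> The segment-based listing above can be sketched as follows. This is a
> minimal, self-contained illustration, not the actual Hudi or AWS SDK code:
> {{fetchSegment}} stands in for one segmented {{GetPartitions}} call (a real
> client would pass a Segment with segmentNumber/totalSegments plus a
> nextToken per page), and the thread pool plays the role of the parallelism
> parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: listing a heavily partitioned table by fanning out
// over Glue "segments", each segment paged independently.
public class SegmentedListingDemo {
    // Stand-in for the segmented GetPartitions API: returns the slice of
    // partitions belonging to one segment.
    static List<Integer> fetchSegment(int segment, int totalSegments, int totalPartitions) {
        List<Integer> out = new ArrayList<>();
        for (int p = segment; p < totalPartitions; p += totalSegments) out.add(p);
        return out;
    }

    public static void main(String[] args) throws Exception {
        int totalPartitions = 200_000; // scale reported in the ticket
        int totalSegments = 10;        // Glue allows 1 to 10 segments
        ExecutorService pool = Executors.newFixedThreadPool(totalSegments);

        // Launch one listing task per segment; each runs concurrently.
        List<Future<List<Integer>>> futures = new ArrayList<>();
        for (int s = 0; s < totalSegments; s++) {
            final int seg = s;
            futures.add(pool.submit(() -> fetchSegment(seg, totalSegments, totalPartitions)));
        }

        // Merge the per-segment results.
        int listed = 0;
        for (Future<List<Integer>> f : futures) listed += f.get().size();
        pool.shutdown();

        System.out.println("listed=" + listed);
    }
}
```

> The same bounded pool would cap in-flight create/update/delete requests so
> the client stays below the Glue request limit.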
> All of this has been tested on a partitioned table with >200k partitions.
> I got a speed improvement from 2-3 minutes down to 3 seconds. Let me know
> if you are interested in the numbers.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)