Re: [PR] [HUDI-7466] Add parallel listing of existing partitions in Glue Catalog sync [hudi]

via GitHub Thu, 14 Mar 2024 06:27:00 -0700


VitoMakarevich commented on code in PR #10460:
URL: https://github.com/apache/hudi/pull/10460#discussion_r1524878225



##########
hudi-aws/src/main/java/org/apache/hudi/config/GlueCatalogSyncClientConfig.java:
##########
@@ -40,6 +42,28 @@ public class GlueCatalogSyncClientConfig extends HoodieConfig {
       .sinceVersion("0.14.0")
       .withDocumentation("Glue catalog sync based client will skip archiving the table version if this config is set to true");
 
+  public static final ConfigProperty<Integer> ALL_PARTITIONS_READ_PARALLELISM = ConfigProperty
+      .key(GLUE_CLIENT_PROPERTY_PREFIX + "all_partitions_read_parallelism")
+      .defaultValue(1)
+      .markAdvanced()
+      .withValidValues(IntStream.rangeClosed(1, 10).mapToObj(Integer::toString).toArray(String[]::new))
+      .sinceVersion("1.0.0")
+      .withDocumentation("Parallelism for listing all partitions(first time sync). Should be in interval [1, 10].");
+
+  public static final ConfigProperty<Integer> CHANGED_PARTITIONS_READ_PARALLELISM = ConfigProperty
+      .key(GLUE_CLIENT_PROPERTY_PREFIX + "changed_partitions_read_parallelism")
+      .defaultValue(1)
+      .markAdvanced()
+      .sinceVersion("1.0.0")
+      .withDocumentation("Parallelism for listing changed partitions(second and subsequent syncs).");

Review Comment:
   Yeah, because ALL_PARTITIONS_READ_PARALLELISM is limited to 1-10 and uses 
[GetPartitions](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-partitions.html#aws-glue-api-catalog-partitions-GetPartition)
 for the initial load. It lets us split the N initial partitions into up to 10 
segments and fetch each segment independently (within a segment, paging works 
the same as it does without segments, via a continuation token).
   CHANGED_PARTITIONS_READ_PARALLELISM, on the other hand, uses 
[BatchGetPartition](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-partitions.html#aws-glue-api-catalog-partitions-BatchGetPartition).
 The trick there is that we can name exactly the partitions we need (we know 
them all from the commit file). In theory the parallelism here can be very 
high, but a user will likely want to cap it at some number to avoid running 
into many retries. One request covers up to 1000 partitions, so it is highly 
unlikely that anyone operates at a scale where this matters, but still.
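
   To make the difference between the two settings concrete, here is a minimal 
sketch of the two read plans described above. It is not the code from this PR 
and makes no AWS SDK calls: `PartitionReadSketch`, `segments`, `batches`, 
`fetchAll`, and the `fetcher` callback are hypothetical names, and `fetcher` 
stands in for whatever actually issues the Glue request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, for illustration only; not part of Hudi or the AWS SDK.
final class PartitionReadSketch {
  // Upper bound on segments for a segmented full listing (the [1, 10] interval above).
  static final int MAX_SEGMENTS = 10;
  // Upper bound on partitions named in one batch request (the "1000 partitions" above).
  static final int MAX_BATCH_SIZE = 1000;

  // Initial sync: one {segmentNumber, totalSegments} pair per parallel reader.
  // Segment numbers are zero-based.
  static List<int[]> segments(int parallelism) {
    if (parallelism < 1 || parallelism > MAX_SEGMENTS) {
      throw new IllegalArgumentException("parallelism must be in [1, " + MAX_SEGMENTS + "]");
    }
    List<int[]> result = new ArrayList<>();
    for (int i = 0; i < parallelism; i++) {
      result.add(new int[] {i, parallelism});
    }
    return result;
  }

  // Subsequent syncs: split the known changed partitions into request-sized batches.
  static <T> List<List<T>> batches(List<T> items, int batchSize) {
    if (batchSize < 1) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> result = new ArrayList<>();
    for (int start = 0; start < items.size(); start += batchSize) {
      int end = Math.min(start + batchSize, items.size());
      result.add(new ArrayList<>(items.subList(start, end)));
    }
    return result;
  }

  // Runs one fetch per request on a pool capped at `parallelism` threads and
  // concatenates the results in request order.
  static <T, R> List<R> fetchAll(List<T> requests, int parallelism, Function<T, List<R>> fetcher) {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      List<Future<List<R>>> futures = new ArrayList<>();
      for (T request : requests) {
        Callable<List<R>> task = () -> fetcher.apply(request);
        futures.add(pool.submit(task));
      }
      List<R> out = new ArrayList<>();
      for (Future<List<R>> future : futures) {
        out.addAll(future.get());
      }
      return out;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    } catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    } finally {
      pool.shutdownNow();
    }
  }
}
```

   With this shape, the number of segments is bounded by the API itself, while 
the number of batches grows with the number of changed partitions, which is 
why only the second setting needs a user-chosen cap on concurrent requests.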



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]