Re: [PR] [HUDI-9365] Reduce overhead of Hive and AWS Glue sync tools [hudi]

via GitHub Thu, 22 May 2025 17:58:53 -0700


the-other-tim-brown commented on code in PR #13249:
URL: https://github.com/apache/hudi/pull/13249#discussion_r2103620780



##########
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java:
##########
@@ -233,19 +237,28 @@ protected void syncHoodieTable(String tableName, boolean useRealtimeInputFormat,
     LOG.info("Trying to sync hoodie table " + tableName + " with base path " + syncClient.getBasePath()
         + " of type " + syncClient.getTableType());
 
-    // create database if needed
-    checkAndCreateDatabase();
-
     final boolean tableExists = syncClient.tableExists(tableName);
-    // Get the parquet schema for this table looking at the latest commit
-    MessageType schema = syncClient.getStorageSchema(!config.getBoolean(HIVE_SYNC_OMIT_METADATA_FIELDS));
     // if table exists and location of the metastore table doesn't match the hoodie base path, recreate the table
     if (tableExists && !FSUtils.comparePathsWithoutScheme(syncClient.getBasePath(), syncClient.getTableLocation(tableName))) {
       LOG.info("basepath is updated for the table {}", tableName);
       recreateAndSyncHiveTable(tableName, useRealtimeInputFormat, readAsOptimized);
       return;
     }
 
+    // Check if any sync is required
+    if (tableExists && isIncrementalSync()) {
+      Option<String> lastCommitTimeSynced = syncClient.getLastCommitTimeSynced(tableName);
+      Option<String> lastCommitCompletionTimeSynced = syncClient.getLastCommitCompletionTimeSynced(tableName);
+      if (lastCommitTimeSynced.isPresent()) {
+        if (TimelineUtils.getCommitsTimelineAfter(syncClient.getMetaClient(), lastCommitTimeSynced.get(), lastCommitCompletionTimeSynced).countInstants() == 0) {

Review Comment:
   No, you are not really following the use case. You run the sync without
   waiting for the next commit so that you are up to date with the latest
   commit when the application starts. When you are running this for hundreds
   of tables, this check saves a lot of API calls and a lot of startup time.
   Without it, you first fetch the schema by reading the timeline and one
   parquet file to get the operation type. Then you make more calls to Glue to
   sync the properties and schema, which will be the same since there are no
   new commits.
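
The early exit described above can be illustrated with a minimal, self-contained sketch. It does not use Hudi or Glue APIs: `SyncSkipSketch`, `isUpToDate`, and `countTablesNeedingSync` are hypothetical names, the timeline is modeled as a plain list of completed instant times, and the completion-time handling from the diff is left out. It assumes instant times are fixed-width timestamp strings, so string comparison orders them.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class SyncSkipSketch {

    /**
     * Hypothetical stand-in for the "is any sync required" check:
     * returns true when no completed instant is newer than the last synced one.
     */
    public static boolean isUpToDate(List<String> completedInstants, Optional<String> lastCommitTimeSynced) {
        if (!lastCommitTimeSynced.isPresent()) {
            return false; // never synced before, so a sync is required
        }
        String lastSynced = lastCommitTimeSynced.get();
        // Assumption: fixed-width timestamps, so lexicographic order matches time order.
        return completedInstants.stream().noneMatch(t -> t.compareTo(lastSynced) > 0);
    }

    /** Counts the tables that would still need schema and property calls to the catalog. */
    public static long countTablesNeedingSync(Map<String, List<String>> timelines, Map<String, String> lastSynced) {
        return timelines.entrySet().stream()
            .filter(e -> !isUpToDate(e.getValue(), Optional.ofNullable(lastSynced.get(e.getKey()))))
            .count();
    }

    public static void main(String[] args) {
        // Made-up table names and instant times, for illustration only.
        Map<String, List<String>> timelines = new LinkedHashMap<>();
        timelines.put("orders", Arrays.asList("20250520120000000", "20250521120000000"));
        timelines.put("users", Arrays.asList("20250520120000000"));

        Map<String, String> lastSynced = new LinkedHashMap<>();
        lastSynced.put("orders", "20250520120000000");
        lastSynced.put("users", "20250520120000000");

        System.out.println(countTablesNeedingSync(timelines, lastSynced)
            + " of " + timelines.size() + " tables need a sync");
    }
}
```

In this sketch only the table with a commit newer than its last synced instant proceeds to the expensive part of the sync; the other table returns after a single lookup, which is where the savings across hundreds of tables would come from.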



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
