hudi-agent commented on code in PR #18182:
URL: https://github.com/apache/hudi/pull/18182#discussion_r3246191076


##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -1382,6 +1387,22 @@ public Map<String, String> getTableMergeProperties() {
     return configs;
   }
 
+  /**
+   * Enables or disables the bucketing for file groups within the metadata 
table partitions.
+   *
+   * @param enable Whether to enable or disable bucketing
+   */
+  public void setMetadataTableBucketing(boolean enable) {

Review Comment:
   🤖 nit: the setter is `setMetadataTableBucketing` but the getter is 
`isMetadataTablePartitionBucketingEnabled` — the missing `Partition` makes the 
pair feel like they target different things. Could you rename the setter to 
`setMetadataTablePartitionBucketing` for symmetry?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2294,6 +2311,93 @@ public static String 
deleteMetadataTablePartition(HoodieTableMetaClient dataMeta
     return null;
   }
 
+  /**

Review Comment:
   🤖 While tracing the new bucketing flow, `HoodieMetadataMetrics.getStats` 
(line 110 of that file) still calls 
`fsView.getLatestFileSlices(metadataPartition)` directly on raw MDT partition 
names like `files`/`record_index`. Once bucketing is enabled the file groups 
live under `<partition>/<bucket>/`, so this returns empty and every 
`totalBaseFileSizeInBytes` / `totalLogFileSizeInBytes` / `baseFileCount` / 
`logFileCount` metric reports 0 after each commit (`updateSizeMetrics` is 
invoked from `HoodieBackedTableMetadataWriter` at line 1932). Could you route 
that loop through `getPartitionLatestFileSlices` (or otherwise reuse 
`getPartitionsToList`) so the metrics stay meaningful when bucketing is on?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -687,6 +687,16 @@ public long getMaxLogFileSize() {
     return getLong(MAX_LOG_FILE_SIZE_BYTES_PROP);
   }
 
+  public static final ConfigProperty<Boolean> FILE_GROUP_BUCKETING_ENABLE = 
ConfigProperty
+          .key(METADATA_PREFIX + ".file.group.bucketing.enable")
+          .defaultValue(false)
+          .withDocumentation("This flag indicates whether there should be 
intermediate partitions under partitions such as record index and filelisting");
+
+  public static final ConfigProperty<Integer> FILE_GROUP_BUCKET_SIZE = 
ConfigProperty
+          .key(METADATA_PREFIX + ".file.group.bucket.size")
+          .defaultValue(1000)
+          .withDocumentation("This is paired with " + 
FILE_GROUP_BUCKETING_ENABLE.key() + ". This represents number of shards under a 
single bucket");

Review Comment:
   🤖 nit: the doc says "number of shards under a single bucket" — 'shards' 
isn't used anywhere else in this context and 'under a single bucket' is 
backwards (the size defines the bucket, not what's inside one). Something like 
"Maximum number of file groups per bucket sub-directory" would be clearer.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2294,6 +2311,93 @@ public static String 
deleteMetadataTablePartition(HoodieTableMetaClient dataMeta
     return null;
   }
 
+  /**
+   * Returns true if bucketing is enabled for the metadata table.
+   *
+   * <p>Reads the property from the MDT's own table config so we avoid an 
extra disk read of the
+   * data table's hoodie.properties on every file-slice lookup. The property 
is persisted on both
+   * the data table and the MDT during MDT initialization (see
+   * {@link #setMetadataTablePartitionBucketing(HoodieTableMetaClient, 
HoodieTableMetaClient, boolean)}).
+   *
+   * @param metaClient MetaClient for the metadata table
+   * @return true if bucketing is enabled
+   */
+  private static boolean isBucketingEnabledForMDT(HoodieTableMetaClient 
metaClient) {
+    return 
metaClient.getTableConfig().isMetadataTablePartitionBucketingEnabled();
+  }
+
+  /**
+   * Returns the list of partitions to list for file slices.
+   * When bucketing is enabled, returns the bucket sub-directories.
+   * When bucketing is disabled, returns the partition itself.
+   *
+   * <p>Note: bucketing is fixed at MDT initialization time (see
+   * {@link 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#initializeFileGroups}).
 A given
+   * MDT is therefore either fully bucketed or fully non-bucketed; mixed-state 
file groups under
+   * partition/ alongside partition/&lt;bucket&gt;/ cannot occur. We 
deliberately do not also scan the
+   * partition root when bucketing is enabled.
+   *
+   * @param metaClient MetaClient for the metadata table
+   * @param partition The partition path
+   * @param bucketingEnabled Whether bucketing is enabled
+   * @return List of partition paths to list
+   */
+  private static List<String> getPartitionsToList(HoodieTableMetaClient 
metaClient, String partition, boolean bucketingEnabled) {
+    List<String> partitionsToList = new ArrayList<>();
+    try {
+      if (bucketingEnabled) {
+        // Find all the buckets in the partition
+        StoragePath partitionPath = new StoragePath(metaClient.getBasePath(), 
partition);
+        List<StoragePathInfo> statuses = 
metaClient.getStorage().listDirectEntries(partitionPath);
+        for (StoragePathInfo status : statuses) {
+          if (status.isDirectory()) {
+            partitionsToList.add(partition + StoragePath.SEPARATOR + 
status.getPath().getName());
+          }
+        }
+        log.info("Listing {} buckets in metadata table partition {}", 
partitionsToList.size(), partition);
+      } else {
+        // Only list the partition folder as bucketing is not enabled
+        partitionsToList.add(partition);
+        log.info("Listing non-bucketed metadata table partition {}", 
partition);
+      }
+    } catch (IOException e) {
+      throw new HoodieMetadataException("Failed to check for bucketing in 
metadata table partition " + partition, e);
+    }
+    return partitionsToList;
+  }
+
+  /**
+   * Set the bucketing of partitions within the metadata table.
+   *
+   * <p>The property is persisted on both the data table (authoritative source 
for MDT initialization
+   * decisions) and the MDT (to allow reader code to determine bucketing 
without loading data table
+   * properties on every call).
+   *
+   * @param dataMetaClient     MetaClient for the dataset
+   * @param metadataMetaClient MetaClient for the MDT (may be null if MDT is 
not yet initialized)
+   * @param enabled            If true, metadata table is being used for this 
dataset, false otherwise

Review Comment:
   🤖 nit: the `@param enabled` doc looks copy-pasted — "If true, metadata table 
is being used for this dataset" describes availability, not bucketing. Could 
you update it to something like "If true, enable bucketing for MDT partitions; 
if false, disable it"?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -1370,6 +1390,16 @@ public HoodieMetadataConfig build() {
       return metadataConfig;
     }
 
+    public Builder withFileGroupBucketingEnable(boolean 
fileGroupBucketingEnable) {

Review Comment:
   🤖 nit: could you rename `withFileGroupBucketingEnable` to 
`withFileGroupBucketingEnabled`? The rest of the builder API uses past-tense 
adjectives for boolean state (e.g. `withMetadataIndexEnabled`), and `Enable` as 
a suffix reads more like a verb.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to