Re: [PR] [HUDI-9612] Update schema resolver to inspect files with inserts/updates [hudi]

via GitHub Mon, 21 Jul 2025 18:37:38 -0700


the-other-tim-brown commented on code in PR #13586:
URL: https://github.com/apache/hudi/pull/13586#discussion_r2220755797



##########
hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java:
##########
@@ -248,28 +250,27 @@ private Option<Schema> 
getTableSchemaFromCommitMetadata(HoodieInstant instant, b
    */
   private Option<Schema> getTableParquetSchemaFromDataFile() {
     Option<Pair<HoodieInstant, HoodieCommitMetadata>> instantAndCommitMetadata 
= getLatestCommitMetadataWithValidData();
-    try {
-      switch (metaClient.getTableType()) {
-        case COPY_ON_WRITE:
-        case MERGE_ON_READ:
-          // For COW table, data could be written in either Parquet or Orc 
format currently;
-          // For MOR table, data could be written in either Parquet, Orc, 
Hfile or Delta-log format currently;
-          //
-          // Determine the file format based on the file name, and then 
extract schema from it.
-          if (instantAndCommitMetadata.isPresent()) {
-            HoodieCommitMetadata commitMetadata = 
instantAndCommitMetadata.get().getRight();
-            Iterator<String> filePaths = 
commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
-            return Option.of(fetchSchemaFromFiles(filePaths));
-          } else {
-            LOG.warn("Could not find any data file written for commit, so 
could not get schema for table {}", metaClient.getBasePath());
-            return Option.empty();
-          }
-        default:
-          LOG.error("Unknown table type {}", metaClient.getTableType());
-          throw new InvalidTableException(metaClient.getBasePath().toString());
-      }
-    } catch (IOException e) {
-      throw new HoodieException("Failed to read data schema", e);
+    switch (metaClient.getTableType()) {
+      case COPY_ON_WRITE:
+      case MERGE_ON_READ:
+        // For COW table, data could be written in either Parquet or Orc 
format currently;
+        // For MOR table, data could be written in either Parquet, Orc, Hfile 
or Delta-log format currently;
+        //
+        // Determine the file format based on the file name, and then extract 
schema from it.
+        if (instantAndCommitMetadata.isPresent()) {
+          HoodieCommitMetadata commitMetadata = 
instantAndCommitMetadata.get().getRight();
+          // inspect non-empty files for schema
+          Stream<StoragePath> filePaths = 
commitMetadata.getPartitionToWriteStats().values().stream().flatMap(Collection::stream)
+              .filter(writeStat -> writeStat.getNumInserts() > 0 || 
writeStat.getNumUpdateWrites() > 0)

Review Comment:
   The check in `getLatestCommitMetadataWithValidData` is just that there is at 
least on file with numInserts or numUpdateWrites > 1. This check inspects the 
individual write stats so we do not try to read the schema from any files that 
do not have at least one insert/update. This prevents us from inspecting log 
files which are all deletes and don't yield a schema.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [HUDI-9612] Update schema resolver to inspect files with inserts/updates [hudi]

Reply via email to