[GitHub] [hudi] aditiwari01 commented on a change in pull request #4468: [Issue: #2802] Fixing Hive getSchema for RT tables

GitBox Thu, 30 Dec 2021 01:11:28 -0800


aditiwari01 commented on a change in pull request #4468:
URL: https://github.com/apache/hudi/pull/4468#discussion_r776632909




##########
File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
##########
@@ -77,19 +74,17 @@ private boolean usesCustomPayload() {
   }
 
   /**
-   * Goes through the log files in reverse order and finds the schema from the last available data block. If not, falls
+   * Gets schema from HoodieTableMetaClient. If not, falls
    * back to the schema from the latest parquet file. Finally, sets the partition column and projection fields into the
    * job conf.
    */
-  private void init() throws IOException {
-    Schema schemaFromLogFile = LogReaderUtils.readLatestSchemaFromLogFiles(split.getBasePath(), split.getDeltaLogFiles(), jobConf);
-    if (schemaFromLogFile == null) {
-      writerSchema = InputSplitUtils.getBaseFileSchema((FileSplit)split, jobConf);
-      LOG.info("Writer Schema From Parquet => " + writerSchema.getFields());
-    } else {
-      writerSchema = schemaFromLogFile;
-      LOG.info("Writer Schema From Log => " + writerSchema.toString(true));
-    }
+  private void init() throws Exception {
+
+    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(split.getPath().getFileSystem(jobConf).getConf()).setBasePath(split.getBasePath()).build();
+    TableSchemaResolver schemaUtil = new TableSchemaResolver(metaClient);

Review comment:
   @nsivabalan The idea behind this change is that TableSchemaResolver gives
us the latest table schema. Some partitions may lack a few columns that are
present in the latest schema, but those columns will be set to NULL for that
partition.
   
   The issue with the current logic is that it resolves a different schema for
each partition, and these schemas conflict later if multiple partitions are
read at the same time.
   
   @xiarixiaoyao As for the TODO you mentioned, I'm not sure of the complete
context, but with this patch, fields that are present in the latest schema but
missing from the partition schema are nulled out.
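
   To illustrate the behaviour described above, here is a minimal,
self-contained sketch. It is not Hudi code: it models a schema as an ordered
list of field names and a record as a field-to-value map, and the class name
SchemaPadding and the method padToLatest are hypothetical names made up for
this example. It only shows the idea that every partition is read against one
latest schema, with fields missing from an older partition filled with null.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model (assumption): schema = ordered field names, record = field-to-value map. */
final class SchemaPadding {

  /**
   * Projects a record written with an older partition schema onto the latest
   * table schema. Fields that the record does not carry come back as null.
   */
  static Map<String, Object> padToLatest(List<String> latestSchema, Map<String, Object> record) {
    Map<String, Object> padded = new LinkedHashMap<>();
    for (String field : latestSchema) {
      // Map.get returns null for absent keys, which models "NULL for that partition".
      padded.put(field, record.get(field));
    }
    return padded;
  }

  public static void main(String[] args) {
    // Latest table schema: "city" was added after the old partition was written.
    List<String> latestSchema = Arrays.asList("id", "name", "city");

    Map<String, Object> oldPartitionRecord = new LinkedHashMap<>();
    oldPartitionRecord.put("id", 1);
    oldPartitionRecord.put("name", "a");

    Map<String, Object> newPartitionRecord = new LinkedHashMap<>();
    newPartitionRecord.put("id", 2);
    newPartitionRecord.put("name", "b");
    newPartitionRecord.put("city", "x");

    // Both records now expose the same three fields, in the same order.
    System.out.println(padToLatest(latestSchema, oldPartitionRecord));
    System.out.println(padToLatest(latestSchema, newPartitionRecord));
  }
}
```

   Because both partitions are projected onto the same field list, a reader
that handles them together sees one consistent schema, which is the conflict
the patch is meant to avoid.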




