pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r618342707
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
##########
@@ -192,7 +193,7 @@ protected void initializeIncomingRecordsMap() {
     long memoryForMerge = IOUtils.getMaxMemoryPerPartitionMerge(taskContextSupplier, config.getProps());
     LOG.info("MaxMemoryPerPartitionMerge => " + memoryForMerge);
     this.keyToNewRecords = new ExternalSpillableMap<>(memoryForMerge, config.getSpillableMapBasePath(),
-        new DefaultSizeEstimator(), new HoodieRecordSizeEstimator(writerSchema));
+        new DefaultSizeEstimator(), new HoodieRecordSizeEstimator(inputSchema));
Review comment:
Hi @vinothchandar, the `inputSchema` is the schema of the input DataFrame,
while the `writeSchema` is the schema written to the table. They are always the
same except in the case of `MergeInto`, where the `inputSchema` may differ from
the `writeSchema`. For example:
> create table h0(id int, name string) using hudi;
> merge into h0 using (select 1 as id, 'a1' as name, 1 as flag ) s0
> on h0.id = s0.id
> when matched and flag = 1 then update set id = s0.id, name = s0.name
In this case, the `inputSchema` is `id:int, name:string, flag:int`, which is
the schema of `s0`, but the `writeSchema` is `id:int, name:string`; the `flag`
field is only used for the condition test, so the two schemas differ.
To solve this problem, we introduce the `inputSchema` to distinguish it from
the `writeSchema`. The `hoodie.write.schema` option can be set to specify the
`writeSchema` when the two need to differ; if `hoodie.write.schema` is not set,
the two schemas are the same.
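To make the distinction concrete, here is a minimal sketch in plain Java
(field names are modeled as strings, and `conditionOnlyFields` is a
hypothetical helper for illustration, not Hudi's actual API): it projects out
the fields that exist only in the input schema, such as `flag` in the
`MergeInto` example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemaProjection {

    // Hypothetical helper (not Hudi's API): fields present in the input
    // schema but absent from the write schema, i.e. condition-only fields.
    static List<String> conditionOnlyFields(List<String> inputSchema, List<String> writeSchema) {
        List<String> extra = new ArrayList<>(inputSchema);
        extra.removeAll(writeSchema);
        return extra;
    }

    public static void main(String[] args) {
        // Schemas from the MERGE INTO example above, modeled as strings.
        List<String> inputSchema = Arrays.asList("id:int", "name:string", "flag:int");
        List<String> writeSchema = Arrays.asList("id:int", "name:string");

        // `flag` drives the matched-condition test but is never written.
        System.out.println(conditionOnlyFields(inputSchema, writeSchema)); // prints [flag:int]
    }
}
```

When `hoodie.write.schema` is unset, this projection is the identity and both
schemas coincide, which matches the default behavior described above.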
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]