pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r618342707
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
##########
@@ -192,7 +193,7 @@ protected void initializeIncomingRecordsMap() {
     long memoryForMerge = IOUtils.getMaxMemoryPerPartitionMerge(taskContextSupplier, config.getProps());
     LOG.info("MaxMemoryPerPartitionMerge => " + memoryForMerge);
     this.keyToNewRecords = new ExternalSpillableMap<>(memoryForMerge, config.getSpillableMapBasePath(),
-        new DefaultSizeEstimator(), new HoodieRecordSizeEstimator(writerSchema));
+        new DefaultSizeEstimator(), new HoodieRecordSizeEstimator(inputSchema));
Review comment:
Hi @vinothchandar, the `inputSchema` is the schema of the input DataFrame,
while the `writeSchema` is the schema written to the table. They are always the
same except in the case of `MergeInto`, where the `inputSchema` may differ from
the `writeSchema`. For example:
> create table h0(id int, name string) using hudi;
> merge into h0 using (select 1 as id, 'a1' as name, 1 as flag ) s0
> on h0.id = s0.id
> when matched and flag = 1 then update set id = s0.id, name = s0.name
In this case, the `inputSchema` is `id:int, name:string, flag:int`, which is
the schema of `s0`, but the `writeSchema` is `id:int, name:string`; the `flag`
field is only used for the condition test, so the two schemas differ.
To solve this problem, we introduce the `inputSchema` to distinguish it from
the `writeSchema`. The `hoodie.write.schema` option can be set to specify the
`writeSchema` when the two need to differ; if `hoodie.write.schema` is not set,
the two schemas are the same.
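To make the distinction concrete, here is a minimal sketch in plain Java
(field names are modeled as strings, and `conditionOnlyFields` is a
hypothetical helper for illustration, not Hudi's actual API): it projects out
the fields that exist only in the input schema, such as `flag` in the
`MergeInto` example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemaProjection {

    // Hypothetical helper (not Hudi's API): fields present in the input
    // schema but absent from the write schema, i.e. condition-only fields.
    static List<String> conditionOnlyFields(List<String> inputSchema, List<String> writeSchema) {
        List<String> extra = new ArrayList<>(inputSchema);
        extra.removeAll(writeSchema);
        return extra;
    }

    public static void main(String[] args) {
        // Schemas from the MERGE INTO example above, modeled as strings.
        List<String> inputSchema = Arrays.asList("id:int", "name:string", "flag:int");
        List<String> writeSchema = Arrays.asList("id:int", "name:string");

        // `flag` drives the matched-condition test but is never written.
        System.out.println(conditionOnlyFields(inputSchema, writeSchema)); // prints [flag:int]
    }
}
```

When `hoodie.write.schema` is unset, this projection is the identity and both
schemas coincide, which matches the default behavior described above.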
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]