harshagudladona commented on issue #14070:
URL: https://github.com/apache/hudi/issues/14070#issuecomment-3402918681

   We have not modified the Hudi codebase. However, we have implemented a 
schema provider that transforms a proto schema into an Avro schema. 
   
   The use case is very straightforward. We have a simple COW/MOR table with 
INSERT and in-line small file handling, meaning new records are appended to 
one of the existing small files. 
   
   We think a computationally intensive schema equality check, new in 1.0.2, 
is slowing down record writes. As I mentioned earlier, in HoodieMergeHelper 
the reader schema is derived from 
   
   `Schema readerSchema = baseFileReader.getSchema();`
   
   This converts the MessageType (Parquet schema) to an Avro schema, along 
the lines of:
   
   `Schema avroFromParquetSchema = new 
AvroSchemaConverter(table.getStorage().getConf().unwrapAs(Configuration.class)).convert(parquetSchema);`
   
   The method isStrictProjectionOf accepts this avroFromParquetSchema and the 
writer schema, which are rarely equal. This will cause the merge operation to 
almost always rewrite the record unnecessarily. 
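   
   To illustrate the kind of mismatch I mean, here is a minimal, self-contained sketch. It does not use Hudi's or Avro's classes: `Field`, `strictlyEqual`, and `structurallyEqual` are hypothetical stand-ins, and the assumption that the schema rebuilt from the base file has lost its doc strings is only one example of how the two schemas could differ. It is not the actual logic of isStrictProjectionOf.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class SchemaCompareSketch {

    /** Hypothetical, simplified stand-in for a schema field (not an Avro or Hudi class). */
    static final class Field {
        final String name;
        final String type;
        final String doc; // metadata that a file-format round trip may drop (assumption)

        Field(String name, String type, String doc) {
            this.name = name;
            this.type = type;
            this.doc = doc;
        }
    }

    /** Strict check: field order, names, types, and metadata must all match. */
    static boolean strictlyEqual(List<Field> a, List<Field> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            Field x = a.get(i);
            Field y = b.get(i);
            if (!x.name.equals(y.name) || !x.type.equals(y.type) || !Objects.equals(x.doc, y.doc)) {
                return false;
            }
        }
        return true;
    }

    /** Structural check: only field order, names, and types must match. */
    static boolean structurallyEqual(List<Field> a, List<Field> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            Field x = a.get(i);
            Field y = b.get(i);
            if (!x.name.equals(y.name) || !x.type.equals(y.type)) {
                return false;
            }
        }
        return true;
    }

    /** Writer schema as supplied by a schema provider, with doc strings. */
    static List<Field> writerSchema() {
        return Arrays.asList(
            new Field("id", "long", "primary key"),
            new Field("payload", "string", "message body"));
    }

    /** Same fields as rebuilt from a base file, with doc strings lost (assumption). */
    static List<Field> schemaFromBaseFile() {
        return Arrays.asList(
            new Field("id", "long", null),
            new Field("payload", "string", null));
    }

    public static void main(String[] args) {
        List<Field> writer = writerSchema();
        List<Field> reader = schemaFromBaseFile();
        System.out.println("strictly equal:     " + strictlyEqual(writer, reader));     // false
        System.out.println("structurally equal: " + structurallyEqual(writer, reader)); // true
    }
}
```

   If the real check behaves like the strict variant here, two schemas that describe the same data would still be treated as different, and every record would take the rewrite path.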
   
   I cannot think of a reproducible example that simulates our setup. The 
test setup is exactly the same as the one described in issue 
https://github.com/apache/hudi/issues/13995. We ran ingestion from the same 
source on the same data with 0.14 and 1.0.2; 1.0.2 takes longer to finish, 
and in each iteration of processing records the write stage takes 20% longer 
in 1.0.2.
   
   That said, I am working on slightly changing how the input schemas are 
provided to isStrictProjectionOf, to see whether this helps avoid unnecessary 
rewrites. 

