[ 
https://issues.apache.org/jira/browse/HUDI-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914822#comment-17914822
 ] 

Shuo Cheng commented on HUDI-8004:
----------------------------------

Hi [~kongwei], are you still working on this? If not, I'll take the ticket.

> optimize the write handle performance
> -------------------------------------
>
>                 Key: HUDI-8004
>                 URL: https://issues.apache.org/jira/browse/HUDI-8004
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: performance, writer-core
>            Reporter: Kong Wei
>            Priority: Major
>             Fix For: 1.0.1
>
>         Attachments: image-2024-07-19-11-43-41-388.png, 
> image-2024-07-19-11-45-55-460.png, image-2024-07-19-11-47-09-873.png, 
> image-2024-07-19-12-02-35-729.png
>
>
> Background:
> When merging records with a base file, 
> org.apache.hudi.table.action.commit.HoodieMergeHelper#runMerge checks whether 
> the writerSchema (from the write client) is a strict projection of the 
> readerSchema (from the base file):
> !image-2024-07-19-11-43-41-388.png!
> If it is (ignoring the other conditions), we can {*}skip rewriting each record 
> with the new schema{*}, which can save about 16% CPU:
> !image-2024-07-19-11-45-55-460.png!
> Here is my CPU sampling:
> !image-2024-07-19-11-47-09-873.png!
>  
> The issue:
> When the strict projection condition is checked and the reader schema contains a 
> `Fixed` type, the writerSchema is never a projection of the readerSchema.
> This is because the readerSchema (from the base parquet file) converts the Fixed field to 
> {code:java}
>         {
>             "name": "ctime",
>             "type": [
>                 "null",
>                 {
>                     "type": "fixed",
>                     "name": "ctime",
>                     "namespace": "",
>                     "size": 9,
>                     "logicalType": "decimal",
>                     "precision": 20,
>                     "scale": 0
>                 }
>             ],
>             "default": null
>         }, {code}
>  
> while the writerSchema (from the write client) converts the Fixed field to 
> {code:java}
>         {
>             "name": "ctime",
>             "type": [
>                 "null",
>                 {
>                     "type": "fixed",
>                     "name": "fixed",
>                     "namespace": "test_db.test_table.ctime",
>                     "size": 9,
>                     "logicalType": "decimal",
>                     "precision": 20,
>                     "scale": 0
>                 }
>             ],
>             "default": null
>         }, {code}
>  
> The name and namespace of the inner Fixed type differ, so even though the two 
> schemas should be equivalent, the strict projection check returns false.
>  
> I traced the inconsistency to 
> org.apache.spark.sql.avro.SchemaConverters#toAvroType, which converts decimal 
> to fixed:
> !image-2024-07-19-12-02-35-729.png!
> That code appears to have been copied from spark-avro in 
> [HUDI-3549|https://github.com/apache/hudi/pull/4955].
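
The mismatch described in the quoted issue can be made concrete with a small Python sketch. It does not call Hudi, Spark, or Avro: plain dicts stand in for the two inner Fixed types, and `full_name` and `fixed_types_match` are hypothetical helpers. The assumption is that the projection check compares named types by their full name (namespace plus name); the real check lives in Hudi and may differ in detail.

```python
# Sketch only: plain dicts stand in for Avro schemas; no Hudi or Avro code is called.

def full_name(fixed_type):
    """Return an Avro-style full name: 'namespace.name', or just 'name' if no namespace."""
    namespace = fixed_type.get("namespace", "")
    name = fixed_type["name"]
    return f"{namespace}.{name}" if namespace else name


def fixed_types_match(reader_fixed, writer_fixed, compare_names=True):
    """Hypothetical comparison of two fixed/decimal types (assumed rule, not Hudi's code)."""
    same_layout = all(
        reader_fixed[key] == writer_fixed[key]
        for key in ("type", "size", "logicalType", "precision", "scale")
    )
    if not compare_names:
        return same_layout
    return same_layout and full_name(reader_fixed) == full_name(writer_fixed)


# Inner fixed type as read back from the base parquet file (values from the issue).
reader_fixed = {
    "type": "fixed", "name": "ctime", "namespace": "",
    "size": 9, "logicalType": "decimal", "precision": 20, "scale": 0,
}
# Inner fixed type as produced by the write client (values from the issue).
writer_fixed = {
    "type": "fixed", "name": "fixed", "namespace": "test_db.test_table.ctime",
    "size": 9, "logicalType": "decimal", "precision": 20, "scale": 0,
}

print(full_name(reader_fixed))   # ctime
print(full_name(writer_fixed))   # test_db.test_table.ctime.fixed
print(fixed_types_match(reader_fixed, writer_fixed))                       # False
print(fixed_types_match(reader_fixed, writer_fixed, compare_names=False))  # True
```

The two types agree on size, precision, and scale and differ only in their full names, which is why a name-sensitive comparison rejects them. The `compare_names=False` branch is there only to show that the layouts are identical; the issue does not say whether ignoring names is the intended fix.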



--
This message was sent by Atlassian Jira
(v8.20.10#820010)