[
https://issues.apache.org/jira/browse/HUDI-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914822#comment-17914822
]
Shuo Cheng commented on HUDI-8004:
----------------------------------
Hi [~kongwei], are you still working on this? If not, I'll take the ticket.
> optimize the write handle performance
> -------------------------------------
>
> Key: HUDI-8004
> URL: https://issues.apache.org/jira/browse/HUDI-8004
> Project: Apache Hudi
> Issue Type: Improvement
> Components: performance, writer-core
> Reporter: Kong Wei
> Priority: Major
> Fix For: 1.0.1
>
> Attachments: image-2024-07-19-11-43-41-388.png,
> image-2024-07-19-11-45-55-460.png, image-2024-07-19-11-47-09-873.png,
> image-2024-07-19-12-02-35-729.png
>
>
> Background:
> When merging records with a base file,
> org.apache.hudi.table.action.commit.HoodieMergeHelper#runMerge checks whether
> the writerSchema (from the write client) is a strict projection of the
> readerSchema (from the base file),
> !image-2024-07-19-11-43-41-388.png!
> if so (ignoring the other conditions), we can {*}skip rewriting the record with the new
> schema{*}, which can save about 16% CPU.
> !image-2024-07-19-11-45-55-460.png!
> Here is my CPU sampling:
> !image-2024-07-19-11-47-09-873.png!
>
> The issue:
> When checking the strict projection condition, if the reader schema has a
> `Fixed` type, the writerSchema will never be a projection of the readerSchema,
> because the readerSchema (from the base parquet file) converts the Fixed field to
> {code:java}
> {
>   "name": "ctime",
>   "type": [
>     "null",
>     {
>       "type": "fixed",
>       "name": "ctime",
>       "namespace": "",
>       "size": 9,
>       "logicalType": "decimal",
>       "precision": 20,
>       "scale": 0
>     }
>   ],
>   "default": null
> }, {code}
>
> while the writerSchema (from the write client) converts the Fixed field to
> {code:java}
> {
>   "name": "ctime",
>   "type": [
>     "null",
>     {
>       "type": "fixed",
>       "name": "fixed",
>       "namespace": "test_db.test_table.ctime",
>       "size": 9,
>       "logicalType": "decimal",
>       "precision": 20,
>       "scale": 0
>     }
>   ],
>   "default": null
> }, {code}
>
> The name and namespace of the inner Fixed type differ, so even though both
> schemas should be the same, the strict projection check returns false.
>
> I checked the cause of this inconsistency: it is in
> org.apache.spark.sql.avro.SchemaConverters#toAvroType, which converts decimal
> to fixed
> !image-2024-07-19-12-02-35-729.png!
> That code seems to have been copied from spark-avro by
> [HUDI-3549|https://github.com/apache/hudi/pull/4955]
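
To make the mismatch in the quoted description concrete, here is a minimal, self-contained sketch. It is only a toy model: the FixedDecimal class and both comparison methods are hypothetical names invented for illustration, not Avro or Hudi APIs, and it does not reproduce the actual strict projection check in HoodieMergeHelper. It only shows that the two "ctime" definitions above differ in name and namespace while agreeing on size, precision and scale.

```java
// Toy model of the two inner "fixed" definitions quoted in the issue.
// FixedDecimal, nameSensitiveEquals and sameShape are hypothetical helpers,
// not part of Avro, Spark or Hudi.
class FixedNameMismatchDemo {

    static final class FixedDecimal {
        final String name;
        final String namespace;
        final int size;
        final int precision;
        final int scale;

        FixedDecimal(String name, String namespace, int size, int precision, int scale) {
            this.name = name;
            this.namespace = namespace;
            this.size = size;
            this.precision = precision;
            this.scale = scale;
        }

        // Comparison that also requires name and namespace to match.
        boolean nameSensitiveEquals(FixedDecimal other) {
            return name.equals(other.name)
                    && namespace.equals(other.namespace)
                    && sameShape(other);
        }

        // Comparison that looks only at size, precision and scale.
        boolean sameShape(FixedDecimal other) {
            return size == other.size
                    && precision == other.precision
                    && scale == other.scale;
        }
    }

    // Inner fixed type as read back from the base parquet file.
    static FixedDecimal readerFixed() {
        return new FixedDecimal("ctime", "", 9, 20, 0);
    }

    // Inner fixed type as produced on the write client side.
    static FixedDecimal writerFixed() {
        return new FixedDecimal("fixed", "test_db.test_table.ctime", 9, 20, 0);
    }

    public static void main(String[] args) {
        FixedDecimal reader = readerFixed();
        FixedDecimal writer = writerFixed();
        System.out.println("name-sensitive match: " + reader.nameSensitiveEquals(writer)); // false
        System.out.println("shape-only match: " + reader.sameShape(writer));               // true
    }
}
```

Whether the right fix is to relax the comparison or to make both code paths generate the same fixed name is not stated in the issue; the sketch only illustrates why a name-sensitive comparison fails here.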
--
This message was sent by Atlassian Jira
(v8.20.10#820010)