[
https://issues.apache.org/jira/browse/HUDI-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Danny Chen closed HUDI-8004.
----------------------------
Reviewers: Danny Chen
Resolution: Fixed
Fixed via master branch: 02472c91aac1892d76602795c3f816b58e9c90f7
> optimize the write handle performance
> -------------------------------------
>
> Key: HUDI-8004
> URL: https://issues.apache.org/jira/browse/HUDI-8004
> Project: Apache Hudi
> Issue Type: Improvement
> Components: performance, writer-core
> Reporter: Kong Wei
> Assignee: Shuo Cheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.1
>
> Attachments: image-2024-07-19-11-43-41-388.png,
> image-2024-07-19-11-45-55-460.png, image-2024-07-19-11-47-09-873.png,
> image-2024-07-19-12-02-35-729.png
>
>
> Background:
> When merging records with the base file,
> org.apache.hudi.table.action.commit.HoodieMergeHelper#runMerge checks whether
> the writerSchema (from the write client) is a strict projection of the
> readerSchema (from the base file).
> !image-2024-07-19-11-43-41-388.png!
> If it is (ignoring other conditions), we can {*}skip rewriting the record with
> the new schema{*}, which can save about 16% CPU.
> !image-2024-07-19-11-45-55-460.png!
> Here is my CPU sampling:
> !image-2024-07-19-11-47-09-873.png!
>
> The issue:
> When checking the strict projection condition, if the reader schema contains a
> `Fixed` type, the writerSchema will never be a projection of the readerSchema,
> because the readerSchema (from the base parquet file) converts the Fixed field to
> {code:java}
> {
>   "name": "ctime",
>   "type": [
>     "null",
>     {
>       "type": "fixed",
>       "name": "ctime",
>       "namespace": "",
>       "size": 9,
>       "logicalType": "decimal",
>       "precision": 20,
>       "scale": 0
>     }
>   ],
>   "default": null
> },
> {code}
>
> while the writerSchema (from the write client) converts the Fixed field to
> {code:java}
> {
>   "name": "ctime",
>   "type": [
>     "null",
>     {
>       "type": "fixed",
>       "name": "fixed",
>       "namespace": "test_db.test_table.ctime",
>       "size": 9,
>       "logicalType": "decimal",
>       "precision": 20,
>       "scale": 0
>     }
>   ],
>   "default": null
> },
> {code}
>
> The name and namespace of the inner Fixed type differ, so even though the two
> schemas should be equivalent, the strict projection check returns false.
>
> I traced the inconsistency to
> org.apache.spark.sql.avro.SchemaConverters#toAvroType, which converts decimal
> to fixed:
> !image-2024-07-19-12-02-35-729.png!
> The code appears to have been copied from spark-avro by
> [HUDI-3549|https://github.com/apache/hudi/pull/4955]
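
The name mismatch described in the quoted report can be reproduced with a small, self-contained sketch. It uses neither Avro nor Hudi: FixedDecimal is a hypothetical stand-in that models only the attributes visible in the two schema snippets, and sameTypeStrict / sameLayout are illustrative names, not methods of either library. The sketch only shows why a name-sensitive comparison rejects two fixed types with identical size, precision and scale. It does not claim to show how the issue was actually fixed.

```java
// Minimal sketch, assuming a simplified model of an Avro "fixed" type with a
// decimal logical type. All class and method names here are hypothetical.
class FixedNameMismatchDemo {

    static final class FixedDecimal {
        final String name;
        final String namespace;
        final int size;
        final int precision;
        final int scale;

        FixedDecimal(String name, String namespace, int size, int precision, int scale) {
            this.name = name;
            this.namespace = namespace;
            this.size = size;
            this.precision = precision;
            this.scale = scale;
        }

        // Avro full name: "<namespace>.<name>", or just "<name>" when the namespace is empty.
        String fullName() {
            return namespace.isEmpty() ? name : namespace + "." + name;
        }

        // Name-insensitive comparison: only the physical layout and decimal attributes.
        boolean sameLayout(FixedDecimal other) {
            return size == other.size
                && precision == other.precision
                && scale == other.scale;
        }

        // Name-sensitive comparison: the full name must match in addition to the layout.
        boolean sameTypeStrict(FixedDecimal other) {
            return fullName().equals(other.fullName()) && sameLayout(other);
        }
    }

    public static void main(String[] args) {
        // Values taken from the readerSchema and writerSchema snippets in the report.
        FixedDecimal reader = new FixedDecimal("ctime", "", 9, 20, 0);
        FixedDecimal writer = new FixedDecimal("fixed", "test_db.test_table.ctime", 9, 20, 0);

        System.out.println(reader.fullName());             // ctime
        System.out.println(writer.fullName());             // test_db.test_table.ctime.fixed
        System.out.println(reader.sameTypeStrict(writer)); // false
        System.out.println(reader.sameLayout(writer));     // true
    }
}
```

Whether a projection check may safely ignore the name of a fixed type is a design decision. The sketch only illustrates that the two types in the report differ in nothing but their names.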