[
https://issues.apache.org/jira/browse/HUDI-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kong Wei updated HUDI-8004:
---------------------------
Description:
Background:
When merging records with a base file,
org.apache.hudi.table.action.commit.HoodieMergeHelper#runMerge checks whether
the writerSchema (from the write client) is a strict projection of the
readerSchema (from the base file):
!image-2024-07-19-11-43-41-388.png!
If it is (ignoring the other conditions), we can {*}skip rewriting records with
the new schema{*}, which saves about 16% CPU:
!image-2024-07-19-11-45-55-460.png!
Here is my CPU sampling result:
!image-2024-07-19-11-47-09-873.png!
The issue:
When checking the strict projection condition, if the reader schema contains a
`Fixed` type, the writerSchema is never considered a projection of the
readerSchema, because the readerSchema (from the base parquet file) represents
the Fixed field as
```
{
  "name": "ctime",
  "type": [
    "null",
    {
      "type": "fixed",
      "name": "ctime",
      "namespace": "",
      "size": 9,
      "logicalType": "decimal",
      "precision": 20,
      "scale": 0
    }
  ],
  "default": null
}
```
while the writerSchema (from the write client) represents the same field as
```
{
  "name": "ctime",
  "type": [
    "null",
    {
      "type": "fixed",
      "name": "fixed",
      "namespace": "test_db.test_table.ctime",
      "size": 9,
      "logicalType": "decimal",
      "precision": 20,
      "scale": 0
    }
  ],
  "default": null
}
```
The name of the inner Fixed type differs, so even though the two schemas should
be equivalent, the strict projection check returns false.
I traced this inconsistency to
org.apache.spark.sql.avro.SchemaConverters#toAvroType, which converts decimal to
fixed:
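To make the mismatch concrete, here is a minimal Python sketch that compares the two inner Fixed types shown above. It is not Hudi code: `full_name`, `same_type_strict` and `same_type_ignoring_name` are hypothetical helpers, and the strict comparison only assumes that the real check requires named types to agree on their Avro full name (namespace plus name).

```python
# Inner "fixed" types copied from the two schemas above.
READER_FIXED = {
    "type": "fixed", "name": "ctime", "namespace": "",
    "size": 9, "logicalType": "decimal", "precision": 20, "scale": 0,
}
WRITER_FIXED = {
    "type": "fixed", "name": "fixed", "namespace": "test_db.test_table.ctime",
    "size": 9, "logicalType": "decimal", "precision": 20, "scale": 0,
}


def full_name(named_type):
    """Avro full name: '<namespace>.<name>', or just '<name>' if no namespace."""
    namespace = named_type.get("namespace", "")
    name = named_type["name"]
    return f"{namespace}.{name}" if namespace else name


def same_type_ignoring_name(a, b):
    """Compare only the properties that affect how the bytes are encoded."""
    keys = ("type", "size", "logicalType", "precision", "scale")
    return all(a.get(k) == b.get(k) for k in keys)


def same_type_strict(a, b):
    """Hypothetical strict comparison: full names must match as well."""
    return full_name(a) == full_name(b) and same_type_ignoring_name(a, b)


print(full_name(READER_FIXED))   # ctime
print(full_name(WRITER_FIXED))   # test_db.test_table.ctime.fixed
print(same_type_strict(READER_FIXED, WRITER_FIXED))         # False
print(same_type_ignoring_name(READER_FIXED, WRITER_FIXED))  # True
```

The two types agree on size, logical type, precision and scale, so they encode the same values; only the generated name differs, and that alone is enough to fail a name-sensitive comparison.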
!image-2024-07-19-12-02-35-729.png!
This code appears to have been copied from spark-avro in
[HUDI-3549|https://github.com/apache/hudi/pull/4955]
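As a rough illustration of where the two names could come from, the Python sketch below derives a Fixed type name the way the writer schema above suggests: the namespace, the field name, and a literal `fixed` suffix joined by dots. This naming rule is an assumption inferred from the schemas in this issue, not a quote of the converter code; `spark_style_fixed_name` and `split_full_name` are hypothetical helpers. The split into name and namespace follows the Avro rule that a dotted name is a full name whose last component is the name.

```python
def spark_style_fixed_name(record_name, namespace):
    """Assumed writer-side rule: '<namespace>.<recordName>.fixed'."""
    if namespace:
        return f"{namespace}.{record_name}.fixed"
    return f"{record_name}.fixed"


def split_full_name(full):
    """Split an Avro full name at the last dot into (name, namespace)."""
    namespace, _, name = full.rpartition(".")
    return name, namespace


# Writer side: field "ctime" inside a record with namespace "test_db.test_table".
writer_full = spark_style_fixed_name("ctime", "test_db.test_table")
print(split_full_name(writer_full))  # ('fixed', 'test_db.test_table.ctime')

# Reader side: the Fixed type simply reuses the field name, with no namespace.
print(split_full_name("ctime"))      # ('ctime', '')
```

Under this assumption the writer side always produces the name `fixed` with a namespace ending in the field name, which matches the writer schema above and can never equal the reader side's bare `ctime`.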
was:
> optimize the write handle performance
> -------------------------------------
>
> Key: HUDI-8004
> URL: https://issues.apache.org/jira/browse/HUDI-8004
> Project: Apache Hudi
> Issue Type: Improvement
> Components: performance, writer-core
> Reporter: Kong Wei
> Priority: Major
> Attachments: image-2024-07-19-11-43-41-388.png,
> image-2024-07-19-11-45-55-460.png, image-2024-07-19-11-47-09-873.png,
> image-2024-07-19-12-02-35-729.png