[jira] [Updated] (HUDI-8004) optimize the write handle performance

Kong Wei (Jira) Thu, 18 Jul 2024 21:08:03 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kong Wei updated HUDI-8004:
---------------------------
    Description: 
The background:

When merging records with a base file, 
org.apache.hudi.table.action.commit.HoodieMergeHelper#runMerge checks whether 
the writerSchema (from the write client) is a strict projection of the 
readerSchema (from the base file).

!image-2024-07-19-11-43-41-388.png!

If it is (ignoring the other conditions), we can {*}skip rewriting each record 
with the new schema{*}, which saves about 16% CPU.
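
To make the condition concrete, here is a minimal Python sketch of a strict projection check over Avro-style schemas held as plain dicts. It is not Hudi's implementation: `is_strict_projection` and the sample schemas are hypothetical, and the sketch only assumes that every writer field must exist in the reader schema with an identical type.

```python
def is_strict_projection(reader_schema: dict, writer_schema: dict) -> bool:
    """Return True if every writer field exists in the reader schema with an equal type."""
    reader_fields = {f["name"]: f["type"] for f in reader_schema["fields"]}
    for field in writer_schema["fields"]:
        # A field unknown to the reader cannot be part of a projection.
        if field["name"] not in reader_fields:
            return False
        # Types are compared structurally, including any nested type names.
        if reader_fields[field["name"]] != field["type"]:
            return False
    return True


# Hypothetical schemas: the writer carries a subset of the reader's fields.
reader = {"type": "record", "name": "r", "fields": [
    {"name": "id", "type": "long"},
    {"name": "price", "type": "double"},
]}
writer = {"type": "record", "name": "r", "fields": [
    {"name": "id", "type": "long"},
]}

print(is_strict_projection(reader, writer))  # True
```

Because the comparison is structural, any difference inside a field's type, including the name of a nested named type, makes the check fail.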

!image-2024-07-19-11-45-55-460.png!

Here is my CPU sampling:

!image-2024-07-19-11-47-09-873.png!

 

The issue is:

When checking the strict projection condition, if the reader schema contains a 
`Fixed` type, the writerSchema is never considered a projection of the 
readerSchema.

This is because the readerSchema (from the base parquet file) represents the 
Fixed field as 
{code:java}
{
    "name": "ctime",
    "type": [
        "null",
        {
            "type": "fixed",
            "name": "ctime",
            "namespace": "",
            "size": 9,
            "logicalType": "decimal",
            "precision": 20,
            "scale": 0
        }
    ],
    "default": null
}
{code}
 

while the writerSchema (from the write client) represents the same Fixed field as 
{code:java}
{
    "name": "ctime",
    "type": [
        "null",
        {
            "type": "fixed",
            "name": "fixed",
            "namespace": "test_db.test_table.ctime",
            "size": 9,
            "logicalType": "decimal",
            "precision": 20,
            "scale": 0
        }
    ],
    "default": null
}
{code}
 

The name of the inner Fixed type differs, so even though the two schemas should 
be equivalent, the strict projection check returns false.
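
The mismatch can be reproduced with the two Fixed definitions alone. The sketch below is illustrative Python, not Hudi or Avro library code: `avro_fullname` and `same_ignoring_name` are hypothetical helpers, and the full-name rule (namespace and name joined by a dot, with an empty namespace meaning none) follows the Avro specification.

```python
def avro_fullname(named_type: dict) -> str:
    """Join namespace and name with a dot; an empty namespace means no namespace."""
    name = named_type["name"]
    namespace = named_type.get("namespace", "")
    # A dotted name already carries its namespace.
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def same_ignoring_name(a: dict, b: dict) -> bool:
    """Compare two named types on everything except name and namespace."""
    skip = ("name", "namespace")
    return ({k: v for k, v in a.items() if k not in skip}
            == {k: v for k, v in b.items() if k not in skip})


# The inner Fixed types from the reader and writer schemas shown above.
reader_fixed = {"type": "fixed", "name": "ctime", "namespace": "", "size": 9,
                "logicalType": "decimal", "precision": 20, "scale": 0}
writer_fixed = {"type": "fixed", "name": "fixed",
                "namespace": "test_db.test_table.ctime", "size": 9,
                "logicalType": "decimal", "precision": 20, "scale": 0}

print(avro_fullname(reader_fixed))                     # ctime
print(avro_fullname(writer_fixed))                     # test_db.test_table.ctime.fixed
print(reader_fixed == writer_fixed)                    # False
print(same_ignoring_name(reader_fixed, writer_fixed))  # True
```

Size, logical type, precision and scale all agree; only the full name of the Fixed type separates the two definitions.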

 

I traced the inconsistency to 
org.apache.spark.sql.avro.SchemaConverters#toAvroType, which converts decimal 
to fixed:

!image-2024-07-19-12-02-35-729.png!

The code appears to have been copied from spark-avro in 
[HUDI-3549|https://github.com/apache/hudi/pull/4955].
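
For readers without the screenshot, the following Python sketch shows how a name such as `test_db.test_table.ctime` / `fixed` and a size of 9 can arise. It is a model under stated assumptions, not the converter's code: the naming rule (`<namespace>.<field>.fixed`) and the size rule (smallest byte count whose signed range covers the decimal precision) reflect my reading of the spark-avro converter, and all function names here are hypothetical.

```python
def fixed_full_name(prev_namespace: str, field_name: str) -> str:
    """Assumed naming rule: append '.fixed' to the field's namespaced path."""
    if prev_namespace:
        return f"{prev_namespace}.{field_name}.fixed"
    return f"{field_name}.fixed"


def split_full_name(full_name: str) -> tuple:
    """Split a dotted full name into (namespace, name) at the last dot."""
    namespace, _, name = full_name.rpartition(".")
    return namespace, name


def min_bytes_for_precision(precision: int) -> int:
    """Assumed size rule: fewest bytes whose signed range holds 10**precision."""
    num_bytes = 1
    while 2 ** (8 * num_bytes - 1) < 10 ** precision:
        num_bytes += 1
    return num_bytes


# Field 'ctime' of record 'test_table' in namespace 'test_db', decimal(20, 0).
full_name = fixed_full_name("test_db.test_table", "ctime")
print(split_full_name(full_name))   # ('test_db.test_table.ctime', 'fixed')
print(min_bytes_for_precision(20))  # 9
```

Under these assumptions the writer side yields exactly the namespace, name and size seen in the writerSchema above, while the schema read from the parquet file names the type after the field itself.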


> optimize the write handle performance
> -------------------------------------
>
>                 Key: HUDI-8004
>                 URL: https://issues.apache.org/jira/browse/HUDI-8004
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: performance, writer-core
>            Reporter: Kong Wei
>            Priority: Major
>         Attachments: image-2024-07-19-11-43-41-388.png, 
> image-2024-07-19-11-45-55-460.png, image-2024-07-19-11-47-09-873.png, 
> image-2024-07-19-12-02-35-729.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
