voonhous opened a new pull request, #10267: URL: https://github.com/apache/hudi/pull/10267
### Change Logs

The goal of this PR is to correct the documented default value of `parquet.writeLegacyFormat` to avoid confusion. The `parquetWriteLegacyFormat` property mainly affects how the decimal type is written in parquet files. The details of how decimal types are encoded are specified here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

In summary:

1. If `parquetWriteLegacyFormat == true`, decimals are always written as `FIXED_LEN_BYTE_ARRAY`.
2. If `parquetWriteLegacyFormat == false`, decimals may be written as `INT32` / `INT64` / `FIXED_LEN_BYTE_ARRAY`, depending on the precision.

https://github.com/apache/hudi/blob/0bbfc0754b051490450b9484b69e2bc708ec475b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java#L322-L339

#### Example

```scala
test("Test COW decimal overwrite with legacy format issue") {
  withRecordType()(withTempDir { tmp =>
    val tableName = generateTableName
    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price decimal(6, 0),
         |  ts long
         |) using hudi
         | location '${tmp.getCanonicalPath}'
         | tblproperties (
         |  primaryKey = 'id',
         |  type = 'cow',
         |  preCombineField = 'ts',
         |  hoodie.bucket.index.num.buckets = 2,
         |  hoodie.index.bucket.engine = 'SIMPLE',
         |  hoodie.storage.layout.partitioner.class = 'org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner'
         | )
       """.stripMargin)
    spark.sql("set hoodie.datasource.write.operation=bulk_insert")
    // if the line below is commented out, hudi will use writeLegacyFormat=true
    // despite the documented default being false
    spark.sql("set hoodie.parquet.writelegacyformat.enabled=false")
    spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
    spark.sql(s"select * from $tableName")
  })
}
```

Running the above snippet under two conditions, once with:

```scala
spark.sql("set hoodie.parquet.writelegacyformat.enabled=false")
```

and once without it (commenting it out), we can see that the documented default
value is clearly not honored when inspecting the underlying schema of the parquet file using `parquet-tools`:

##### With `hoodie.parquet.writelegacyformat.enabled=false` explicitly set

```txt
############ Column(price) ############
name: price
path: price
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Decimal(precision=6, scale=0)
converted_type (legacy): DECIMAL
compression: GZIP (space_saved: -55%)
```

##### Without the config set (Hudi's internal default)

```txt
############ Column(price) ############
name: price
path: price
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=6, scale=0)
converted_type (legacy): DECIMAL
compression: GZIP (space_saved: -62%)
```

```mermaid
flowchart TD
    A[Start] --> B{{"set <br> hoodie.parquet.writelegacyformat.enabled=true|false <br> defined?"}}
    B -->|Yes| C{config value <br> is true?}
    B -->|"No [default]"| D[writeLegacyFormat.enabled=hudi-enforced]
    C -->|Yes| E[writeLegacyFormat.enabled=true]
    C -->|No| F[writeLegacyFormat.enabled=false]
```

As can be seen, `parquet.writeLegacyFormat` is only enforced when explicitly set; otherwise, Hudi's internal logic decides which format to use.

### Impact

None

### Risk level (write none, low medium or high below)

None

### Documentation Update

Will need to update the config description on asf-site. However:

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website.
  Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
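For reference, the precision-based mapping described in the Change Logs can be sketched as follows. This is a minimal illustration of the rules in the parquet-format spec, not Hudi's or Spark's actual code; the `physicalTypeForDecimal` helper is hypothetical:

```java
// Hypothetical helper illustrating the parquet-format LogicalTypes spec's
// rules for choosing a decimal's physical type; not actual Hudi code.
public class DecimalEncoding {
    // Per the spec, when writeLegacyFormat is false:
    //   precision 1..9   fits in an INT32
    //   precision 10..18 fits in an INT64
    //   larger precisions need FIXED_LEN_BYTE_ARRAY
    // When writeLegacyFormat is true, FIXED_LEN_BYTE_ARRAY is always used.
    static String physicalTypeForDecimal(int precision, boolean writeLegacyFormat) {
        if (writeLegacyFormat) {
            return "FIXED_LEN_BYTE_ARRAY";
        }
        if (precision <= 9) {
            return "INT32";
        } else if (precision <= 18) {
            return "INT64";
        }
        return "FIXED_LEN_BYTE_ARRAY";
    }

    public static void main(String[] args) {
        // decimal(6, 0) from the example table above:
        System.out.println(physicalTypeForDecimal(6, true));   // FIXED_LEN_BYTE_ARRAY
        System.out.println(physicalTypeForDecimal(6, false));  // INT32
        System.out.println(physicalTypeForDecimal(20, false)); // FIXED_LEN_BYTE_ARRAY
    }
}
```

This matches the `parquet-tools` output above: the same `decimal(6, 0)` column is `INT32` with legacy format off and `FIXED_LEN_BYTE_ARRAY` with it on.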

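The decision flow in the mermaid diagram above can likewise be sketched in code. This is an illustrative summary of the flowchart only; `resolveWriteLegacyFormat` and `hudiEnforcedDefault` are hypothetical names, not the actual `DataSourceUtils` implementation:

```java
import java.util.Map;

public class LegacyFormatResolver {
    // Sketch of the flowchart: an explicitly set
    // hoodie.parquet.writelegacyformat.enabled wins; otherwise the value
    // Hudi's internal logic enforces is used. Names are hypothetical.
    static boolean resolveWriteLegacyFormat(Map<String, String> sessionConf,
                                            boolean hudiEnforcedDefault) {
        String explicit = sessionConf.get("hoodie.parquet.writelegacyformat.enabled");
        if (explicit != null) {
            // Explicitly set: honor the user's value.
            return Boolean.parseBoolean(explicit);
        }
        // Not set: fall back to what Hudi's internal logic decides.
        return hudiEnforcedDefault;
    }

    public static void main(String[] args) {
        // No explicit setting: the hudi-enforced value (here, true) applies.
        System.out.println(resolveWriteLegacyFormat(Map.of(), true)); // true
        // Explicit setting overrides the hudi-enforced value.
        System.out.println(resolveWriteLegacyFormat(
            Map.of("hoodie.parquet.writelegacyformat.enabled", "false"), true)); // false
    }
}
```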