voonhous opened a new pull request, #10267: URL: https://github.com/apache/hudi/pull/10267
### Change Logs

The goal of this PR is to correct the documented default value of `parquet.writeLegacyFormat` to avoid confusion. The `parquetWriteLegacyFormat` property mainly affects how the decimal type is written in parquet files. The details of how decimal types are encoded are specified here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

In summary:

1. If `parquetWriteLegacyFormat == true`, decimals are always written as `FIXED_LEN_BYTE_ARRAY`.
2. If `parquetWriteLegacyFormat == false`, decimals may be written as `INT32` / `INT64` / `FIXED_LEN_BYTE_ARRAY`, depending on the precision.

https://github.com/apache/hudi/blob/0bbfc0754b051490450b9484b69e2bc708ec475b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java#L322-L339

#### Example

```scala
test("Test COW decimal overwrite with legacy format issue") {
  withRecordType()(withTempDir { tmp =>
    val tableName = generateTableName
    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price decimal(6, 0),
         |  ts long
         |) using hudi
         | location '${tmp.getCanonicalPath}'
         | tblproperties (
         |  primaryKey = 'id',
         |  type = 'cow',
         |  preCombineField = 'ts',
         |  hoodie.bucket.index.num.buckets = 2,
         |  hoodie.index.bucket.engine = 'SIMPLE',
         |  hoodie.storage.layout.partitioner.class = 'org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner'
         | )
       """.stripMargin)
    spark.sql("set hoodie.datasource.write.operation=bulk_insert")
    // if the line below is commented out, hudi will use writeLegacyFormat=true
    // despite the documented default being false
    spark.sql("set hoodie.parquet.writelegacyformat.enabled=false")
    spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
    spark.sql(s"select * from $tableName")
  })
}
```

Running the above snippet under two conditions, once with:

```scala
spark.sql("set hoodie.parquet.writelegacyformat.enabled=false")
```

and once without it (commenting it out), we can see that the documented default
value is clearly not honored when inspecting the underlying schema of the parquet file using `parquet-tools`:

##### With `hoodie.parquet.writelegacyformat.enabled=false` explicitly set

```txt
############ Column(price) ############
name: price
path: price
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Decimal(precision=6, scale=0)
converted_type (legacy): DECIMAL
compression: GZIP (space_saved: -55%)
```

##### Without the config set (Hudi's internal default)

```txt
############ Column(price) ############
name: price
path: price
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=6, scale=0)
converted_type (legacy): DECIMAL
compression: GZIP (space_saved: -62%)
```

```mermaid
flowchart TD
    A[Start] --> B{{"set <br> hoodie.parquet.writelegacyformat.enabled=true|false <br> defined?"}}
    B -->|Yes| C{config value <br> is true?}
    B -->|"No [default]"| D[writeLegacyFormat.enabled=hudi-enforced]
    C -->|Yes| E[writeLegacyFormat.enabled=true]
    C -->|No| F[writeLegacyFormat.enabled=false]
```

As can be seen, `parquet.writeLegacyFormat` is only enforced when explicitly set; otherwise, Hudi's internal logic decides which format to use.

### Impact

None

### Risk level (write none, low medium or high below)

None

### Documentation Update

Will need to update the config description on asf-site. However:

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website.
  Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
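For reference, the precision-based mapping described in the Change Logs can be sketched as follows. This is a minimal illustration of the rules in the parquet-format spec, not Hudi's or Spark's actual code; the `physicalTypeForDecimal` helper is hypothetical:

```java
// Hypothetical helper illustrating the parquet-format LogicalTypes spec's
// rules for choosing a decimal's physical type; not actual Hudi code.
public class DecimalEncoding {
    // Per the spec, when writeLegacyFormat is false:
    //   precision 1..9   fits in an INT32
    //   precision 10..18 fits in an INT64
    //   larger precisions need FIXED_LEN_BYTE_ARRAY
    // When writeLegacyFormat is true, FIXED_LEN_BYTE_ARRAY is always used.
    static String physicalTypeForDecimal(int precision, boolean writeLegacyFormat) {
        if (writeLegacyFormat) {
            return "FIXED_LEN_BYTE_ARRAY";
        }
        if (precision <= 9) {
            return "INT32";
        } else if (precision <= 18) {
            return "INT64";
        }
        return "FIXED_LEN_BYTE_ARRAY";
    }

    public static void main(String[] args) {
        // decimal(6, 0) from the example table above:
        System.out.println(physicalTypeForDecimal(6, true));   // FIXED_LEN_BYTE_ARRAY
        System.out.println(physicalTypeForDecimal(6, false));  // INT32
        System.out.println(physicalTypeForDecimal(20, false)); // FIXED_LEN_BYTE_ARRAY
    }
}
```

This matches the `parquet-tools` output above: the same `decimal(6, 0)` column is `INT32` with legacy format off and `FIXED_LEN_BYTE_ARRAY` with it on.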

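The decision flow in the mermaid diagram above can likewise be sketched in code. This is an illustrative summary of the flowchart only; `resolveWriteLegacyFormat` and `hudiEnforcedDefault` are hypothetical names, not the actual `DataSourceUtils` implementation:

```java
import java.util.Map;

public class LegacyFormatResolver {
    // Sketch of the flowchart: an explicitly set
    // hoodie.parquet.writelegacyformat.enabled wins; otherwise the value
    // Hudi's internal logic enforces is used. Names are hypothetical.
    static boolean resolveWriteLegacyFormat(Map<String, String> sessionConf,
                                            boolean hudiEnforcedDefault) {
        String explicit = sessionConf.get("hoodie.parquet.writelegacyformat.enabled");
        if (explicit != null) {
            // Explicitly set: honor the user's value.
            return Boolean.parseBoolean(explicit);
        }
        // Not set: fall back to what Hudi's internal logic decides.
        return hudiEnforcedDefault;
    }

    public static void main(String[] args) {
        // No explicit setting: the hudi-enforced value (here, true) applies.
        System.out.println(resolveWriteLegacyFormat(Map.of(), true)); // true
        // Explicit setting overrides the hudi-enforced value.
        System.out.println(resolveWriteLegacyFormat(
            Map.of("hoodie.parquet.writelegacyformat.enabled", "false"), true)); // false
    }
}
```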