[GitHub] [hudi] askldjd opened a new issue #4233: [SUPPORT] Hudi parquet INT16 conversion loses LogicalType information

GitBox Mon, 06 Dec 2021 14:14:27 -0800


askldjd opened a new issue #4233:
URL: https://github.com/apache/hudi/issues/4233



   **Describe the problem you faced**
   
   Our source parquet file has an INT16 column that contains a mixture of positive and negative values. When we convert the parquet file to the hudi format, the INT16 column in the hudi parquet loses its [LogicalType](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) information and is interpreted as a plain INT32. As a result, all negative values are read back as unsigned 16-bit values (e.g. -2 becomes 65534).
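
To make the arithmetic behind "-2 becomes 65534" concrete, here is a minimal sketch in plain Python. The helper names `as_unsigned_16` and `as_signed_16` are hypothetical and exist only for illustration. The sketch shows the relationship between the two numbers; it does not claim to reproduce where in the Hudi write path the conversion happens.

```python
def as_unsigned_16(value: int) -> int:
    """Keep only the low 16 bits, i.e. read a 16-bit pattern as unsigned."""
    return value & 0xFFFF


def as_signed_16(value: int) -> int:
    """Sign-extend the low 16 bits back to a signed integer."""
    low = value & 0xFFFF
    # If bit 15 is set, the 16-bit pattern encodes a negative number.
    return low - 0x10000 if low & 0x8000 else low


print(as_unsigned_16(-2))    # 65534
print(as_signed_16(65534))   # -2
```

A reader that knows the column is a signed 16-bit integer can apply the second step. Without the logical type annotation, it has no reason to.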
   
   **To Reproduce**
   
   Here's the Python code that demonstrates this behavior. I created a Pandas dataframe with an INT16 column called "short_column" and converted it to hudi through pyspark.
   
   ```py
   import pandas as pd
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.master("local").getOrCreate()
   parquet_dir = "/tmp/test.parquet"
   
   data = pd.DataFrame({
       '_ts':pd.Series([100], dtype='int32'),
       'id':pd.Series([999], dtype='int32'),
       'short_column':pd.Series([-2], dtype='int16'),
   })
   
   data.to_parquet(parquet_dir)
   
   hudi_dir =  "/tmp/output_hudi"
   
   read_df = spark.read.parquet(parquet_dir)
   
   hudi_options = {
       "hoodie.table.name": 'test_table',
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.precombine.field": "_ts",
       "hoodie.datasource.write.partitionpath.field": "",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
       "hoodie.upsert.shuffle.parallelism": 1500,
       "hoodie.insert.shuffle.parallelism": 1500,
       "hoodie.consistency.check.enabled": True,
       "hoodie.index.type": "BLOOM",
       "hoodie.index.bloom.num_entries": 60000,
       "hoodie.index.bloom.fpp": 0.000000001,
       "hoodie.cleaner.commits.retained": 2,
   }
   (       
       read_df.write.format("org.apache.hudi")
           .options(**hudi_options)
           .mode("overwrite")
           .save(hudi_dir)
   )
   ```
   
   **Expected behavior**
   Using **[parquet-tools](https://pypi.org/project/parquet-tools/)**, I can extract the schema from both the original and the hudi parquet files.
   
   The original "short_column" has the following definition:
   ```
   ############ Column(short_column) ############
   name: short_column
   path: short_column
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT32
   logical_type: Int(bitWidth=16, isSigned=true)
   converted_type (legacy): INT_16
   ```
   The original "short_column" has the value -2.
   
   ```
   +-------+------+----------------+
   |   _ts |   id |   short_column |
   |-------+------+----------------|
   |   100 |  999 |             -2 |
   +-------+------+----------------+
   ```
   
   The hudi version has the following definition. You can see that the 
`logical_type` info has been lost.
   ```
   ############ Column(short_column) ############
   name: short_column
   path: short_column
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT32
   logical_type: None
   converted_type (legacy): NONE
   ```
   
   The hudi parquet dump shows the following output, which demonstrates that "short_column" is mistranslated to 65534.
   ```
   parquet-tools show /tmp/output_hudi/*.parquet
   
   +-----------------------+------------------------+----------------------+--------------------------+-------------------------------------------------------------------------+-------+------+----------------+
   |   _hoodie_commit_time |   _hoodie_commit_seqno |   _hoodie_record_key | _hoodie_partition_path   | _hoodie_file_name                                                       |   _ts |   id |   short_column |
   |-----------------------+------------------------+----------------------+--------------------------+-------------------------------------------------------------------------+-------+------+----------------|
   |        20211206215931 |     20211206215931_0_1 |                  999 |                          | 7665aecb-3efc-4e83-851b-002d9abfcd59-0_0-21-4510_20211206215931.parquet |   100 |  999 |          65534 |
   +-----------------------+------------------------+----------------------+--------------------------+-------------------------------------------------------------------------+-------+------+----------------+
   ```
   
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * Spark version : 
   - hudi-spark3-bundle_2.12
   - spark-avro_2.12
   
   * Running on Docker? (yes/no) : yes
   

