[I] [SUPPORT] `CREATE TABLE ... USING hudi` DDL does not preserve partitioning order when syncing to AWS Glue [hudi]

via GitHub Mon, 27 Nov 2023 12:24:44 -0800


sayanpaul-plaid opened a new issue, #10182:
URL: https://github.com/apache/hudi/issues/10182


   **Describe the problem you faced**
   
   We've observed that the `CREATE TABLE` DDL alphabetizes partition column 
names when syncing to Glue. The values in `hoodie.properties` are correct; only 
the Glue table seems to be affected. Reads from Spark are not impacted, but the 
reordered partition columns appear to cause issues for Trino.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a Hudi table with the following code. Note that the partitioning 
columns are specified in `c, a, b` order.
       ```python
       df = spark.createDataFrame([
           {"a": 1, "b": 1, "c": 1, "d": 1},
           {"a": 2, "b": 2, "c": 2, "d": 1},
       ])
       
       location = "s3://..."
       
       df.write.format("hudi").options(
           **{
               'hoodie.bootstrap.index.enable': 'false',
               'hoodie.datasource.write.hive_style_partitioning': 'true',
               'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
               'hoodie.datasource.write.operation': 'upsert',
               'hoodie.datasource.write.partitionpath.field': 'c:SIMPLE,a:SIMPLE,b:SIMPLE',
               'hoodie.datasource.write.precombine.field': 'd',
               'hoodie.datasource.write.recordkey.field': 'd',
               'hoodie.datasource.write.table.name': 'test_nonalpha_partitioning',
               'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
               'hoodie.table.name': 'test_nonalpha_partitioning',
           }
       ).save(location)
       
       spark.sql(f"""
           create table prototype_lakehouse_testing.test_nonalpha_partitioning
           using hudi
           location '{location}'
       """)
       ```
   2. Observe that the Glue table reports partition columns in alphabetical 
order:
       ```
       ❯ aws glue get-table --database-name 'prototype_lakehouse_testing' --name 'test_nonalpha_partitioning' | jq '.Table.PartitionKeys'
       [
         {
           "Name": "a",
           "Type": "bigint"
         },
         {
           "Name": "b",
           "Type": "bigint"
         },
         {
           "Name": "c",
           "Type": "bigint"
         }
       ]
       ```
       while the table's `hoodie.properties` reports 
`hoodie.table.partition.fields=c,a,b`
   
   **Expected behavior**
   
   We expect the Glue table to preserve the partition column order.
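
The expectation above can be checked programmatically. The following is a minimal sketch, not part of the original report: it compares the order recorded in `hoodie.properties` with the order of `PartitionKeys` returned by Glue. The helper names are hypothetical, and it assumes `hoodie.table.partition.fields` holds plain comma-separated column names, as in the value quoted in step 2. The `fetch_glue_partition_keys` helper assumes `boto3` is installed and AWS credentials are configured.

```python
from typing import Dict, List


def partition_fields_from_properties(text: str) -> List[str]:
    """Return hoodie.table.partition.fields from the text of hoodie.properties."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "hoodie.table.partition.fields":
            return [f.strip() for f in value.split(",") if f.strip()]
    return []


def glue_partition_names(partition_keys: List[Dict[str, str]]) -> List[str]:
    """Extract column names, in order, from a Glue PartitionKeys list."""
    return [key["Name"] for key in partition_keys]


def order_matches(hudi_fields: List[str], partition_keys: List[Dict[str, str]]) -> bool:
    """True only if Glue lists the same columns in the same order as Hudi."""
    return hudi_fields == glue_partition_names(partition_keys)


def fetch_glue_partition_keys(database: str, table: str) -> List[Dict[str, str]]:
    """Hypothetical helper: read PartitionKeys from Glue via boto3."""
    import boto3  # imported lazily so the pure helpers work without AWS access

    glue = boto3.client("glue")
    response = glue.get_table(DatabaseName=database, Name=table)
    return response["Table"].get("PartitionKeys", [])


if __name__ == "__main__":
    # Values copied from the reproduction steps above.
    props = "hoodie.table.partition.fields=c,a,b\n"
    keys = [
        {"Name": "a", "Type": "bigint"},
        {"Name": "b", "Type": "bigint"},
        {"Name": "c", "Type": "bigint"},
    ]
    print(order_matches(partition_fields_from_properties(props), keys))
```

With the values from the reproduction, the comparison reports a mismatch, since Glue lists `a, b, c` while Hudi records `c, a, b`.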
   
   **Environment Description**
   
   The above was run on an AWS EMR cluster running version `emr-6.10.1`
   
   * Hudi version : `0.12.2-amzn-0`
   
   * Spark version : `3.3.1`
   
   * Hive version : `3.1.3`
   
   * Hadoop version : `3.3.3`
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : Spark on Docker
   
   
   **Stacktrace**
   
   n/a
   
   

