sayanpaul-plaid opened a new issue, #10182:
URL: https://github.com/apache/hudi/issues/10182
**Describe the problem you faced**
We've observed that the `CREATE TABLE` DDL alphabetizes partition column
names when syncing to Glue. The values in `hoodie.properties` are correct;
only the Glue table seems to be affected. This doesn't impact reads from
Spark, but it appears to cause issues for Trino.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a Hudi table with the following code. Note that the partitioning
columns are specified in `c, a, b` order.
```python
df = spark.createDataFrame([
    {"a": 1, "b": 1, "c": 1, "d": 1},
    {"a": 2, "b": 2, "c": 2, "d": 1},
])
location = "s3://..."
df.write.format("hudi").options(
    **{
        'hoodie.bootstrap.index.enable': 'false',
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.partitionpath.field': 'c:SIMPLE,a:SIMPLE,b:SIMPLE',
        'hoodie.datasource.write.precombine.field': 'd',
        'hoodie.datasource.write.recordkey.field': 'd',
        'hoodie.datasource.write.table.name': 'test_nonalpha_partitioning',
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.table.name': 'test_nonalpha_partitioning',
    }
).save(location)
spark.sql(f"""
    create table prototype_lakehouse_testing.test_nonalpha_partitioning
    using hudi
    location '{location}'
""")
```
2. Observe that the Glue table reports partition columns in alphabetical
order:
```
❯ aws glue get-table --database-name 'prototype_lakehouse_testing' \
    --name 'test_nonalpha_partitioning' | jq '.Table.PartitionKeys'
[
  {
    "Name": "a",
    "Type": "bigint"
  },
  {
    "Name": "b",
    "Type": "bigint"
  },
  {
    "Name": "c",
    "Type": "bigint"
  }
]
```
while the table's `hoodie.properties` reports
`hoodie.table.partition.fields=c,a,b`
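
To check other tables for the same mismatch, the comparison in step 2 can be scripted. The following is only a minimal sketch: the helper names (`parse_partition_fields`, `glue_partition_keys`, `partition_order_matches`) are hypothetical and not part of Hudi or the AWS SDK, and the inputs are assumed to be the raw text of `hoodie.properties` and the JSON returned by `aws glue get-table`, both hard-coded here from the output above.

```python
def parse_partition_fields(properties_text):
    """Return the ordered partition fields found in hoodie.properties text."""
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a key=value pair.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "hoodie.table.partition.fields":
            return [f.strip() for f in value.split(",") if f.strip()]
    return []


def glue_partition_keys(table_response):
    """Return partition key names, in catalog order, from a get-table response."""
    return [key["Name"] for key in table_response["Table"]["PartitionKeys"]]


def partition_order_matches(properties_text, table_response):
    """True when Glue lists the partition keys in the order Hudi recorded them."""
    return parse_partition_fields(properties_text) == glue_partition_keys(table_response)


# Values copied from the reproduction above (hypothetical variable names).
props = "hoodie.table.partition.fields=c,a,b"
glue = {"Table": {"PartitionKeys": [
    {"Name": "a", "Type": "bigint"},
    {"Name": "b", "Type": "bigint"},
    {"Name": "c", "Type": "bigint"},
]}}

print(partition_order_matches(props, glue))  # False
```

For the table in this report the check prints `False`, since Hudi records `c, a, b` while Glue lists `a, b, c`.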
**Expected behavior**
We expect the Glue table to preserve the partition column order.
**Environment Description**
The above was run on an AWS EMR cluster running version `emr-6.10.1`
* Hudi version : `0.12.2-amzn-0`
* Spark version : `3.3.1`
* Hive version : `3.1.3`
* Hadoop version : `3.3.3`
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : Spark on Docker
**Stacktrace**
n/a
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.