Sagar Sumit created HUDI-7964:
---------------------------------
Summary: Partitions not created correctly with SQL when multiple
partitions specified out of order
Key: HUDI-7964
URL: https://issues.apache.org/jira/browse/HUDI-7964
Project: Apache Hudi
Issue Type: Task
Reporter: Sagar Sumit
Fix For: 1.0.0
Attachments: Screenshot 2024-07-06 at 11.34.17 AM.png
When multiple partition fields are listed in the PARTITIONED BY clause in a
different order than they appear in the column list of the CREATE TABLE
command, the partitioning on storage is incorrect. Test script (note that the
CREATE TABLE column list and the INSERT INTO values have city and then state,
while the PARTITIONED BY clause has state first and then city):
{code:java}
DROP TABLE IF EXISTS hudi_table_mlp;
CREATE TABLE hudi_table_mlp (
  ts BIGINT, id STRING, rider STRING, driver STRING,
  fare DOUBLE, city STRING, state STRING
) USING HUDI
OPTIONS (
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.metadata.record.index.enable = 'true'
)
PARTITIONED BY (state, city)
LOCATION 'file:///tmp/hudi_table_mlp';
INSERT INTO hudi_table_mlp VALUES
(1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
INSERT INTO hudi_table_mlp VALUES
(1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
INSERT INTO hudi_table_mlp VALUES
(1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
INSERT INTO hudi_table_mlp VALUES
(1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
{code}
This creates partitions as follows (note that the city and state values are swapped):
!Screenshot 2024-07-06 at 11.34.17 AM.png!
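The same layout can also be checked from spark-sql instead of the file browser. A minimal sketch, assuming SHOW PARTITIONS is available for this table and relying on Hudi's _hoodie_partition_path meta column; no expected output is given since the exact paths depend on the table's partition path settings:
{code:sql}
-- List the partition paths registered for the table.
SHOW PARTITIONS hudi_table_mlp;
-- Compare each record's city/state values with the partition path it was written to.
SELECT id, city, state, _hoodie_partition_path FROM hudi_table_mlp;
{code}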
Now, if I query with a state='texas' filter, there are no results:
{code:java}
spark-sql> select * from hudi_table_mlp where state='texas';
24/07/06 11:30:36 INFO HoodieFileIndex: Using provided predicates to prune number of target table's partitions scanned from 4 to 0
Time taken: 0.056 seconds
{code}
I have tested this with master, 0.15.0 and 0.14.1, so it's not a recent
regression.
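A possible way to narrow this down, offered as an untested sketch: if the cause is that INSERT INTO ... VALUES without a column list maps values by position onto a schema in which the partition columns have been moved to the end in PARTITIONED BY order (an assumption, not confirmed in this report), then naming the columns explicitly should avoid the swap. This also assumes the Spark and Hudi versions in use accept a column list in INSERT INTO:
{code:sql}
-- Check the column order the table actually ended up with.
DESCRIBE TABLE hudi_table_mlp;
-- Same row as the last insert in the test script, but with the columns named
-- explicitly so the values do not depend on column position.
INSERT INTO hudi_table_mlp (ts, id, rider, driver, fare, city, state) VALUES
(1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
{code}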