Sagar Sumit created HUDI-7964:
---------------------------------

             Summary: Partitions not created correctly with SQL when multiple 
partitions specified out of order
                 Key: HUDI-7964
                 URL: https://issues.apache.org/jira/browse/HUDI-7964
             Project: Apache Hudi
          Issue Type: Task
            Reporter: Sagar Sumit
             Fix For: 1.0.0
         Attachments: Screenshot 2024-07-06 at 11.34.17 AM.png

When multiple partition fields are specified in the PARTITIONED BY clause in a 
different order than they appear in the column list of the CREATE TABLE 
command, the partitioning on storage is incorrect. Test script (note that the 
CREATE TABLE and INSERT INTO commands list city and then state, while the 
PARTITIONED BY clause has state first and then city):
{code:java}
DROP TABLE IF EXISTS hudi_table_mlp;
CREATE TABLE hudi_table_mlp (
    ts BIGINT,
    id STRING,
    rider STRING,
    driver STRING,
    fare DOUBLE,
    city STRING,
    state STRING
) USING HUDI
OPTIONS (
    primaryKey = 'id',
    preCombineField = 'ts',
    hoodie.metadata.record.index.enable = 'true'
)
PARTITIONED BY (state, city)
LOCATION 'file:///tmp/hudi_table_mlp';

INSERT INTO hudi_table_mlp VALUES (1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
INSERT INTO hudi_table_mlp VALUES (1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
INSERT INTO hudi_table_mlp VALUES (1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
INSERT INTO hudi_table_mlp VALUES (1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
 {code}
This creates partitions as follows (note that city and state values are swapped):

!Screenshot 2024-07-06 at 11.34.17 AM.png!

Now, if I query with a state='texas' filter, there are no results:
{code:java}
spark-sql> select * from hudi_table_mlp where state='texas';
24/07/06 11:30:36 INFO HoodieFileIndex: Using provided predicates to prune number of target table's partitions scanned from 4 to 0
Time taken: 0.056 seconds
{code}
I have tested this with master, 0.15.0 and 0.14.1, so it's not a recent 
regression.
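
To make the symptom concrete, here is a small toy model of why a state='texas' filter can prune every partition once the values are swapped on storage. It is only a sketch and not Hudi's code: the `<state>/<city>` path layout, the position-based parsing of path segments, and the helper names `expected_path`, `observed_path` and `prune` are all assumptions made for illustration. The real layout is the one in the attached screenshot.

```python
# Toy model only: this is NOT Hudi's code. It assumes partition paths are laid
# out as <state>/<city> (the PARTITIONED BY order) and that the reader parses
# each path segment back into the partition column at the same position.

PARTITION_COLUMNS = ["state", "city"]  # order from PARTITIONED BY (state, city)

# (city, state) pairs in the order the values appear in the INSERT statements.
ROWS = [
    ("san_francisco", "california"),
    ("sunnyvale", "california"),
    ("austin", "texas"),
    ("houston", "texas"),
]


def expected_path(city, state):
    """Path when each value is written under its own partition column."""
    return f"{state}/{city}"


def observed_path(city, state):
    """Path as described in the report: city and state values swapped."""
    return f"{city}/{state}"


def prune(paths, column, value):
    """Keep only paths whose segment for `column` equals `value`."""
    index = PARTITION_COLUMNS.index(column)
    return [p for p in paths if p.split("/")[index] == value]


expected = [expected_path(city, state) for city, state in ROWS]
observed = [observed_path(city, state) for city, state in ROWS]

print(prune(expected, "state", "texas"))  # two partitions survive
print(prune(observed, "state", "texas"))  # nothing survives
```

Under these assumptions, the swapped layout puts city values in the position that is read back as state, so no partition matches 'texas'. That is consistent with the "from 4 to 0" pruning message in the log.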

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
