[ 
https://issues.apache.org/jira/browse/HUDI-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866759#comment-17866759
 ] 

YangXuan commented on HUDI-7964:
--------------------------------

Hi, the last few fields in VALUES() are the partition columns. Their order
follows the order specified in PARTITIONED BY (), not the column order in the
CREATE TABLE statement. This has been the case since version 0.9. So, is this
considered a problem?
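
As a sketch of what this means for the script in the issue, assuming the positional mapping described above: the effective column order would be ts, id, rider, driver, fare, state, city, so the VALUES list needs state before city. The statements below reuse the table and one row from the issue and are untested here.

```sql
-- Assumes the behavior described above: partition columns are matched last,
-- in PARTITIONED BY (state, city) order, regardless of the CREATE TABLE order.
-- Effective positional order: ts, id, rider, driver, fare, state, city
INSERT INTO hudi_table_mlp VALUES
(1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'texas','austin');

-- With the values in this order, a filter on state should match the row.
SELECT * FROM hudi_table_mlp WHERE state = 'texas';
```

Under the same assumption, the rows inserted by the original script would have their city and state values swapped, which would explain why the state='texas' filter returns nothing.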

> Partitions not created correctly with SQL when multiple partitions specified 
> out of order
> -----------------------------------------------------------------------------------------
>
>                 Key: HUDI-7964
>                 URL: https://issues.apache.org/jira/browse/HUDI-7964
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Sagar Sumit
>            Priority: Major
>              Labels: spark-sql
>             Fix For: 1.0.0
>
>         Attachments: Screenshot 2024-07-06 at 11.34.17 AM.png, Screenshot 
> 2024-07-11 at 5.43.41 PM.png
>
>
> When multiple partition columns are specified out of order (compared to the
> order of fields in the CREATE TABLE command), the partitioning on storage is
> incorrect. Test script (note that the CREATE TABLE and INSERT INTO commands
> have city and then state, while the PARTITIONED BY clause has state first and
> then city):
> {code:java}
> DROP TABLE IF EXISTS hudi_table_mlp;
> CREATE TABLE hudi_table_mlp (    
>   ts BIGINT,    
>   id STRING,    
>   rider STRING,    
>   driver STRING,    
>   fare DOUBLE,    
>   city STRING,    
>   state STRING) 
> USING HUDI options(    
>   primaryKey ='id',    
>   preCombineField = 'ts')
> PARTITIONED BY (state, city)
> LOCATION 'file:///tmp/hudi_table_mlp';
> INSERT INTO hudi_table_mlp VALUES 
> (1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
> INSERT INTO hudi_table_mlp VALUES 
> (1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
> INSERT INTO hudi_table_mlp VALUES 
> (1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
> INSERT INTO hudi_table_mlp VALUES 
> (1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
>  {code}
> This creates partitions as follows (note that the city and state values are
> swapped):
> !Screenshot 2024-07-11 at 5.43.41 PM.png|width=737,height=335!
> Now, if I query with a state='texas' filter, there are no results:
> {code:java}
> spark-sql> select * from hudi_table_mlp where state='texas'; -- no results --
> Time taken: 0.356 seconds {code}
> I have tested this with master, 0.15.0 and 0.14.1, so it's not a recent 
> regression.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
