[ 
https://issues.apache.org/jira/browse/HUDI-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7964:
------------------------------
    Description: 
When multiple partition columns are listed in the PARTITIONED BY clause in a 
different order than they appear in the CREATE TABLE column list, the 
partitioning on storage is incorrect. Test script (note that the CREATE TABLE 
column list and the INSERT INTO values have city and then state, while the 
PARTITIONED BY clause has state first and then city):
{code:java}
DROP TABLE IF EXISTS hudi_table_mlp;

CREATE TABLE hudi_table_mlp (
  ts BIGINT,
  id STRING,
  rider STRING,
  driver STRING,
  fare DOUBLE,
  city STRING,
  state STRING)
USING HUDI OPTIONS (
  primaryKey = 'id',
  preCombineField = 'ts')
PARTITIONED BY (state, city)
LOCATION 'file:///tmp/hudi_table_mlp';

INSERT INTO hudi_table_mlp VALUES 
(1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
INSERT INTO hudi_table_mlp VALUES 
(1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
INSERT INTO hudi_table_mlp VALUES 
(1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
INSERT INTO hudi_table_mlp VALUES 
(1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
 {code}
This creates partitions as follows (note that the city and state values are swapped):

!Screenshot 2024-07-11 at 5.43.41 PM.png|width=737,height=335!
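
For readers without access to the screenshot, the on-disk layout can also be listed from the table location used in the test script. The following is a minimal sketch, not part of the original report: the helper names list_partition_paths and partition_values are hypothetical, and the sketch assumes partition directories sit two levels below the table root, named either as plain values or as hive-style name=value segments.

```python
from pathlib import Path


def list_partition_paths(table_root, depth=2):
    """Return directories `depth` levels below table_root as relative paths.

    Hidden directories such as .hoodie are skipped. The depth of 2 matches
    the two partition columns (state, city) in the test script.
    """
    root = Path(table_root)
    found = []

    def walk(directory, level):
        if level == depth:
            found.append(directory.relative_to(root).as_posix())
            return
        for child in sorted(directory.iterdir()):
            if child.is_dir() and not child.name.startswith("."):
                walk(child, level + 1)

    if root.is_dir():
        walk(root, 0)
    return found


def partition_values(relative_path):
    """Split a partition path into its values.

    Handles both plain segments ("a/b") and hive-style segments
    ("state=a/city=b"); which one applies is an assumption here.
    """
    return [segment.split("=", 1)[-1] for segment in relative_path.split("/")]


if __name__ == "__main__":
    # Table location taken from the test script; prints nothing if absent.
    for path in list_partition_paths("/tmp/hudi_table_mlp"):
        print(path, "->", partition_values(path))
```

With PARTITIONED BY (state, city), the first value of each printed pair would be expected to be a state. If it is a city instead, the table shows the swap described here.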

Now, if I query with a state='texas' filter, there are no results:
{code:java}
spark-sql> select * from hudi_table_mlp where state='texas'; -- no results --
Time taken: 0.356 seconds {code}
I have tested this with master, 0.15.0 and 0.14.1, so it's not a recent 
regression.

 

  was:
When multiple partition columns are listed in the PARTITIONED BY clause in a 
different order than they appear in the CREATE TABLE column list, the 
partitioning on storage is incorrect. Test script (note that the CREATE TABLE 
column list and the INSERT INTO values have city and then state, while the 
PARTITIONED BY clause has state first and then city):
{code:java}
DROP TABLE IF EXISTS hudi_table_mlp;
CREATE TABLE hudi_table_mlp (
  ts BIGINT,
  id STRING,
  rider STRING,
  driver STRING,
  fare DOUBLE,
  city STRING,
  state STRING)
USING HUDI OPTIONS (
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.metadata.record.index.enable = 'true')
PARTITIONED BY (state, city)
LOCATION 'file:///tmp/hudi_table_mlp';

INSERT INTO hudi_table_mlp VALUES 
(1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
INSERT INTO hudi_table_mlp VALUES 
(1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
INSERT INTO hudi_table_mlp VALUES 
(1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
INSERT INTO hudi_table_mlp VALUES 
(1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
 {code}
This creates partitions as follows (note that the city and state values are swapped):

!Screenshot 2024-07-06 at 11.34.17 AM.png!

Now, if I query with a state='texas' filter, there are no results:
{code:java}
spark-sql> select * from hudi_table_mlp where state='texas';
24/07/06 11:30:36 INFO HoodieFileIndex: Using provided predicates to prune 
number of target table's partitions scanned from 4 to 0
Time taken: 0.056 seconds {code}
I have tested this with master, 0.15.0 and 0.14.1, so it's not a recent 
regression.

 


> Partitions not created correctly with SQL when multiple partitions specified 
> out of order
> -----------------------------------------------------------------------------------------
>
>                 Key: HUDI-7964
>                 URL: https://issues.apache.org/jira/browse/HUDI-7964
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Sagar Sumit
>            Priority: Major
>              Labels: spark-sql
>             Fix For: 1.0.0
>
>         Attachments: Screenshot 2024-07-06 at 11.34.17 AM.png, Screenshot 
> 2024-07-11 at 5.43.41 PM.png
>
>
> When multiple partition columns are listed in the PARTITIONED BY clause in a 
> different order than they appear in the CREATE TABLE column list, the 
> partitioning on storage is incorrect. Test script (note that the CREATE TABLE 
> column list and the INSERT INTO values have city and then state, while the 
> PARTITIONED BY clause has state first and then city):
> {code:java}
> DROP TABLE IF EXISTS hudi_table_mlp;
> CREATE TABLE hudi_table_mlp (
>   ts BIGINT,
>   id STRING,
>   rider STRING,
>   driver STRING,
>   fare DOUBLE,
>   city STRING,
>   state STRING)
> USING HUDI OPTIONS (
>   primaryKey = 'id',
>   preCombineField = 'ts')
> PARTITIONED BY (state, city)
> LOCATION 'file:///tmp/hudi_table_mlp';
> INSERT INTO hudi_table_mlp VALUES 
> (1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
> INSERT INTO hudi_table_mlp VALUES 
> (1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
> INSERT INTO hudi_table_mlp VALUES 
> (1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
> INSERT INTO hudi_table_mlp VALUES 
> (1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
>  {code}
> This creates partitions as follows (note that the city and state values are 
> swapped):
> !Screenshot 2024-07-11 at 5.43.41 PM.png|width=737,height=335!
> Now, if I query with a state='texas' filter, there are no results:
> {code:java}
> spark-sql> select * from hudi_table_mlp where state='texas'; -- no results --
> Time taken: 0.356 seconds {code}
> I have tested this with master, 0.15.0 and 0.14.1, so it's not a recent 
> regression.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
