[
https://issues.apache.org/jira/browse/HUDI-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sagar Sumit updated HUDI-7964:
------------------------------
Description:
When multiple partition columns are specified out of order (relative to the
order of the fields in the CREATE TABLE statement), the partitioning on storage
is incorrect. Test script (note that the CREATE TABLE and INSERT INTO
statements list city before state, while the PARTITIONED BY clause lists state
before city):
{code:java}
DROP TABLE IF EXISTS hudi_table_mlp;
CREATE TABLE hudi_table_mlp (
ts BIGINT,
id STRING,
rider STRING,
driver STRING,
fare DOUBLE,
city STRING,
state STRING)
USING HUDI options(
primaryKey ='id',
preCombineField = 'ts')
PARTITIONED BY (state, city)
LOCATION 'file:///tmp/hudi_table_mlp';
INSERT INTO hudi_table_mlp VALUES
(1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
INSERT INTO hudi_table_mlp VALUES
(1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
INSERT INTO hudi_table_mlp VALUES
(1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
INSERT INTO hudi_table_mlp VALUES
(1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
{code}
This creates partitions as follows (note that the city and state values are swapped):
!Screenshot 2024-07-11 at 5.43.41 PM.png|width=737,height=335!
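For readers who cannot open the screenshot, the directory layout can be listed directly, since the table lives on the local filesystem. The following is only a minimal sketch: the helper name `list_partition_dirs` is hypothetical, and it assumes that partition directories are the leaf directories under the table location and that dot-prefixed directories such as `.hoodie` hold metadata rather than data.

```python
import os


def list_partition_dirs(base_path):
    """Return relative paths of leaf directories under base_path.

    Dot-prefixed directories (for example .hoodie) are skipped on the
    assumption that they contain table metadata, not partitions.
    """
    result = []
    for root, dirs, _files in os.walk(base_path):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirs[:] = sorted(d for d in dirs if not d.startswith("."))
        if not dirs and root != base_path:
            result.append(os.path.relpath(root, base_path))
    return sorted(result)


if __name__ == "__main__":
    # Table location taken from the test script in this report.
    for path in list_partition_dirs("/tmp/hudi_table_mlp"):
        print(path)
```

If the partitioning were correct, each printed path would carry a state value at the first level and a city value at the second.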
Now, if I query with a state='texas' filter, no results are returned:
{code:java}
spark-sql> select * from hudi_table_mlp where state='texas'; -- no results --
Time taken: 0.356 seconds
{code}
I have tested this with master, 0.15.0 and 0.14.1, so it's not a recent
regression.
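To make the symptom concrete, here is a small illustrative sketch of why a state='texas' filter finds nothing once the values are swapped. It is not Hudi code and does not claim to show the root cause: the `column=value` directory naming is an assumption (the actual paths are only visible in the screenshot), and `expected_path`, `swapped_path` and `prune` are hypothetical helpers.

```python
# The four (city, state) pairs inserted by the test script.
ROWS = [
    {"city": "san_francisco", "state": "california"},
    {"city": "sunnyvale", "state": "california"},
    {"city": "austin", "state": "texas"},
    {"city": "houston", "state": "texas"},
]

# Column order from the PARTITIONED BY clause.
PARTITION_COLS = ["state", "city"]


def expected_path(row):
    """Partition path when each value lands under its own column name."""
    return "/".join(f"{col}={row[col]}" for col in PARTITION_COLS)


def swapped_path(row):
    """Partition path for the reported symptom: city and state values swapped."""
    swapped = {"state": row["city"], "city": row["state"]}
    return "/".join(f"{col}={swapped[col]}" for col in PARTITION_COLS)


def prune(paths, column, value):
    """Keep only the paths that contain an exact column=value segment."""
    return [p for p in paths if f"{column}={value}" in p.split("/")]


expected = [expected_path(r) for r in ROWS]
observed = [swapped_path(r) for r in ROWS]

print(prune(expected, "state", "texas"))
print(prune(observed, "state", "texas"))
```

With the expected layout the filter keeps the two Texas partitions; with the swapped layout no path has a `state=texas` segment, so all four partitions are pruned away.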
was:
When multiple partition columns are specified out of order (relative to the
order of the fields in the CREATE TABLE statement), the partitioning on storage
is incorrect. Test script (note that the CREATE TABLE and INSERT INTO
statements list city before state, while the PARTITIONED BY clause lists state
before city):
{code:java}
DROP TABLE IF EXISTS hudi_table_mlp;
CREATE TABLE hudi_table_mlp (
ts BIGINT,
id STRING,
rider STRING,
driver STRING,
fare DOUBLE,
city STRING,
state STRING)
USING HUDI options(
primaryKey ='id',
preCombineField = 'ts',
hoodie.metadata.record.index.enable = 'true')
PARTITIONED BY (state, city)
LOCATION 'file:///tmp/hudi_table_mlp';
INSERT INTO hudi_table_mlp VALUES
(1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
INSERT INTO hudi_table_mlp VALUES
(1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
INSERT INTO hudi_table_mlp VALUES
(1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
INSERT INTO hudi_table_mlp VALUES
(1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
{code}
This creates partitions as follows (note that the city and state values are swapped):
!Screenshot 2024-07-06 at 11.34.17 AM.png!
Now, if I query with a state='texas' filter, no results are returned:
{code:java}
spark-sql> select * from hudi_table_mlp where state='texas';
24/07/06 11:30:36 INFO HoodieFileIndex: Using provided predicates to prune number of target table's partitions scanned from 4 to 0
Time taken: 0.056 seconds
{code}
I have tested this with master, 0.15.0 and 0.14.1, so it's not a recent
regression.
> Partitions not created correctly with SQL when multiple partitions specified
> out of order
> -----------------------------------------------------------------------------------------
>
> Key: HUDI-7964
> URL: https://issues.apache.org/jira/browse/HUDI-7964
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Sagar Sumit
> Priority: Major
> Labels: spark-sql
> Fix For: 1.0.0
>
> Attachments: Screenshot 2024-07-06 at 11.34.17 AM.png, Screenshot
> 2024-07-11 at 5.43.41 PM.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)