[jira] [Assigned] (SPARK-41982) When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`

Apache Spark (Jira) Fri, 13 Jan 2023 11:12:10 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-41982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Apache Spark reassigned SPARK-41982:
------------------------------------

    Assignee: Apache Spark

> When the inserted partition type is of string type, similar `dt=01` will be 
> converted to `dt=1`
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41982
>                 URL: https://issues.apache.org/jira/browse/SPARK-41982
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: jingxiong zhong
>            Assignee: Apache Spark
>            Priority: Critical
>
> At present, during the process of upgrading Spark2.4 to Spark3.2, we 
> carefully read the migration documentwe and found a kind of situation not 
> involved:
> {code:java}
> create table if not exists test_90(a string, b string) partitioned by (dt 
> string);
> desc formatted test_90;
> // case1
> insert into table test_90 partition (dt=05) values("1","2");
> // case2
> insert into table test_90 partition (dt='05') values("1","2");
> drop table test_90;{code}
> in spark2.4.3, it will generate such a path:
> {code:java}
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05 
> //result
> spark-sql> select * from test_90;
> 1       2       05
> 1       2       05
> Time taken: 1.316 seconds, Fetched 2 row(s)
> spark-sql> show partitions test_90; 
> dt=05 
> Time taken: 0.201 seconds, Fetched 1 row(s)
> spark-sql> select * from test_90 where dt='05';
> 1       2       05
> 1       2       05
> Time taken: 0.212 seconds, Fetched 2 row(s)
> spark-sql> explain insert into table test_90 partition (dt=05) 
> values("1","2");
> == Physical Plan ==
> Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, 
> [a, b]
> +- LocalTableScan [a#116, b#117]
> Time taken: 1.145 seconds, Fetched 1 row(s){code}
> in spark3.2.0, it will generate two path:
> {code:java}
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05 
> hdfs://test5/user/hive/db1/test_90/dt=5 
> // result
> spark-sql> select * from test_90;
> 1       2       05
> 1       2       5
> Time taken: 2.119 seconds, Fetched 2 row(s)
> spark-sql> show partitions test_90;
> dt=05
> dt=5
> Time taken: 0.161 seconds, Fetched 2 row(s)
> spark-sql> select * from test_90 where dt='05';
> 1       2       05
> Time taken: 0.252 seconds, Fetched 1 row(s)
> spark-sql> explain insert into table test_90 partition (dt=05) 
> values("1","2");
> plan
> == Physical Plan ==
> Execute InsertIntoHiveTable `db1`.`test_90`, 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
> +- LocalTableScan [a#109, b#110]{code}
> This will cause problems in reading data after the user switches to spark3. 
> The root cause is that in the process of partition field resolution, Spark3 
> has a process of strongly converting this string type, which will cause 
> partition `05` to lose the previous `0`
> So I think we have two solutions:
> one is to record the risk clearly in the migration document, and the other is 
> to repair this case, because we internally keep the partition of string type 
> as string type, regardless of whether single or double quotation marks are 
> added.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Assigned] (SPARK-41982) When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`

Reply via email to