[
https://issues.apache.org/jira/browse/SPARK-41982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-41982:
------------------------------------
Assignee: Apache Spark
> When the inserted partition type is of string type, similar `dt=01` will be
> converted to `dt=1`
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-41982
> URL: https://issues.apache.org/jira/browse/SPARK-41982
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: jingxiong zhong
> Assignee: Apache Spark
> Priority: Critical
>
> At present, during the process of upgrading Spark2.4 to Spark3.2, we
> carefully read the migration documentwe and found a kind of situation not
> involved:
> {code:java}
> create table if not exists test_90(a string, b string) partitioned by (dt
> string);
> desc formatted test_90;
> // case1
> insert into table test_90 partition (dt=05) values("1","2");
> // case2
> insert into table test_90 partition (dt='05') values("1","2");
> drop table test_90;{code}
> in spark2.4.3, it will generate such a path:
> {code:java}
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05
> //result
> spark-sql> select * from test_90;
> 1 2 05
> 1 2 05
> Time taken: 1.316 seconds, Fetched 2 row(s)
> spark-sql> show partitions test_90;
> dt=05
> Time taken: 0.201 seconds, Fetched 1 row(s)
> spark-sql> select * from test_90 where dt='05';
> 1 2 05
> 1 2 05
> Time taken: 0.212 seconds, Fetched 2 row(s)
> spark-sql> explain insert into table test_90 partition (dt=05)
> values("1","2");
> == Physical Plan ==
> Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`,
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false,
> [a, b]
> +- LocalTableScan [a#116, b#117]
> Time taken: 1.145 seconds, Fetched 1 row(s){code}
> in spark3.2.0, it will generate two path:
> {code:java}
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05
> hdfs://test5/user/hive/db1/test_90/dt=5
> // result
> spark-sql> select * from test_90;
> 1 2 05
> 1 2 5
> Time taken: 2.119 seconds, Fetched 2 row(s)
> spark-sql> show partitions test_90;
> dt=05
> dt=5
> Time taken: 0.161 seconds, Fetched 2 row(s)
> spark-sql> select * from test_90 where dt='05';
> 1 2 05
> Time taken: 0.252 seconds, Fetched 1 row(s)
> spark-sql> explain insert into table test_90 partition (dt=05)
> values("1","2");
> plan
> == Physical Plan ==
> Execute InsertIntoHiveTable `db1`.`test_90`,
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
> +- LocalTableScan [a#109, b#110]{code}
> This will cause problems in reading data after the user switches to spark3.
> The root cause is that in the process of partition field resolution, Spark3
> has a process of strongly converting this string type, which will cause
> partition `05` to lose the previous `0`
> So I think we have two solutions:
> one is to record the risk clearly in the migration document, and the other is
> to repair this case, because we internally keep the partition of string type
> as string type, regardless of whether single or double quotation marks are
> added.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]