[ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26188:
------------------------------------

    Assignee: Apache Spark

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> ----------------------------------------------------------------
>
>                 Key: SPARK-26188
>                 URL: https://issues.apache.org/jira/browse/SPARK-26188
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Damien Doucet-Girard
>            Assignee: Apache Spark
>            Priority: Minor
>
> My team uses Spark to partition Parquet files and write them to Amazon S3. We 
> typically use 256 partitions, named 00 to ff.
> We've observed that Spark 2.3.2 and earlier read the partition values as 
> strings by default. However, in Spark 2.4.0 and later, the type of each 
> partition value is inferred by default, so partitions such as 00 become 0 and 
> 4d becomes 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-00003-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, 
> range: 0-662, partition values: [ba]
> 18/11/27 14:02:51 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=2d/part-00085-hashredacted.parquet, 
> range: 0-662, partition values: [2.0]
> 18/11/27 14:02:52 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=03/part-00099-hashredacted.parquet, 
> range: 0-662, partition values: [3]
> 18/11/27 14:02:53 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=57/part-00196-hashredacted.parquet, 
> range: 0-662, partition values: [57]
> 18/11/27 14:02:54 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=81/part-00122-hashredacted.parquet, 
> range: 0-662, partition values: [81]
> 18/11/27 14:02:55 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=6d/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [6.0]
> 18/11/27 14:02:56 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=36/part-00154-hashredacted.parquet, 
> range: 0-662, partition values: [36]
> 18/11/27 14:02:57 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4b/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [4b]{code}
> 2.3.2:
> {code:java}
> 18/11/27 14:09:00 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=60/part-00082-hashredacted.parquet, 
> range: 0-662, partition values: [60]
> 18/11/27 14:09:01 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [00]
> 18/11/27 14:09:02 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:09:02 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:09:03 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:09:04 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:09:04 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:09:05 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:09:05 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:09:06 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:09:06 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=51/part-00003-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:09:07 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:09:08 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:09:08 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:09:09 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4d]
> 18/11/27 14:09:09 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:09:10 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:09:11 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:09:11 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:09:12 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:09:12 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:09:13 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:09:13 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, 
> range: 0-662, partition values: [ba]
> 18/11/27 14:09:14 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=2d/part-00085-hashredacted.parquet, 
> range: 0-662, partition values: [2d]
> 18/11/27 14:09:15 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=03/part-00099-hashredacted.parquet, 
> range: 0-662, partition values: [03]
> 18/11/27 14:09:15 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=57/part-00196-hashredacted.parquet, 
> range: 0-662, partition values: [57]
> 18/11/27 14:09:16 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=81/part-00122-hashredacted.parquet, 
> range: 0-662, partition values: [81]
> 18/11/27 14:09:17 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=6d/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [6d]
> 18/11/27 14:09:17 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=36/part-00154-hashredacted.parquet, 
> range: 0-662, partition values: [36]
> 18/11/27 14:09:18 INFO FileScanRDD: Reading File path: 
> s3a://bucketnameredacted/ddgirard/suffix=4b/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [4b]
> {code}
> After some investigation, we've isolated the issue to
>  
> [https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
>   
> In the inferPartitioning method, 2.3.2 sets the type inference to false by 
> default:
> {code:java}
> val spec = PartitioningUtils.parsePartitions(
>   leafDirs,
>   typeInference = false,
>   basePaths = basePaths,
>   timeZoneId = timeZoneId){code}
> However, in version 2.4.0, the typeInference flag has been replaced with a 
> config flag:
> [https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133]
>  
> {code:java}
> val inferredPartitionSpec = PartitioningUtils.parsePartitions(
>   leafDirs,
>   typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
>   basePaths = basePaths,
>   timeZoneId = timeZoneId){code}
> And this conf's default value is true
> {code:java}
> val PARTITION_COLUMN_TYPE_INFERENCE =
>   buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
>     .doc("When true, automatically infer the data types for partitioned columns.")
>     .booleanConf
>     .createWithDefault(true){code}
> [https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
>   
> Would it be appropriate to treat this as a bug and change the default conf 
> value to false, in order to preserve backwards compatibility?
>  
>   
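
The values in the 2.4.0 log above (00 read as 0, 4d read as 4.0, ef left unchanged) are consistent with a parse cascade that tries integer, then long, then double, and otherwise keeps the raw string. The sketch below is a simplified illustration of that idea using only the Java standard library. It is not Spark's code: the class name PartitionInferenceSketch and the method infer are hypothetical, and the sketch assumes only the int/long/double/string steps, omitting any other types a real implementation might try.

```java
public class PartitionInferenceSketch {

    // Hypothetical helper (not a Spark API): returns the value a partition
    // directory name would take under a simplified inference cascade.
    public static Object infer(String raw, boolean typeInference) {
        if (!typeInference) {
            // Inference disabled: the directory name is kept as a string.
            return raw;
        }
        try {
            // "00" and "03" parse as the integers 0 and 3; leading zeros are lost.
            return Integer.parseInt(raw);
        } catch (NumberFormatException ignored) {
            // Not an int; try the next type.
        }
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException ignored) {
            // Not a long; try the next type.
        }
        try {
            // Double.parseDouble accepts a trailing type suffix (d, D, f, F),
            // so "4d" parses as the double 4.0.
            return Double.parseDouble(raw);
        } catch (NumberFormatException ignored) {
            // Not a double; fall back to the raw string.
        }
        return raw;
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"00", "03", "4d", "2d", "ef", "4a"}) {
            System.out.println(raw + " -> inference on: " + infer(raw, true)
                + ", inference off: " + infer(raw, false));
        }
    }
}
```

Under this sketch, only names made up entirely of digits, or a digit followed by d or f, change value; names such as ef and 4a fail every numeric parse and stay strings. Setting the conf quoted above, spark.sql.sources.partitionColumnTypeInference.enabled, to false should correspond to the typeInference = false branch here.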



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

