[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-26188:
------------------------------------

    Assignee: Apache Spark

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> ----------------------------------------------------------------
>
>                 Key: SPARK-26188
>                 URL: https://issues.apache.org/jira/browse/SPARK-26188
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Damien Doucet-Girard
>            Assignee: Apache Spark
>            Priority: Minor
>
> My team uses Spark to partition and output Parquet files to Amazon S3. We
> typically use 256 partitions, from 00 to ff.
> We've observed that Spark 2.3.2 and prior read the partition values as
> strings by default. In Spark 2.4.0 and later, however, the type of each
> partition value is inferred by default, so a value such as 00 becomes 0
> and 4d becomes 4.0.
> Here is a log sample of this behavior from one of our jobs:
> 2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=51/part-00003-hashredacted.parquet, range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, range: 0-662, partition values: [7a]
> 18/11/27 14:02:50 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, range: 0-662, partition values: [ba]
> 18/11/27 14:02:51 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=2d/part-00085-hashredacted.parquet, range: 0-662, partition values: [2.0]
> 18/11/27 14:02:52 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=03/part-00099-hashredacted.parquet, range: 0-662, partition values: [3]
> 18/11/27 14:02:53 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=57/part-00196-hashredacted.parquet, range: 0-662, partition values: [57]
> 18/11/27 14:02:54 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=81/part-00122-hashredacted.parquet, range: 0-662, partition values: [81]
> 18/11/27 14:02:55 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=6d/part-00167-hashredacted.parquet, range: 0-662, partition values: [6.0]
> 18/11/27 14:02:56 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=36/part-00154-hashredacted.parquet, range: 0-662, partition values: [36]
> 18/11/27 14:02:57 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=4b/part-00093-hashredacted.parquet, range: 0-662, partition values: [4b]
> {code}
> 2.3.2:
> {code:java}
> 18/11/27 14:09:00 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=60/part-00082-hashredacted.parquet, range: 0-662, partition values: [60]
> 18/11/27 14:09:01 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, range: 0-662, partition values: [00]
> 18/11/27 14:09:02 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, range: 0-662, partition values: [ef]
> 18/11/27 14:09:02 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, range: 0-662, partition values: [4a]
> 18/11/27 14:09:03 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, range: 0-662, partition values: [74]
> 18/11/27 14:09:04 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, range: 0-662, partition values: [f5]
> 18/11/27 14:09:04 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, range: 0-662, partition values: [50]
> 18/11/27 14:09:05 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, range: 0-662, partition values: [70]
> 18/11/27 14:09:05 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, range: 0-662, partition values: [b9]
> 18/11/27 14:09:06 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, range: 0-662, partition values: [d2]
> 18/11/27 14:09:06 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=51/part-00003-hashredacted.parquet, range: 0-662, partition values: [51]
> 18/11/27 14:09:07 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, range: 0-662, partition values: [84]
> 18/11/27 14:09:08 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, range: 0-662, partition values: [b5]
> 18/11/27 14:09:08 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, range: 0-662, partition values: [88]
> 18/11/27 14:09:09 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, range: 0-662, partition values: [4d]
> 18/11/27 14:09:09 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, range: 0-662, partition values: [ac]
> 18/11/27 14:09:10 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, range: 0-662, partition values: [24]
> 18/11/27 14:09:11 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, range: 0-662, partition values: [fd]
> 18/11/27 14:09:11 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, range: 0-662, partition values: [52]
> 18/11/27 14:09:12 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, range: 0-662, partition values: [ab]
> 18/11/27 14:09:12 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, range: 0-662, partition values: [f8]
> 18/11/27 14:09:13 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, range: 0-662, partition values: [7a]
> 18/11/27 14:09:13 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, range: 0-662, partition values: [ba]
> 18/11/27 14:09:14 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=2d/part-00085-hashredacted.parquet, range: 0-662, partition values: [2d]
> 18/11/27 14:09:15 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=03/part-00099-hashredacted.parquet, range: 0-662, partition values: [03]
> 18/11/27 14:09:15 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=57/part-00196-hashredacted.parquet, range: 0-662, partition values: [57]
> 18/11/27 14:09:16 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=81/part-00122-hashredacted.parquet, range: 0-662, partition values: [81]
> 18/11/27 14:09:17 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=6d/part-00167-hashredacted.parquet, range: 0-662, partition values: [6d]
> 18/11/27 14:09:17 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=36/part-00154-hashredacted.parquet, range: 0-662, partition values: [36]
> 18/11/27 14:09:18 INFO FileScanRDD: Reading File path: s3a://bucketnameredacted/ddgirard/suffix=4b/part-00093-hashredacted.parquet, range: 0-662, partition values: [4b]
> {code}
> After some investigation, we've isolated the issue to
> [https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]
>
> In the inferPartitioning method, 2.3.2 hard-codes type inference to false:
> {code:java}
> val spec = PartitioningUtils.parsePartitions(
>   leafDirs,
>   typeInference = false,
>   basePaths = basePaths,
>   timeZoneId = timeZoneId){code}
> However, in version 2.4.0, the hard-coded typeInference value has been
> replaced with a config flag:
> [https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133]
>
> {code:java}
> val inferredPartitionSpec = PartitioningUtils.parsePartitions(
>   leafDirs,
>   typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
>   basePaths = basePaths,
>   timeZoneId = timeZoneId){code}
> This conf's default value is true:
> {code:java}
> val PARTITION_COLUMN_TYPE_INFERENCE =
>   buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
>     .doc("When true, automatically infer the data types for partitioned columns.")
>     .booleanConf
>     .createWithDefault(true){code}
> [https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
>
> I was wondering whether a bug report would be appropriate, to preserve
> backwards compatibility by changing the default conf value to false.
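
The partition values in the 2.4.0 log are consistent with each directory name being tried as a number before falling back to a string. The sketch below is not Spark's code: `infer` is a hypothetical helper that assumes a simplified order of integer, then double, then string. It uses only standard JVM parsing, where `Double.parseDouble` accepts a trailing `d` type suffix, which would explain why `4d` is logged as `4.0` while `4a` stays unchanged.

```java
public class PartitionValueInferenceSketch {

    // Hypothetical helper (not part of Spark): returns the string form of the
    // value that a simplified "integer, then double, then string" inference
    // would produce for a raw partition directory value.
    static String infer(String raw) {
        try {
            // "00" parses as the integer 0, so the leading zero is lost.
            return String.valueOf(Integer.parseInt(raw));
        } catch (NumberFormatException ignored) {
            // Not an integer; try the next candidate type.
        }
        try {
            // "4d" parses as 4.0 because 'd' is a valid double type suffix.
            return String.valueOf(Double.parseDouble(raw));
        } catch (NumberFormatException ignored) {
            // Not a double either; keep the raw string.
        }
        return raw;
    }

    public static void main(String[] args) {
        String[] samples = {"00", "03", "4d", "2d", "6d", "4a", "ef"};
        for (String raw : samples) {
            System.out.println("suffix=" + raw + " -> " + infer(raw));
        }
    }
}
```

Under these assumptions the output matches the reported values: `00` gives `0`, `03` gives `3`, `4d` gives `4.0`, and `4a` and `ef` are left as they are. A job that depends on the directory names being preserved could presumably set `spark.sql.sources.partitionColumnTypeInference.enabled` (the conf quoted in the report) to `false` for its session.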