Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/15951
  
    ```
    spark.read
      .schema(someSchemaWherePartitionColumnsAreStrings)
    ```
    I don't think this is a valid use case: `DataFrameReader` can't specify partition columns, so we will always infer partitions.
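    For concreteness, here is a minimal sketch of the scenario under discussion (the path, column names and types are made up): data is written out partitioned by a column, then read back with a user-specified schema that declares that partition column as a string. What type the partition column ends up with is exactly the question here.
    ```
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("partition-schema-sketch").getOrCreate()
    import spark.implicits._

    // Write a small dataset partitioned by `part` (hypothetical path).
    Seq((1, "a", 2016), (2, "b", 2017)).toDF("id", "name", "part")
      .write.partitionBy("part").parquet("/tmp/partitioned_table")

    // User-specified schema that types the partition column as a string.
    val userSchema = new StructType()
      .add("id", IntegerType)
      .add("name", StringType)
      .add("part", StringType)

    // Partition discovery still runs; whether `part` comes back as the inferred
    // IntegerType or the user-specified StringType is what this thread is about.
    val df = spark.read.schema(userSchema).parquet("/tmp/partitioned_table")
    df.printSchema()
    ```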
    
    I think the real problem is `HadoopFsRelation.schema`:
    ```
    val schema: StructType = {
      val dataSchemaColumnNames = dataSchema.map(_.name.toLowerCase).toSet
      StructType(dataSchema ++ partitionSchema.filterNot { column =>
        dataSchemaColumnNames.contains(column.name.toLowerCase)
      })
    }
    ```
    It silently drops partition columns from the partition schema when their names also appear in the data schema.
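    A self-contained sketch of that merge behavior (column names and types are made up), showing that the inferred partition type never makes it into the combined schema once the data schema already has a column with the same name (case-insensitively):
    ```
    import org.apache.spark.sql.types._

    val dataSchema = new StructType()
      .add("id", IntegerType)
      .add("part", StringType)     // data/user schema types `part` as string

    val partitionSchema = new StructType()
      .add("part", IntegerType)    // partition discovery inferred `part` as int

    // Same logic as HadoopFsRelation.schema: duplicates are resolved in favor of
    // the data schema, and the inferred partition type is dropped without warning.
    val dataSchemaColumnNames = dataSchema.map(_.name.toLowerCase).toSet
    val merged = StructType(dataSchema ++ partitionSchema.filterNot { column =>
      dataSchemaColumnNames.contains(column.name.toLowerCase)
    })
    // merged contains StructField(part, StringType); the inferred IntegerType is gone.
    ```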
    
    I think the best solution is to add `partitionBy` to `DataFrameReader` so that we can really skip partition inference. But that may be too late for 2.1, so for now we should define better semantics for the current "broken" API.
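    A hypothetical sketch of what that could look like; `DataFrameReader.partitionBy` does not exist today, this is only the shape of the proposal (mirroring `DataFrameWriter.partitionBy`):
    ```
    // Proposed, not an existing method: if the reader is told the partition
    // columns up front (with their types coming from the user schema), partition
    // discovery could skip type inference for them entirely.
    val df = spark.read
      .schema(userSchema)        // includes `part` as StringType
      .partitionBy("part")       // hypothetical reader API
      .parquet("/tmp/partitioned_table")
    ```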
    
    > Once we find what the partition columns are, we try to find them in the 
user specified schema and use the dataType provided there, or fall back to the 
smallest common data type.
    
    This LGTM
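    To make that concrete, here is a minimal sketch of the proposed resolution rule, written as a hypothetical standalone helper rather than the PR's actual code: prefer the user-specified type for a discovered partition column, otherwise fall back to the type inferred from the partition values.
    ```
    import org.apache.spark.sql.types._

    // `inferredType` stands in for whatever partition discovery produced
    // (the "smallest common data type" across the directory values).
    def resolvePartitionColumnType(
        name: String,
        userSchema: Option[StructType],
        inferredType: DataType): DataType = {
      userSchema
        .flatMap(schema => schema.find(f => f.name.equalsIgnoreCase(name)))
        .map(_.dataType)
        .getOrElse(inferredType)
    }

    // e.g. resolvePartitionColumnType("part", Some(userSchema), IntegerType)
    // returns StringType when the user schema declares `part` as a string.
    ```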

