Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/15951#discussion_r89012513
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -84,30 +84,96 @@ case class DataSource(
   private val caseInsensitiveOptions = new CaseInsensitiveMap(options)

   /**
-   * Infer the schema of the given FileFormat, returns a pair of schema and partition column names.
+   * Get the schema of the given FileFormat, if provided by `userSpecifiedSchema`, or try to infer
+   * it. In the read path, only Hive managed tables provide the partition columns properly when
+   * initializing this class. All other file based data sources will try to infer the partitioning,
+   * and then cast the inferred types to user specified dataTypes if the partition columns exist
+   * inside `userSpecifiedSchema`, otherwise we can hit data corruption bugs like SPARK-18510.
+   * This method will try to do the least amount of work given whether `userSpecifiedSchema` and
+   * `partitionColumns` are provided. Here are some code paths that use this method:
--- End diff --
Can you document what "least amount of work" is? That is, it will skip file
scanning if .....
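For illustration, the casting behavior the Scaladoc describes — overriding inferred partition column types with the user-specified dataTypes when those columns appear in `userSpecifiedSchema` — could be sketched roughly as below. This is a simplified, non-Spark sketch: `Field` and `combinePartitionSchema` are hypothetical names, not Spark's actual API.

```scala
// Minimal stand-in for a schema field (Spark uses StructField/DataType).
case class Field(name: String, dataType: String)

// Sketch: keep the inferred partition schema, but where a partition column
// also appears in the user-specified schema, take the user's dataType so the
// two never disagree (the kind of mismatch behind SPARK-18510).
def combinePartitionSchema(
    inferred: Seq[Field],
    userSpecified: Option[Seq[Field]]): Seq[Field] =
  userSpecified match {
    case None => inferred // no user schema: fall back to inferred types
    case Some(user) =>
      val byName = user.map(f => f.name -> f.dataType).toMap
      inferred.map { f =>
        // cast the inferred type to the user-specified one when present
        byName.get(f.name).fold(f)(t => f.copy(dataType = t))
      }
  }
```

The point of the fallback branch is the "least amount of work" theme of the comment: when no user schema exists there is nothing to reconcile, and the inferred result is used as-is.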