Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22343#discussion_r216218261
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala ---
@@ -69,12 +69,25 @@ class ParquetOptions(
.get(MERGE_SCHEMA)
.map(_.toBoolean)
.getOrElse(sqlConf.isParquetSchemaMergingEnabled)
+
+  /**
+   * How to resolve duplicated field names. By default, parquet data source fails when hitting
+   * duplicated field names in case-insensitive mode. When converting hive parquet table to parquet
+   * data source, we need to ask parquet data source to pick the first matched field - the same
+   * behavior as hive parquet table - to keep behaviors consistent.
+   */
+  val duplicatedFieldsResolutionMode: String = {
+    parameters.getOrElse(DUPLICATED_FIELDS_RESOLUTION_MODE,
--- End diff ---
I agree this is a little unusual. Usually we have a SQL config first, and then
we create an option for it if necessary. In this case, we are not adding a
config/option because of a user requirement; we need it for an internal
optimization.
If we can, I would suggest making it an internal option. In any case, we
shouldn't rush to add a SQL config until we get a requirement from users.
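
To illustrate what I mean by an internal option, here is a rough sketch of the
Hive-to-data-source conversion side setting the option directly instead of
reading a SQL config. The object name, helper, and the "pickFirst" value are
just placeholders for this sketch, not necessarily what this PR uses:

```scala
// Hypothetical sketch only: when converting a Hive parquet table to the
// Parquet data source, the conversion code could inject an internal
// (undocumented) option into the relation's options map, so no user-facing
// SQL config is needed. All names and values below are illustrative.
object HiveParquetConversionSketch {

  // Illustrative option key; assumed to match the key read in ParquetOptions.
  val DUPLICATED_FIELDS_RESOLUTION_MODE = "duplicatedFieldsResolutionMode"

  def dataSourceOptions(hiveTableOptions: Map[String, String]): Map[String, String] = {
    // Ask the Parquet data source to pick the first matched field, mirroring
    // Hive's behavior for duplicated field names, without exposing the knob
    // to end users.
    hiveTableOptions + (DUPLICATED_FIELDS_RESOLUTION_MODE -> "pickFirst")
  }
}
```

Since only the conversion path would ever set it, keeping it as an
undocumented option keeps the public surface unchanged.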
---