Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r199389588
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
@@ -182,18 +182,20 @@ private[parquet] class ParquetRowConverter(
   // Converters for each field.
   private val fieldConverters: Array[Converter with HasParentContainerUpdater] = {
-    parquetType.getFields.asScala.zip(catalystType).zipWithIndex.map {
-      case ((parquetFieldType, catalystField), ordinal) =>
-        // Converted field value should be set to the `ordinal`-th cell of `currentRow`
-        newConverter(parquetFieldType, catalystField.dataType, new RowUpdater(currentRow, ordinal))
+    parquetType.getFields.asScala.map {
+      case parquetField =>
+        val fieldIndex = catalystType.fieldIndex(parquetField.getName)
--- End diff --
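For context on the new lookup, a minimal sketch (the schema and names are illustrative, not taken from the patch) of how `StructType.fieldIndex` resolves a Parquet field name to an ordinal in the Catalyst schema:

```scala
import org.apache.spark.sql.types._

object FieldIndexSketch extends App {
  // Hypothetical Catalyst schema standing in for `catalystType`.
  val catalystType = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType)))

  // fieldIndex looks the column up by name rather than by position.
  println(catalystType.fieldIndex("name"))   // 1

  // An unknown name throws IllegalArgumentException, and since only a single
  // index can be returned per name, duplicate column names matter here.
  // catalystType.fieldIndex("missing")      // would throw
}
```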
Can the field name be used as the identifier here? Could you double check whether we
can save a Parquet file with duplicate column names? [Note: previous versions of
Spark did not check for name duplication, so I suspect they might have generated
files with duplicate column names.]
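
One quick way to check is a rough sketch like the following (the output path is a placeholder, and whether the write succeeds depends on the Spark version's duplicate-name validation):

```scala
import org.apache.spark.sql.SparkSession

object DuplicateColumnWriteCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("duplicate-column-write-check")
      .getOrCreate()
    import spark.implicits._

    // Build a DataFrame whose schema contains the column name "a" twice.
    val df = Seq((1, 2)).toDF("a", "b").select($"a", $"b".as("a"))
    df.printSchema()

    // If the writer does not validate the schema, this produces a Parquet file
    // with duplicate column names; otherwise it should fail the write with an
    // analysis-time error.
    df.write.mode("overwrite").parquet("/tmp/duplicate_column_check")

    spark.stop()
  }
}
```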
---