Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r199389588
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
@@ -182,18 +182,20 @@ private[parquet] class ParquetRowConverter(
   // Converters for each field.
   private val fieldConverters: Array[Converter with HasParentContainerUpdater] = {
-    parquetType.getFields.asScala.zip(catalystType).zipWithIndex.map {
-      case ((parquetFieldType, catalystField), ordinal) =>
-        // Converted field value should be set to the `ordinal`-th cell of `currentRow`
-        newConverter(parquetFieldType, catalystField.dataType, new RowUpdater(currentRow, ordinal))
+    parquetType.getFields.asScala.map {
+      case parquetField =>
+        val fieldIndex = catalystType.fieldIndex(parquetField.getName)
--- End diff --
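For context on the new lookup, a minimal sketch (the schema and names are illustrative, not taken from the patch) of how `StructType.fieldIndex` resolves a Parquet field name to an ordinal in the Catalyst schema:

```scala
import org.apache.spark.sql.types._

object FieldIndexSketch extends App {
  // Hypothetical Catalyst schema standing in for `catalystType`.
  val catalystType = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType)))

  // fieldIndex looks the column up by name rather than by position.
  println(catalystType.fieldIndex("name"))   // 1

  // An unknown name throws IllegalArgumentException, and since only a single
  // index can be returned per name, duplicate column names matter here.
  // catalystType.fieldIndex("missing")      // would throw
}
```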
Can the field name be used as the identifier here? Could you double check whether we
can save a Parquet file with duplicate column names? [Note: previous versions of
Spark did not check for name duplication, so I suspect they might have generated
files with duplicate column names.]
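
One quick way to check is a rough sketch like the following (the output path is a placeholder, and whether the write succeeds depends on the Spark version's duplicate-name validation):

```scala
import org.apache.spark.sql.SparkSession

object DuplicateColumnWriteCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("duplicate-column-write-check")
      .getOrCreate()
    import spark.implicits._

    // Build a DataFrame whose schema contains the column name "a" twice.
    val df = Seq((1, 2)).toDF("a", "b").select($"a", $"b".as("a"))
    df.printSchema()

    // If the writer does not validate the schema, this produces a Parquet file
    // with duplicate column names; otherwise it should fail the write with an
    // analysis-time error.
    df.write.mode("overwrite").parquet("/tmp/duplicate_column_check")

    spark.stop()
  }
}
```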
---