cloud-fan commented on code in PR #52557:
URL: https://github.com/apache/spark/pull/52557#discussion_r2441533153


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java:
##########
@@ -309,6 +316,58 @@ public void initBatch(StructType partitionColumns, InternalRow partitionValues)
     initBatch(MEMORY_MODE, partitionColumns, partitionValues);
   }
 
+  /**
+   * Keeps the hierarchy and fields of readType, recursively truncating struct fields from the end
+   * of the fields list to match the same number of fields in requestedType. This is used to get rid
+   * of the extra fields that are added to the structs when the fields we wanted to read initially
+   * were missing in the file schema. So this returns a type that we would be reading if everything
+   * was present in the file, matching Spark's expected schema.
+   *
+   * <p> Example: <pre>{@code
+   * readType:      array<struct<a:int,b:long,c:int>>
+   * requestedType: array<struct<a:int,b:long>>
+   * returns:       array<struct<a:int,b:long>>
+   * }</pre>
+   * We cannot return requestedType here because there might be slight differences, like nullability
Review Comment:
   shall we choose a different example? The current example simply returns `requestedType`.
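
For illustration only, the truncation described in the Javadoc can be sketched with a toy type model. The `Field`/`Struct` classes below are hypothetical stand-ins, not Spark's actual Parquet or Catalyst type classes; the point is that the result keeps `readType`'s own field instances (which may differ from `requestedType` in details such as nullability) while recursively dropping trailing struct fields so the field counts match `requestedType`:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified type model for illustration.
class Field {
    final String name;
    final Struct child; // null for leaf (primitive) fields
    Field(String name, Struct child) { this.name = name; this.child = child; }
}

class Struct {
    final List<Field> fields;
    Struct(List<Field> fields) { this.fields = fields; }
}

class Truncate {
    // Keep readType's field instances, but recursively truncate trailing
    // struct fields so each struct has as many fields as requestedType.
    static Struct truncate(Struct readType, Struct requestedType) {
        List<Field> kept = new ArrayList<>();
        for (int i = 0; i < requestedType.fields.size(); i++) {
            Field read = readType.fields.get(i);
            Field requested = requestedType.fields.get(i);
            if (read.child != null && requested.child != null) {
                // Nested struct: recurse so inner structs are truncated too.
                kept.add(new Field(read.name, truncate(read.child, requested.child)));
            } else {
                // Leaf field: keep readType's field as-is.
                kept.add(read);
            }
        }
        return new Struct(kept);
    }
}
```

With `readType = struct<a,b,c>` and `requestedType = struct<a,b>`, the sketch returns a struct built from `readType`'s `a` and `b` fields, dropping the trailing `c`.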



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

