Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/8509#discussion_r38866955
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala
---
@@ -160,4 +101,168 @@ private[parquet] object CatalystReadSupport {
val SPARK_ROW_REQUESTED_SCHEMA =
"org.apache.spark.sql.parquet.row.requested_schema"
val SPARK_METADATA_KEY = "org.apache.spark.sql.parquet.row.metadata"
+
+ /**
+ * Tailors `parquetSchema` according to `catalystSchema` by removing column paths that don't exist
+ * in `catalystSchema`, and adding those that only exist in `catalystSchema`.
+ */
+ def clipParquetSchema(parquetSchema: MessageType, catalystSchema: StructType): MessageType = {
+ val clippedParquetFields = clipParquetGroupFields(parquetSchema.asGroupType(), catalystSchema)
+ Types.buildMessage().addFields(clippedParquetFields: _*).named("root")
+ }
+
+ private def clipParquetType(parquetType: Type, catalystType: DataType): Type = {
+ catalystType match {
+ case t: ArrayType if !isPrimitiveCatalystType(t.elementType) =>
+ // Only clips array types with nested type as element type.
+ clipParquetListType(parquetType.asGroupType(), t.elementType)
+
+ case t: MapType if !isPrimitiveCatalystType(t.valueType) =>
+ // Only clips map types with nested type as value type.
+ clipParquetMapType(parquetType.asGroupType(), t.keyType, t.valueType)
+
+ case t: StructType =>
+ clipParquetGroup(parquetType.asGroupType(), t)
+
+ case _ =>
+ parquetType
--- End diff --
At first I thought it would be too complicated to add this assertion here,
since there can be multiple Parquet representations of a single Catalyst type,
and some of them may even conflict with each other. But I just realized that we
can simply use `CatalystSchemaConverter` to convert `parquetType` to a
Catalyst type and check whether the result matches `catalystType`. This works
because the mapping from Catalyst type to Parquet type is one-to-many, so
converting in the opposite direction, from Parquet type to Catalyst type,
yields a single Catalyst type to compare against.
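
To illustrate the idea, here is a toy sketch of such a check. It does not use
the real Spark or Parquet classes: `CatalystT`, `ParquetT`, `toCatalyst` and
`matches` are hypothetical stand-ins invented for this example, and the two
list layouts only mimic the fact that one Catalyst array type can have several
Parquet representations. The actual assertion would presumably call
`CatalystSchemaConverter` instead of `toCatalyst`.

```scala
// Toy model only: every type and function here is a hypothetical stand-in,
// not part of Spark or parquet-mr.
object ClipAssertionSketch {
  // Simplified Catalyst-side types.
  sealed trait CatalystT
  case object IntT extends CatalystT
  final case class ArrayT(element: CatalystT) extends CatalystT
  final case class StructT(fields: List[(String, CatalystT)]) extends CatalystT

  // Simplified Parquet-side types. Two list layouts stand in for the
  // multiple Parquet representations of a single Catalyst array type.
  sealed trait ParquetT
  case object Int32 extends ParquetT
  final case class StandardList(element: ParquetT) extends ParquetT
  final case class LegacyList(element: ParquetT) extends ParquetT
  final case class Group(fields: List[(String, ParquetT)]) extends ParquetT

  // Parquet -> Catalyst is many-to-one in this model, so the conversion is
  // an ordinary function with exactly one result per input.
  def toCatalyst(p: ParquetT): CatalystT = p match {
    case Int32           => IntT
    case StandardList(e) => ArrayT(toCatalyst(e))
    case LegacyList(e)   => ArrayT(toCatalyst(e))
    case Group(fs)       => StructT(fs.map { case (n, t) => n -> toCatalyst(t) })
  }

  // The check described above: convert the Parquet type, then compare the
  // result with the expected Catalyst type.
  def matches(p: ParquetT, c: CatalystT): Boolean = toCatalyst(p) == c
}
```

In this model both `StandardList(Int32)` and `LegacyList(Int32)` match
`ArrayT(IntT)`, while `Int32` does not, so the check accepts any valid
representation without having to enumerate them on the Catalyst side.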