prakharjain09 commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r756450557
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
##########
@@ -67,6 +67,27 @@ case class LogicalRelation(
s"Relation ${catalogTable.map(_.identifier.unquotedString).getOrElse("")}"
+
s"[${truncatedString(output, ",", maxFields)}] $relation"
}
+
+ override lazy val metadataOutput: Seq[AttributeReference] = relation match {
+ case _: HadoopFsRelation =>
+ val resolve = conf.resolver
+ val outputNames = outputSet.map(_.name)
+ def isOutputColumn(col: AttributeReference): Boolean = {
+ outputNames.exists(name => resolve(col.name, name))
+ }
+ // filter out metadata columns that have names conflicting with output columns. if the file
+ // has a column "_metadata", then the data column should be returned not the metadata column
+ Seq(FileFormat.FILE_METADATA_COLUMNS).filterNot(isOutputColumn)
+ case _ => Nil
+ }
+
+ override def withMetadataColumns(): LogicalRelation = {
+ if (metadataOutput.nonEmpty) {
+ this.copy(output = output ++ metadataOutput)
Review comment:
In this copy, the underlying `HadoopFsRelation` stays the same, so its
`dataSchema` field does not include the metadata columns. This breaks the
`SchemaPruning` rule, which relies on `dataSchema` and ends up creating an
invalid plan.
This needs handling, either by fixing the `dataSchema` in `HadoopFsRelation`
or by fixing the rule.
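
To make the mismatch concrete, here is a minimal toy sketch in plain Scala. It does not use Spark's classes: `ToyFsRelation`, `ToyLogicalRelation`, and `ToyPruning` are hypothetical stand-ins, and `unresolved` is only an assumed approximation of a rule that resolves output columns against `dataSchema`. It is not how `SchemaPruning` is actually implemented.

```scala
// Toy model only: these are NOT Spark's classes; all names are hypothetical.
final case class ToyFsRelation(dataSchema: Seq[String])

final case class ToyLogicalRelation(relation: ToyFsRelation, output: Seq[String]) {
  // Metadata columns that do not collide with an existing output column.
  val metadataOutput: Seq[String] = Seq("_metadata").filterNot(output.contains)

  // Mirrors the shape of the copy under review: output grows,
  // but the wrapped relation (and its dataSchema) is left untouched.
  def withMetadataColumns(): ToyLogicalRelation =
    if (metadataOutput.nonEmpty) copy(output = output ++ metadataOutput) else this
}

object ToyPruning {
  // Assumed stand-in for a dataSchema-driven rule: returns the output
  // columns that cannot be found in the relation's dataSchema.
  def unresolved(rel: ToyLogicalRelation): Seq[String] =
    rel.output.filterNot(rel.relation.dataSchema.contains)
}
```

In this toy model, calling `withMetadataColumns()` leaves `relation.dataSchema` unchanged, so `ToyPruning.unresolved` reports `_metadata` as a column that a `dataSchema`-based rule could not account for.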
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
##########
@@ -276,3 +276,10 @@ object LogicalPlanIntegrity {
checkIfSameExprIdNotReused(plan) && hasUniqueExprIdsForOutput(plan)
}
}
+
+/**
+ * A logical plan node that can generate metadata columns
+ */
+trait ExposesMetadataColumns extends LogicalPlan {
+ def withMetadataColumns(): ExposesMetadataColumns
Review comment:
Why does this need to return `ExposesMetadataColumns`? Can't it just
return `LogicalPlan`?
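
For context on what the two return types would mean for callers, here is a small sketch in plain Scala. The `Toy*` names are hypothetical and not part of Spark. It only illustrates the language mechanics: an override may narrow the declared return type, and a wider declared type hides subtype members from callers unless they pattern match. It does not settle which signature the PR should use.

```scala
// Toy types only; all names are hypothetical and not part of Spark.
trait ToyPlan { def output: Seq[String] }

trait ToyExposesMetadataColumns extends ToyPlan {
  // Declared with the narrower trait type, as in the diff under review.
  def withMetadataColumns(): ToyExposesMetadataColumns
}

final case class ToyRelation(output: Seq[String]) extends ToyExposesMetadataColumns {
  // Scala permits an override to narrow the return type further.
  override def withMetadataColumns(): ToyRelation =
    if (output.contains("_metadata")) this else copy(output = output :+ "_metadata")
}

object ToyCaller {
  // With the narrow type, the result can be passed on as the trait directly.
  def addTwice(p: ToyExposesMetadataColumns): ToyExposesMetadataColumns =
    p.withMetadataColumns().withMetadataColumns()

  // If the method returned only ToyPlan, a caller would have to match to
  // recover the trait before calling withMetadataColumns() again.
  def addIfSupported(p: ToyPlan): ToyPlan = p match {
    case m: ToyExposesMetadataColumns => m.withMetadataColumns()
    case other => other
  }
}
```

If no caller ever needs the trait's members on the result, returning the wider plan type would work equally well in this toy setting.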
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]