rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r513812453
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
import DataSourceV2Implicits._
+ override lazy val metadataOutput: Seq[AttributeReference] = table match {
+ case hasMeta: SupportsMetadataColumns =>
+ val attrs = hasMeta.metadataColumns
+ val outputNames = outputSet.map(_.name).toSet
+ attrs.filterNot(col => outputNames.contains(col.name)).toAttributes
Review comment:
I've been thinking about this suggestion. I think there are two ways we might implement the function-style approach, and I'll respond to each separately.
First, we could push the function to the data source so that metadata columns are no longer hidden columns. When requesting a projection, Spark would also pass an Expression (the public one), which would be something like `metadata_column("_file")` or `input_file()`. This would require a different interface, but it would also be more flexible (metadata tables could return things like `row_count()`). The trade-off is that the rules get a lot more complicated. The optimizer is needed to push expressions down to the leaves so that these functions can be pushed to the source, but non-deterministic expressions can prevent that pushdown, which could lead to cases where the function works and cases where it doesn't, because there is no way to run the function without pushing it to the data source. In addition, the function would need to _appear_ resolved when it actually isn't in order to make it to the optimizer. I also think we discussed this option in the v2 syncs and came to the conclusion that it would be simpler to project columns.
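To make the first option concrete, here is a rough sketch of the kind of source-side interface it would require. This is purely hypothetical: the `SupportsMetadataFunctions` trait name, the `pushMetadataFunctions` method, and its semantics are assumptions for illustration, not part of the DSv2 API in this PR.
```scala
// Hypothetical sketch only: NOT part of the DSv2 API proposed in this PR.
// A source that accepts function-style metadata expressions pushed down by Spark.
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Expression

trait SupportsMetadataFunctions extends Table {
  // Returns the expressions (e.g. metadata_column("_file"), input_file(), row_count())
  // that this source agrees to evaluate. Anything the source does not accept cannot be
  // computed at all, because Spark has no way to evaluate these functions itself; this
  // is why a pushdown blocked by, say, non-deterministic expressions would surface as a
  // query failure rather than a fallback.
  def pushMetadataFunctions(expressions: Array[Expression]): Array[Expression]
}
```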
Second, we could use a function to signal that a column is a metadata column only, and keep the interface to sources the same. The way to project `_file` would be `metadata_column("_file")`, and Spark would add `_file` to the output of the relation (and request it through the DSv2 API) if it finds a metadata column with that name. That is a nice way to avoid the current approach where columns are exposed through `metadataColumns` in the Spark plan. But it also shares the main drawback of the first option: converting the function call into an attribute reference is much more complicated.
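To show the shape of that conversion, here is a hedged sketch of an analysis-time rewrite. Everything in it is an assumption for illustration: the rule name, the `metadata_column` function name, and the matching details; it also omits the hard part, which is rewriting the relation's output so the column is actually requested from the source.
```scala
// Hedged sketch of option 2; this is not the approach taken in this PR.
import org.apache.spark.sql.catalyst.analysis.UnresolvedFunction
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

object ResolveMetadataColumnFunction extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
    case node =>
      // Metadata attributes exposed by DSv2 relations below this node.
      val metadataAttrs = node.children.flatMap(_.collectLeaves()).flatMap {
        case r: DataSourceV2Relation => r.metadataOutput
        case _ => Nil
      }
      node.transformExpressions {
        // Replace metadata_column('<name>') with the matching metadata attribute.
        // The relation's output would also have to be extended so the source actually
        // produces the column; that step is where the real complexity lies.
        case u: UnresolvedFunction if u.name.funcName == "metadata_column" =>
          u.arguments match {
            case Seq(Literal(value, _)) =>
              metadataAttrs.find(_.name == value.toString).getOrElse(u)
            case _ => u
          }
      }
  }
}
```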
The current implementation works well. We've been running this code in
production for months now, so I would prefer to move forward with this approach.
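For reference, the source-side contract stays small with the current approach. A minimal sketch of a table exposing a metadata column follows; `SupportsMetadataColumns` and `MetadataColumn` are the interfaces this PR adds, while the `FileBackedTable` trait name and the `_file` column are illustrative examples only.
```scala
// Illustrative sketch: a table exposing a hidden "_file" metadata column that a query
// can project by name (e.g. SELECT id, _file FROM t) once metadataOutput makes it
// resolvable, while keeping it out of star expansion and describe output.
import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
import org.apache.spark.sql.types.{DataType, StringType}

trait FileBackedTable extends SupportsMetadataColumns {
  override def metadataColumns(): Array[MetadataColumn] = Array(
    new MetadataColumn {
      override def name(): String = "_file"           // hidden unless explicitly projected
      override def dataType(): DataType = StringType  // path of the file each row came from
      override def comment(): String = "Input file name of the row"
    }
  )
}
```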