rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r513812453



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       I've been thinking about this suggestion. I think there are two ways we 
might implement the function-style approach that I'll respond to separately.
   
   First, we could push the function to the data source so that metadata 
columns are no longer hidden columns. When requesting a projection, Spark would 
also pass an Expression (the public one) and that would be something like 
`metadata_column("_file")` or `input_file()`. This would require a different 
interface, but would also be more flexible (metadata tables could return things 
like `row_count()`). The trade-off is that the rules are a lot more complicated.
The optimizer would be needed to push expressions down to the leaves so that these
functions can reach the source, but non-deterministic expressions can prevent that
pushdown. That could lead to cases where the function works and others where it
doesn't, because there is no way to run the function without pushing it to the
data source. The function would also need to _appear_ resolved when it actually
isn't in order to make it to the optimizer. Also, I think we discussed this option
in the v2 syncs and came to the conclusion that it would be simpler to project
columns.
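   
   To make the shape of that first option concrete, here is a rough, purely
hypothetical sketch; none of these names (`MetadataFunctionCall`,
`SupportsMetadataFunctions`, `pushMetadataFunctions`) exist in the DSv2 API, they
just illustrate what handing the source a public expression might look like:

```scala
// Hypothetical sketch only: none of these types exist in the DSv2 API.
// Option 1 would hand the source a public expression such as
// metadata_column("_file") or input_file() along with the requested projection.

// Simplified stand-in for the public Expression Spark would pass down.
trait MetadataFunctionCall {
  def functionName: String     // e.g. "metadata_column" or "input_file"
  def arguments: Seq[String]   // e.g. Seq("_file")
}

// Simplified stand-in for the mix-in a source would implement.
trait SupportsMetadataFunctions {
  // Returns the calls the source can actually evaluate; anything it cannot
  // evaluate has no fallback, which is the failure mode described above.
  def pushMetadataFunctions(calls: Seq[MetadataFunctionCall]): Seq[MetadataFunctionCall]
}
```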
   
Second, we could use a function purely as a signal that a column is a metadata
column, and keep the interface to sources the same. The way to project `_file`
would be via `metadata_column("_file")`, and Spark would add `_file` to the output
of the relation (and request it through the DSv2 API) if it finds a metadata
column with that name. That would nicely avoid the current approach of exposing
columns through `metadataColumns` in the Spark plan. But it also has the drawbacks
of the first option: converting the function call into an attribute reference is
much more complicated.
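   
   For that second option, the matching step itself is simple; a minimal sketch
with made-up stand-in types (not Spark's real `LogicalPlan` or
`AttributeReference` machinery) is below:

```scala
// Hypothetical sketch only, with simplified stand-in types. In option 2 the
// source still reports metadata columns via SupportsMetadataColumns; an
// analyzer rule would have to rewrite metadata_column("_file") into an
// attribute reference and add it to the relation's output.
case class MetadataCol(name: String, dataType: String)

def resolveMetadataFunction(
    requestedName: String,            // the literal passed to metadata_column(...)
    metadataColumns: Seq[MetadataCol] // columns reported by the source
): Option[MetadataCol] = {
  // Finding the column is the easy part; the complicated part (not shown) is
  // threading the resolved attribute through analysis so the scan produces it.
  metadataColumns.find(_.name == requestedName)
}
```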
   
   The current implementation works well. We've been running this code in 
production for months now, so I would prefer to move forward with this approach.
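   
   For comparison, this is roughly how the current approach looks from the query
side, assuming a V2 source that implements `SupportsMetadataColumns` and reports a
`_file` column; the table and column names here are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumes a V2 source that implements SupportsMetadataColumns and reports a
// "_file" metadata column; the table name and "id" column are hypothetical.
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.table("catalog.db.events")

// _file stays hidden from star expansion, but referencing it by name lets the
// analyzer resolve it against metadataOutput and request it from the source.
df.select(col("id"), col("_file")).show()
```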



