bart-samwel commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517176346
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
import DataSourceV2Implicits._
+ override lazy val metadataOutput: Seq[AttributeReference] = table match {
+ case hasMeta: SupportsMetadataColumns =>
+ val attrs = hasMeta.metadataColumns
+ val outputNames = outputSet.map(_.name).toSet
+ attrs.filterNot(col => outputNames.contains(col.name)).toAttributes
Review comment:
I think the column approach has the following benefits over the function
approach:
- You can scope where the calls may appear (you can't reference the metadata
unless you're directly SELECTing from the table, so no calls in places where
they don't make sense).
- You can call them directly from a join, e.g. `SELECT
t1._input_file_name, t2._input_file_name FROM Table1 t1 JOIN Table2 t2`. The
function syntax doesn't allow this, unless you make it something "special" like
`input_file_name(t1)`.
- You can easily push down the projection using existing mechanisms to turn
off the overhead of generating the columns. The semantics of such pushdown with
functions is iffy at best.
- If you convert this to the physical level directly, a column is more
compatible with physical transformations, e.g. things that add readahead and
buffering. The current approach like `input_file_name()` fundamentally makes
buffering and readahead hard; you'd have to convert it to columns under the
covers anyway.
However, the column approach has name conflicts. This gets problematic if
people select these columns without assigning new aliases, write the result to
files, and then try to read those files back. It means that either DataFrames
don't round-trip with 100% fidelity via files (which may break assumptions of
libraries that people have built), or you can't access metadata columns on
files that have columns with these names, which breaks libraries that assume
these features work. If I wrote `table._some_metadata_column` in an
auditing-related library, I'd want to be sure that I get what I asked for, and
that this cannot be spoofed.
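To make the shadowing semantics in the diff above concrete, here is a minimal, language-neutral sketch (in Python, not the actual Scala code from the patch) of how metadata columns whose names collide with data columns get filtered out:

```python
# Sketch (not Spark's actual implementation): a metadata column is hidden
# whenever a data column in the table's output already uses the same name.
def metadata_output(data_columns, metadata_columns):
    taken = set(data_columns)
    return [c for c in metadata_columns if c not in taken]

data = ["id", "value", "_file"]    # the table already has a "_file" column
meta = ["_file", "_partition"]     # metadata columns the source exposes
print(metadata_output(data, meta))  # only "_partition" survives
```

This is exactly the failure mode described above: once a file carries a column named `_file`, the metadata column of the same name becomes unreachable on read-back.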
How about using _special syntax_ like `input_file_name(t1)` or
`metadata(t1, file_name)`, where `t1` must be a table (a _data source_, not a
subquery/view) in the FROM clause, and is optional if there's only one table?
It wouldn't be a real function, just function-call-like syntax. That can be
converted into a special type of column reference under the covers, even at
parsing time, giving it all the advantages of columns from an implementation
perspective. The column reference could use a reserved name that is impossible
to write otherwise. When used in the SELECT list, we wouldn't assign that
column name as the SELECT list alias, so it wouldn't show up in files! You'd
still get an error at analysis time if you call it from a place where there's
no table, so there's no incorrect use like with normal nondeterministic
functions. This has the advantage that you can't have name conflicts. If we
added `input_file_name(t1)` and `input_file_name()` as such special functions
with special syntax, we'd have backward compatibility with existing queries and
could migrate the old mechanism to the new one. And to avoid having to add
extra functions for everything, we'd add `metadata(...)` for all other metadata
columns.
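One way to picture the reserved-name idea (hypothetical naming and representation, not an existing API): the parser rewrites `metadata(t1, file_name)` into a column reference whose internal name embeds a marker that can never appear in a user-written identifier, so a real data column can neither shadow nor spoof it:

```python
# Hypothetical sketch of the "reserved column name" proposal. The NUL
# marker cannot appear in a user-written identifier, so no data column
# can ever collide with the rewritten metadata reference.
RESERVED_MARKER = "\0metadata\0"

def metadata_ref(table_alias, meta_name):
    # What the parser might rewrite `metadata(t1, file_name)` into.
    return f"{table_alias}.{RESERVED_MARKER}{meta_name}"

def resolve(ref, data_columns, metadata_columns):
    alias, name = ref.split(".", 1)
    if name.startswith(RESERVED_MARKER):
        meta_name = name[len(RESERVED_MARKER):]
        if meta_name not in metadata_columns:
            raise ValueError(f"no metadata column {meta_name!r} on {alias}")
        return ("metadata", meta_name)
    if name not in data_columns:
        raise ValueError(f"no column {name!r} on {alias}")
    return ("data", name)

# Even though the table has a data column literally named "file_name",
# the reserved reference still resolves to the metadata column.
ref = metadata_ref("t1", "file_name")
print(resolve(ref, data_columns=["id", "file_name"],
              metadata_columns=["file_name", "partition"]))
```

Under this sketch, a plain reference like `t1.file_name` still resolves to the data column, while the rewritten reference is guaranteed to hit the metadata column, which is the non-spoofability property the auditing example above needs.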