rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517639689
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
import DataSourceV2Implicits._
+ override lazy val metadataOutput: Seq[AttributeReference] = table match {
+ case hasMeta: SupportsMetadataColumns =>
+ val attrs = hasMeta.metadataColumns
+ val outputNames = outputSet.map(_.name).toSet
+ attrs.filterNot(col => outputNames.contains(col.name)).toAttributes
Review comment:
@bart-samwel, I like your last suggestion, but with a slight change: I think sources should reject column names that conflict with their metadata columns. That way, a source can choose a format that makes sense (e.g., `__offset`) and allow everything else. All Spark has to do is define the behavior when there is a conflict, and recommend that sources reject column names that are also exposed as metadata columns.
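To make that concrete, here is a rough sketch of the kind of check a source could run when it accepts a data schema. This isn't part of the PR; `MetadataConflictCheck` and `validateDataSchema` are made-up names for illustration, and it only assumes the `MetadataColumn` interface added here:

```scala
import org.apache.spark.sql.connector.catalog.MetadataColumn
import org.apache.spark.sql.types.StructType

object MetadataConflictCheck {
  // Reject data column names that are also exposed as metadata columns.
  // A source would call something like this when a table is created or a
  // write schema is accepted, so conflicts are caught on the source side.
  def validateDataSchema(
      dataSchema: StructType,
      metadataColumns: Seq[MetadataColumn]): Unit = {
    val metadataNames = metadataColumns.map(_.name).toSet
    val conflicts = dataSchema.fieldNames.filter(metadataNames.contains)
    require(conflicts.isEmpty,
      s"Data column names conflict with metadata columns: ${conflicts.mkString(", ")}")
  }
}
```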
I think using a special syntax requires significantly more changes, so I
don't think that's a good option. Specific replies are below.
> If I'd write table._some_metadata_column in an auditing-related library,
I'd like to be sure that I get what I asked for, and that this cannot be
spoofed.
I think this is a fair argument, but I expect it would be very rare that you wanted a metadata column and accidentally projected a data column instead.
> How about using special syntax
This is what I considered above: "we could use a function to signal that a column is a metadata column only, and keep the interface to sources the same . . ." I think the challenge here is carrying that information through analysis and optimization. If the marker is kept as an expression, this approach has many of the same complications as using a function; otherwise, we would need to introduce a new attribute type for metadata attributes that would, hopefully, behave in existing rules just like normal attributes. We can't lose the fact that the attribute references metadata, so we can't simply replace a metadata attribute reference with a normal attribute reference.
I think the option least likely to introduce new problems is to use a regular attribute for these columns, but that gives up the benefit of a special syntax: tracking the fact that they are metadata column references. That doesn't seem like a good trade-off to me if all we want is to guarantee that metadata columns don't conflict with data columns.
The special syntax is also more awkward to use. One of the use cases for
this feature is to expose non-schema fields like Kafka topic, offset, and
partition. I think it is much more natural to select `_offset` than
`metadata_column("offset")`. And if these are exposed as columns, we can add
them to `DESCRIBE EXTENDED`, like I've done in this PR.
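For example (illustrative only, assuming a SparkSession `spark` and a Kafka-backed table `cat.db.events` whose source exposes `_partition` and `_offset` as metadata columns), the column form reads like any other query:

```scala
spark.sql("""
  SELECT key, value, _partition, _offset
  FROM cat.db.events
  WHERE _offset > 1000
""").show()

// The function-style alternative discussed above would look roughly like:
//   SELECT key, value, metadata_column("partition"), metadata_column("offset")
//   FROM cat.db.events
```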
> An alternative would be to make the column names use a recognizable
pattern that we can forbid as column names.
I like this suggestion, but I'd leave rejecting column names to sources, as I said above. Another option is to make the pattern available without enforcing it: we could add a check that data columns do not match `metadata$.*`, but not require that metadata column names use the pattern. That way there is a way to guarantee that names don't conflict, but a source that chooses a potentially conflicting name still works fine. A minimal sketch of what that optional check might look like is below.
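Something like this (the object and method names are hypothetical, not part of the PR):

```scala
import org.apache.spark.sql.types.StructType

object ReservedMetadataNames {
  // Data columns may not match the reserved pattern, but sources are not
  // required to name their metadata columns this way.
  private val reservedPattern = "metadata\\$.*".r

  def checkDataColumns(schema: StructType): Unit = {
    val reserved = schema.fieldNames
      .filter(name => reservedPattern.pattern.matcher(name).matches())
    require(reserved.isEmpty,
      s"Data columns may not use the reserved metadata name pattern: ${reserved.mkString(", ")}")
  }
}
```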