bart-samwel commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517651552
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
import DataSourceV2Implicits._
+ override lazy val metadataOutput: Seq[AttributeReference] = table match {
+ case hasMeta: SupportsMetadataColumns =>
+ val attrs = hasMeta.metadataColumns
+ val outputNames = outputSet.map(_.name).toSet
+ attrs.filterNot(col => outputNames.contains(col.name)).toAttributes
Review comment:
> @bart-samwel, I like your last suggestion, but with a slight change: I
think sources should reject column names that conflict with their metadata
columns. That way, a source can choose a format that makes sense (e.g.,
__offset) and allow everything else. All Spark has to do is to define what the
behavior will be when there is a conflict, and recommend that sources reject
column names that are also exposed as metadata columns.
They probably should reject such column names, yes. Especially at write time
-- you don't want to be in a situation where you wrote a file and can then
never read it back. But if you did that, how would you deal with it? Could you
never read it with Spark? One alternative is to just hide the column if it's in
the file. You wouldn't be able to access it, but at least that would satisfy
the "auditing library requirement" that the metadata fields always do the
expected thing. Another alternative would be to expose the column under a new
name. But what if that new name is also there? It gets very annoying very
quickly. So maybe just hiding is the best policy.
> I think using a special syntax requires significantly more changes, so I
don't think that's a good option. Specific replies are below.
>
> > If I'd write table._some_metadata_column in an auditing-related library,
I'd like to be sure that I get what I asked for, and that this cannot be
spoofed.
>
> I think this is a fair argument, but I expect it would very rarely be a
problem that you wanted a metadata column and accidentally projected a data
column.
It'll be rare, but if I have an auditing library that e.g. inserts a special
execution node on top of a scan to track exactly what was read, then it would
be great if people can't spoof that by adding a column with a particular
special name to their underlying files. If the internal metadata column names
take precedence over the columns in the file, then that problem is not there.
(FWIW, I usually try to apply the principle of "name resolution must be
environment independent", i.e., if a query defines something locally, e.g. an
alias, then something in the environment (such as a table's schema) shouldn't
be able to make that aspect of the query change meaning, because then things
will break randomly as a result of schema changes. This fits in the same
category.)
> > How about using special syntax
>
> This is what I considered above, "we could use a function to signal that a
column is a metadata column only, and keep the interface to sources the same .
. ." I think the challenge here is carrying the information through analysis
and optimization. This approach either has many of the same complications as
using a function (if kept as an expression), or we would need to introduce a
new attribute type for metadata attrs that would hopefully work in rules just
like normal attributes in existing rules. We can't lose the fact that the
attribute references metadata, so we can't replace a metadata attribute
reference with an attribute reference.
Yeah, I was thinking about the special AttributeReference type. But more in
the way of "use a very special naming convention that you can't type" -- maybe
because we'd reject it in the parser if you'd try to explicitly write an
identifier like that. We'd have to do something to prevent that column name
from showing up in output files etc. though, so it's not entirely foolproof.
> I think the option most likely to not introduce more problems is to use a
regular attribute for these columns, but that eliminates the benefit of a
special syntax to track the fact that they are metadata column references. This
doesn't seem like a good option to me, if all we want to do is guarantee that
metadata columns don't conflict with data columns.
>
> The special syntax is also more awkward to use. One of the use cases for
this feature is to expose non-schema fields like Kafka topic, offset, and
partition. I think it is much more natural to select `_offset` than
`metadata_column("offset")`. And if these are exposed as columns, we can add
them to `DESCRIBE EXTENDED`, like I've done in this PR.
That aspect of columns is certainly nice.
> > An alternative would be to make the column names use a recognizable
pattern that we can forbid as column names.
>
> I like this suggestion, but I'd leave rejecting column names to sources as
I said above. Another option is to make this available, but not enforce it. We
could add a check that data columns are not of the form `metadata$.*`, but not
require metadata column names using the pattern. That way there is a way to
guarantee names don't conflict, but if a source chooses to use a name that may
conflict it still works fine.
So we could set the example by using this pattern in data sources that we
control, but other data sources could choose otherwise. I guess that works. I'm
personally inclined to be a bit more prescriptive about these things, basically
because if something is a best practice but there's no reviewer for the next
data source that knows about this best practice, then the best practice won't
be followed. And before you know it, there's a data source out in the wild that
you can't migrate to use the best practice because there's already usage out
there. Maybe it's already enough if all the existing data sources consistently
use the pattern -- then at least it's "cargo cult proof".
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]