[GitHub] [spark] bart-samwel commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

GitBox Wed, 04 Nov 2020 13:51:26 -0800


bart-samwel commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517651552




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       > @bart-samwel, I like your last suggestion, but with a slight change: I 
think sources should reject column names that conflict with their metadata 
columns. That way, a source can choose a format that makes sense (e.g., 
__offset) and allow everything else. All Spark has to do is to define what the 
behavior will be when there is a conflict, and recommend that sources reject 
column names that are also exposed as metadata columns.
   
   They probably should reject such column names, yes. Especially at write time 
-- you don't want to be in a situation where you wrote a file and can then 
never read it back. But if you did that, how would you deal with it? Could you 
never read it with Spark? One alternative is to just hide the column if it's in 
the file. You wouldn't be able to access it, but at least that would satisfy 
the "auditing library requirement" that the metadata fields always do the 
expected thing. Another alternative would be to expose the column under a new 
name. But what if that new name is also there? It gets very annoying very 
quickly. So maybe just hiding is the best policy. 
   
   > I think using a special syntax requires significantly more changes, so I 
don't think that's a good option. Specific replies are below.
   > 
   > > If I'd write table._some_metadata_column in an auditing-related library, 
I'd like to be sure that I get what I asked for, and that this cannot be 
spoofed.
   > 
   > I think this is a fair argument, but I expect it would very rarely be a 
problem that you wanted a metadata column and accidentally projected a data 
column.
   
   It'll be rare, but if I have an auditing library that e.g. inserts a special 
execution node on top of a scan to track exactly what was read, then it would 
be great if people can't spoof that by adding a column with a particular 
special name to their underlying files. If the internal metadata column names 
take precedence over the columns in the file, then that problem is not there. 
(FWIW, I usually try to apply the principle of "name resolution must be 
environment independent", i.e., if a query defines something locally, e.g. an 
alias, then something in the environment (such as a table's schema) shouldn't 
be able to make that aspect of the query change meaning, because then things 
will break randomly as a result of schema changes. This fits in the same 
category.)
   
   > > How about using special syntax
   > 
   > This is what I considered above, "we could use a function to signal that a 
column is a metadata column only, and keep the interface to sources the same . 
. ." I think the challenge here is carrying the information through analysis 
and optimization. This approach either has many of the same complications as 
using a function (if kept as an expression), or we would need to introduce a 
new attribute type for metadata attrs that would hopefully work in rules just 
like normal attributes in existing rules. We can't lose the fact that the 
attribute references metadata, so we can't replace a metadata attribute 
reference with an attribute reference.
   
   Yeah, I was thinking about the special AttributeReference type. But more in 
the way of "use a very special naming convention that you can't type" -- maybe 
because we'd reject it in the parser if you'd try to explicitly write an 
identifier like that. We'd have to do something to prevent that column name 
from showing up in output files etc. though, so it's not entirely foolproof.
   
   > I think the option most likely to not introduce more problems is to use a 
regular attribute for these columns, but that eliminates the benefit of a 
special syntax to track the fact that they are metadata column references. This 
doesn't seem like a good option to me, if all we want to do is guarantee that 
metadata columns don't conflict with data columns.
   > 
   > The special syntax is also more awkward to use. One of the use cases for 
this feature is to expose non-schema fields like Kafka topic, offset, and 
partition. I think it is much more natural to select `_offset` than 
`metadata_column("offset")`. And if these are exposed as columns, we can add 
them to `DESCRIBE EXTENDED`, like I've done in this PR.
   
   That aspect of columns is certainly nice.
   
   > > An alternative would be to make the column names use a recognizable 
pattern that we can forbid as column names.
   > 
   > I like this suggestion, but I'd leave rejecting column names to sources as 
I said above. Another option is to make this available, but not enforce it. We 
could add a check that data columns are not of the form `metadata$.*`, but not 
require metadata column names using the pattern. That way there is a way to 
guarantee names don't conflict, but if a source chooses to use a name that may 
conflict it still works fine.
   
   So we could set the example by using this pattern in data sources that we 
control, but other data sources could choose otherwise. I guess that works. I'm 
personally inclined to be a bit more prescriptive about these things, basically 
because if something is a best practice but there's no reviewer for the next 
data source that knows about this best practice, then the best practice won't 
be followed. And before you know it, there's a data source out in the wild that 
you can't migrate to use the best practice because there's already usage out 
there. Maybe it's already enough if all the existing data sources consistently 
use the pattern -- then at least it's "cargo cult proof".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] bart-samwel commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Reply via email to