rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517639689
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
import DataSourceV2Implicits._
+ override lazy val metadataOutput: Seq[AttributeReference] = table match {
+ case hasMeta: SupportsMetadataColumns =>
+ val attrs = hasMeta.metadataColumns
+ val outputNames = outputSet.map(_.name).toSet
+ attrs.filterNot(col => outputNames.contains(col.name)).toAttributes
Review comment:
@bart-samwel, I like your last suggestion, but with a slight change: I think sources should reject column names that conflict with their metadata columns. That way, a source can choose a format that makes sense (e.g., `__offset`) and allow everything else. All Spark has to do is define the behavior when there is a conflict, and recommend that sources reject column names that are also exposed as metadata columns.
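To make that concrete, here is a rough sketch of the kind of check a source could run when it accepts a data schema. This isn't part of the PR; `MetadataConflictCheck` and `validateDataSchema` are made-up names for illustration, and it only assumes the `MetadataColumn` interface added here:

```scala
import org.apache.spark.sql.connector.catalog.MetadataColumn
import org.apache.spark.sql.types.StructType

object MetadataConflictCheck {
  // Reject data column names that are also exposed as metadata columns.
  // A source would call something like this when a table is created or a
  // write schema is accepted, so conflicts are caught on the source side.
  def validateDataSchema(
      dataSchema: StructType,
      metadataColumns: Seq[MetadataColumn]): Unit = {
    val metadataNames = metadataColumns.map(_.name).toSet
    val conflicts = dataSchema.fieldNames.filter(metadataNames.contains)
    require(conflicts.isEmpty,
      s"Data column names conflict with metadata columns: ${conflicts.mkString(", ")}")
  }
}
```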
I think using a special syntax requires significantly more changes, so I
don't think that's a good option. Specific replies are below.
> If I'd write table._some_metadata_column in an auditing-related library,
I'd like to be sure that I get what I asked for, and that this cannot be
spoofed.
I think this is a fair argument, but I expect it would be very rare that you wanted a metadata column and accidentally projected a data column instead.
> How about using special syntax
This is what I considered above: "we could use a function to signal that a column is a metadata column only, and keep the interface to sources the same . . ." I think the challenge here is carrying that information through analysis and optimization. If the marker is kept as an expression, this approach has many of the same complications as using a function; otherwise, we would need to introduce a new attribute type for metadata attributes that would, hopefully, behave in existing rules just like normal attributes. We can't lose the fact that the attribute references metadata, so we can't simply replace a metadata attribute reference with a normal attribute reference.
I think the option least likely to introduce new problems is to use a regular attribute for these columns, but that gives up the benefit of a special syntax: tracking the fact that they are metadata column references. That doesn't seem like a good trade-off to me if all we want is to guarantee that metadata columns don't conflict with data columns.
The special syntax is also more awkward to use. One of the use cases for
this feature is to expose non-schema fields like Kafka topic, offset, and
partition. I think it is much more natural to select `_offset` than
`metadata_column("offset")`. And if these are exposed as columns, we can add
them to `DESCRIBE EXTENDED`, like I've done in this PR.
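For example (illustrative only, assuming a SparkSession `spark` and a Kafka-backed table `cat.db.events` whose source exposes `_partition` and `_offset` as metadata columns), the column form reads like any other query:

```scala
spark.sql("""
  SELECT key, value, _partition, _offset
  FROM cat.db.events
  WHERE _offset > 1000
""").show()

// The function-style alternative discussed above would look roughly like:
//   SELECT key, value, metadata_column("partition"), metadata_column("offset")
//   FROM cat.db.events
```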
> An alternative would be to make the column names use a recognizable
pattern that we can forbid as column names.
I like this suggestion, but I'd leave rejecting column names to sources, as I said above. Another option is to make the pattern available without enforcing it: we could add a check that data columns do not match `metadata$.*`, but not require that metadata column names use the pattern. That way there is a way to guarantee that names don't conflict, but a source that chooses a potentially conflicting name still works fine. A minimal sketch of what that optional check might look like is below.
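Something like this (the object and method names are hypothetical, not part of the PR):

```scala
import org.apache.spark.sql.types.StructType

object ReservedMetadataNames {
  // Data columns may not match the reserved pattern, but sources are not
  // required to name their metadata columns this way.
  private val reservedPattern = "metadata\\$.*".r

  def checkDataColumns(schema: StructType): Unit = {
    val reserved = schema.fieldNames
      .filter(name => reservedPattern.pattern.matcher(name).matches())
    require(reserved.isEmpty,
      s"Data columns may not use the reserved metadata name pattern: ${reserved.mkString(", ")}")
  }
}
```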