Re: [PR] [SPARK-45815][SQL][Streaming] Provide an interface for other Streaming sources to add `_metadata` columns [spark]

via GitHub Thu, 09 Nov 2023 00:39:44 -0800


Yaohua628 commented on code in PR #43692:
URL: https://github.com/apache/spark/pull/43692#discussion_r1387657438



##########
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala:
##########
@@ -309,3 +309,22 @@ trait InsertableRelation {
 trait CatalystScan {
   def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
 }
+
+/**
+ * Implemented by StreamSourceProvider objects that can generate file metadata columns.
+ */
+trait SupportsStreamSourceMetadataColumns extends StreamSourceProvider {
+
+  /**
+   * Returns the metadata columns that should be added to the schema of the Stream Data Source.
+   *
+   * @param spark The SparkSession used for the operation.
+   * @param options A map of options of the Stream Data Source.
+   * @param userSpecifiedSchema An optional user-provided schema of the Stream Data Source.
+   * @return A Seq of AttributeReference representing the metadata output attributes.

Review Comment:
   I still prefer to use `Seq[AttributeReference]` here. We define the metadata 
output interface as a `Seq[Attribute]` 
([here](https://github.com/apache/spark/blob/06d8cbe073499ff16bca3165e2de1192daad3984/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L45)).
 Upstream, we generate the file metadata columns as a 
`Seq[AttributeReference]` 
([here](https://github.com/apache/spark/blob/06d8cbe073499ff16bca3165e2de1192daad3984/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L171)),
 and downstream, methods also require a `Seq[AttributeReference]` as input 
([here](https://github.com/apache/spark/blob/06d8cbe073499ff16bca3165e2de1192daad3984/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L428)).
 
   
   However, I am open to your suggestions. I've updated the doc for the 
interface to clarify its usage and make it more explicit. Hopefully, it is 
clearer now!
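
   For illustration, here is a minimal sketch of the type relationship discussed above. It uses simplified stand-in classes, not Spark's real `Attribute` and `AttributeReference`; the `withName` helper, the object name, and the column names are assumptions made up for the example. It only shows that returning the narrower `Seq[AttributeReference]` still satisfies a consumer declared on `Seq[Attribute]`, while also serving consumers that need the concrete type.

```scala
// Stand-in types for illustration only. Spark's real classes live in
// org.apache.spark.sql.catalyst.expressions and carry much more state.
sealed trait Attribute { def name: String }

final case class AttributeReference(name: String) extends Attribute {
  // Hypothetical helper: stands in for an operation that is only
  // available on the concrete type.
  def withName(newName: String): AttributeReference = copy(name = newName)
}

object MetadataColumnsSketch {
  // Producer side: returns the narrower element type.
  def metadataColumns(): Seq[AttributeReference] =
    Seq(AttributeReference("file_path"), AttributeReference("file_size"))

  // Consumer that needs the concrete type.
  def renameAll(cols: Seq[AttributeReference], prefix: String): Seq[AttributeReference] =
    cols.map(c => c.withName(prefix + c.name))

  // Consumer declared on the general type. Seq is covariant in Scala, so a
  // Seq[AttributeReference] is accepted here without a cast.
  def names(output: Seq[Attribute]): Seq[String] = output.map(_.name)

  def main(args: Array[String]): Unit = {
    val cols = metadataColumns()
    println(names(cols))
    println(names(renameAll(cols, "_metadata.")))
  }
}
```

   Had the producer returned `Seq[Attribute]` instead, the call to `renameAll` would need a downcast or a pattern match at the call site.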



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

