[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

via GitHub Mon, 24 Apr 2023 06:32:27 -0700


ryan-johnson-databricks commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1175286323



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##########
@@ -203,6 +203,21 @@ trait FileFormat {
    * method. Technically, a file format could choose suppress them, but that 
is not recommended.
    */
   def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+  /**
+   * The extractors to use when deriving file-constant metadata columns for 
this file format.
+   *
+   * By default, the value of a file-constant metadata column is obtained by 
looking up the column's
+   * name in the file's metadata column value map. However, implementations 
can override this method
+   * in order to provide an extractor that has access to the entire 
[[PartitionedFile]] when
+   * deriving the column's value.
+   *
+   * NOTE: Extractors are lazy, invoked only if the query actually selects 
their column at runtime.
+   *
+   * See also [[FileFormat.getFileConstantMetadataColumnValue]].
+   */
+  def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =

Review Comment:
   Today's API is two pieces:
   ```scala
   /** All fields of the _metadata struct (in declared order) */
   def metadataSchemaFields: Seq[StructField]
   /** Custom extractors for the subset of file-constant metadata columns that 
actually need them */
   def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any]
   ```
   Splitting the API would produce instead (at a minimum):
   ```scala
   /** Extractor for file-constant metadata columns whose value comes from the 
metadata map */
   val DEFAULT_EXTRACTOR: PartitionedFile => Any
   /** File-constant fields of the _metadata struct */
   def constantMetadataCols: Seq[(StructField, PartitionedFile => Any)]
   /** File-generated fields of the _metadata struct */
   def generatedMetadataCols: Seq[StructField]
   ```
   The only safety issue I thought of so far, is that it might be possible to 
get the name mapping wrong between columns and their extractors. But if that 
happened even the most basic unit test for the new column would fail. 
Meanwhile, name mapping is anyway needed internally because the list of 
metadata columns we have to work with is named attributes or even just column 
names extracted from the query plan itself. Thus, the pairing proposed above 
would anyway have to be reconstructed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

Reply via email to