[GitHub] [spark] cloud-fan commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

via GitHub Mon, 24 Apr 2023 07:58:38 -0700


cloud-fan commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1175418393



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##########
@@ -203,6 +203,21 @@ trait FileFormat {
    * method. Technically, a file format could choose suppress them, but that is not recommended.
    */
   def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+  /**
+   * The extractors to use when deriving file-constant metadata columns for this file format.
+   *
+   * By default, the value of a file-constant metadata column is obtained by looking up the column's
+   * name in the file's metadata column value map. However, implementations can override this method
+   * in order to provide an extractor that has access to the entire [[PartitionedFile]] when
+   * deriving the column's value.
+   *
+   * NOTE: Extractors are lazy, invoked only if the query actually selects their column at runtime.
+   *
+   * See also [[FileFormat.getFileConstantMetadataColumnValue]].
+   */
+  def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =

Review Comment:
   My preference is to have a simple and clean framework to generate constant 
metadata columns. I think the framework is a bit complicated right now:
   1. FileFormat implementations can define a few constant metadata columns 
(name and data type).
   2. The constant metadata columns should either match a built-in name in 
`FileScanRDD`, or be filled by a custom `FileIndex`. With this PR, there is no 
built-in name. It's either filled by `FileIndex` or has an extractor.
   3. With this PR, FileFormat implementations can provide extractors for 
certain constant metadata columns.
   
   Basically, we get the value of a constant metadata column from two maps: one 
map is filled by `FileIndex` and contains the value directly. The other map is 
filled by the `FileFormat` implementation and contains extractors.
   
   It would be great if we could replace the first map with the second one. But 
if we can't, I'm fine with the current API, since we can't avoid name lookup 
anyway. Can we add more comments to the new API, saying that it may handle only 
some of the constant metadata columns? The rest of them should be handled by a 
custom `FileIndex`.
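
The two-map lookup described in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual code: `FileStub` is a hypothetical stand-in for `PartitionedFile`, the column names are made up for the example, and the precedence (extractor first, then the value map filled by the `FileIndex`) is an assumption based on the doc comment in the diff.

```scala
// Hypothetical stand-in for Spark's PartitionedFile, reduced to what the
// sketch needs. `constantValues` plays the role of the map filled by a
// custom FileIndex (the "first map" in the comment).
final case class FileStub(
    path: String,
    length: Long,
    constantValues: Map[String, Any])

object ConstantMetadataSketch {
  // The "second map": extractors supplied by a file format, keyed by
  // column name. The column names here are illustrative only.
  val extractors: Map[String, FileStub => Any] = Map(
    "file_path" -> ((f: FileStub) => f.path),
    "file_size" -> ((f: FileStub) => f.length)
  )

  // Assumed precedence: use an extractor when one is registered for the
  // column, otherwise fall back to a plain name lookup in the value map.
  // Returns None when neither map knows the column.
  def constantValue(name: String, file: FileStub): Option[Any] =
    extractors.get(name) match {
      case Some(extract) => Some(extract(file))
      case None          => file.constantValues.get(name)
    }
}
```

In this sketch the extractors cover only some of the columns, and anything else still has to come from the value map, which is the point the requested API comment would make explicit.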



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

