ryan-johnson-databricks commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1173758526
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##########
@@ -203,6 +203,21 @@ trait FileFormat {
* method. Technically, a file format could choose to suppress them, but that is not recommended.
*/
def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+ /**
+ * The extractors to use when deriving file-constant metadata columns for this file format.
+ *
+ * A scanner must derive each file-constant metadata field's value from each [[PartitionedFile]]
+ * it processes. By default, the value is obtained by a direct lookup of the column's name on
+ * [[PartitionedFile.otherConstantMetadataColumnValues]] (see
+ * [[FileFormat.getFileConstantMetadataColumnValue]]). However, implementations can override this
+ * method in order to provide more sophisticated lazy extractors (e.g. in case the column value is
+ * complicated or expensive to compute).
Review Comment:
I thought I _did_ describe it explicitly:
1. If you provide an extractor, the extractor has access to all state in the
`PartitionedFile` (including the column value map) and can do any computations
it needs to.
2. Otherwise, the column's value is fetched from the column value map.
Was it not clear [enough] in the comment that the extractor has access to
the entire `PartitionedFile`?
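The two lookup paths described in the list above can be sketched with simplified stand-in types. Note this is a hedged illustration only: `PartitionedFileLike`, `defaultLookup`, and `extensionExtractor` are hypothetical names invented for the sketch, not Spark's actual `PartitionedFile`/`FileFormat` API.

```scala
// Sketch of the two extraction paths, using simplified stand-in types
// (NOT Spark's actual PartitionedFile / FileFormat classes).
object MetadataExtractorSketch {
  // Stand-in for PartitionedFile: carries a map of precomputed column values.
  case class PartitionedFileLike(
      filePath: String,
      otherConstantMetadataColumnValues: Map[String, Any])

  // Path 2 (the default): direct lookup of the column's name on the map.
  def defaultLookup(name: String)(file: PartitionedFileLike): Option[Any] =
    file.otherConstantMetadataColumnValues.get(name)

  // Path 1: a custom extractor sees the entire file object and may compute
  // its value lazily, e.g. derive something from the file path instead of
  // requiring it to be precomputed into the map.
  val extensionExtractor: PartitionedFileLike => Option[Any] =
    file => Some(file.filePath.split('.').last)

  def main(args: Array[String]): Unit = {
    val file = PartitionedFileLike(
      "/data/part-0000.parquet",
      Map("row_count" -> 42L))

    println(defaultLookup("row_count")(file))   // value fetched from the map
    println(defaultLookup("missing")(file))     // absent from the map -> None
    println(extensionExtractor(file))           // computed from file state
  }
}
```

The point of path 1 is exactly what the comment says: the extractor receives the whole file object, including the column value map, so it can fall back to the map or ignore it entirely.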
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]