ryan-johnson-databricks commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1173758526
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##########
@@ -203,6 +203,21 @@ trait FileFormat {
* method. Technically, a file format could choose to suppress them, but that is not recommended.
*/
def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+ /**
+ * The extractors to use when deriving file-constant metadata columns for this file format.
+ *
+ * A scanner must derive each file-constant metadata field's value from each [[PartitionedFile]]
+ * it processes. By default, the value is obtained by a direct lookup of the column's name on
+ * [[PartitionedFile.otherConstantMetadataColumnValues]] (see
+ * [[FileFormat.getFileConstantMetadataColumnValue]]). However, implementations can override this
+ * method in order to provide more sophisticated lazy extractors (e.g. in case the column value is
+ * complicated or expensive to compute).
Review Comment:
I thought I _did_ describe it explicitly:
1. If you provide an extractor, the extractor has access to all state in the
`PartitionedFile` (including the column value map) and can do any computations
it needs to.
2. Otherwise, the column's value is fetched from the column value map.
Was it not clear [enough] in the comment that the extractor has access to
the entire `PartitionedFile`?
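The two lookup paths described in the list above can be sketched with simplified stand-in types. Note this is a hedged illustration only: `PartitionedFileLike`, `defaultLookup`, and `extensionExtractor` are hypothetical names invented for the sketch, not Spark's actual `PartitionedFile`/`FileFormat` API.

```scala
// Sketch of the two extraction paths, using simplified stand-in types
// (NOT Spark's actual PartitionedFile / FileFormat classes).
object MetadataExtractorSketch {
  // Stand-in for PartitionedFile: carries a map of precomputed column values.
  case class PartitionedFileLike(
      filePath: String,
      otherConstantMetadataColumnValues: Map[String, Any])

  // Path 2 (the default): direct lookup of the column's name on the map.
  def defaultLookup(name: String)(file: PartitionedFileLike): Option[Any] =
    file.otherConstantMetadataColumnValues.get(name)

  // Path 1: a custom extractor sees the entire file object and may compute
  // its value lazily, e.g. derive something from the file path instead of
  // requiring it to be precomputed into the map.
  val extensionExtractor: PartitionedFileLike => Option[Any] =
    file => Some(file.filePath.split('.').last)

  def main(args: Array[String]): Unit = {
    val file = PartitionedFileLike(
      "/data/part-0000.parquet",
      Map("row_count" -> 42L))

    println(defaultLookup("row_count")(file))   // value fetched from the map
    println(defaultLookup("missing")(file))     // absent from the map -> None
    println(extensionExtractor(file))           // computed from file state
  }
}
```

The point of path 1 is exactly what the comment says: the extractor receives the whole file object, including the column value map, so it can fall back to the map or ignore it entirely.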
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]