cloud-fan commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1175418393
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##########
@@ -203,6 +203,21 @@ trait FileFormat {
* method. Technically, a file format could choose suppress them, but that is not recommended.
*/
def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+ /**
+ * The extractors to use when deriving file-constant metadata columns for this file format.
+ *
+ * By default, the value of a file-constant metadata column is obtained by looking up the column's
+ * name in the file's metadata column value map. However, implementations can override this method
+ * in order to provide an extractor that has access to the entire [[PartitionedFile]] when
+ * deriving the column's value.
+ *
+ * NOTE: Extractors are lazy, invoked only if the query actually selects their column at runtime.
+ *
+ * See also [[FileFormat.getFileConstantMetadataColumnValue]].
+ */
+ def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =
Review Comment:
My preference is to have a simple and clean framework to generate constant
metadata columns. I think the framework is a bit complicated right now:
1. FileFormat implementations can define a few constant metadata columns
(name and data type).
2. Previously, a constant metadata column had to either match a built-in name
in `FileScanRDD` or be filled by a custom `FileIndex`. With this PR, there are
no built-in names: each column is either filled by `FileIndex` or has an
extractor.
3. With this PR, FileFormat implementations can provide extractors for
certain constant metadata columns.
Basically, we get the value of a constant metadata column from two maps: one
map is filled by `FileIndex` and contains the value directly. The other map is
filled by the `FileFormat` implementation and contains extractors.
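To make the two-map lookup concrete, here is a minimal sketch of how I read
it. It is not the code in this PR: `FileStub` is a hypothetical stand-in for
`PartitionedFile`, `constantMetadata` stands for the value map filled by
`FileIndex`, and the column names and the "extractor first, then value map"
order are assumptions for illustration.

```scala
// Hypothetical stand-in for PartitionedFile, reduced to what the sketch needs.
final case class FileStub(
    path: String,
    length: Long,
    constantMetadata: Map[String, Any]) // values a custom FileIndex would fill in

object ConstantMetadataSketch {
  type Extractor = FileStub => Any

  // Hypothetical extractors a FileFormat implementation might provide.
  // They cover only some of the constant metadata columns.
  val extractors: Map[String, Extractor] = Map(
    "file_size" -> ((f: FileStub) => f.length),
    "file_name" -> ((f: FileStub) => f.path.substring(f.path.lastIndexOf('/') + 1))
  )

  // Resolve a column by name: use an extractor if one is registered,
  // otherwise fall back to the per-file value map.
  def constantValue(name: String, file: FileStub): Option[Any] =
    extractors.get(name).map(_(file)).orElse(file.constantMetadata.get(name))
}
```

Either way there is a lookup by column name; the extractor map only changes
where the value comes from for the columns it covers.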
It would be great if we could replace the first map with the second one. But if
we can't, I'm fine with the current API, as we can't avoid the name lookup
anyway. Can we add more comments to the new API, saying that the extractors may
only handle some of the constant metadata columns? The rest of them should be
handled by a custom `FileIndex`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]