ryan-johnson-databricks commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1175286323
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##########
@@ -203,6 +203,21 @@ trait FileFormat {
* method. Technically, a file format could choose suppress them, but that
is not recommended.
*/
def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+ /**
+ * The extractors to use when deriving file-constant metadata columns for
this file format.
+ *
+ * By default, the value of a file-constant metadata column is obtained by
looking up the column's
+ * name in the file's metadata column value map. However, implementations
can override this method
+ * in order to provide an extractor that has access to the entire
[[PartitionedFile]] when
+ * deriving the column's value.
+ *
+ * NOTE: Extractors are lazy, invoked only if the query actually selects
their column at runtime.
+ *
+ * See also [[FileFormat.getFileConstantMetadataColumnValue]].
+ */
+ def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =
Review Comment:
Today's API is two pieces:
```scala
/** All fields of the _metadata struct (in declared order) */
def metadataSchemaFields: Seq[StructField]
/** Custom extractors for the subset of file-constant metadata columns that
actually need them */
def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any]
```
Splitting the API would produce instead (at a minimum):
```scala
/** Extractor for file-constant metadata columns whose value comes from the
metadata map */
val DEFAULT_EXTRACTOR: PartitionedFile => Any
/** File-constant fields of the _metadata struct */
def constantMetadataCols: Seq[(StructField, PartitionedFile => Any)]
/** File-generated fields of the _metadata struct */
def generatedMetadataCols: Seq[StructField]
```
The only safety issue I thought of so far, is that it might be possible to
get the name mapping wrong between columns and their extractors. But if that
happened even the most basic unit test for the new column would fail.
Meanwhile, name mapping is anyway needed internally because the list of
metadata columns we have to work with is named attributes or even just column
names extracted from the query plan itself. Thus, the pairing proposed above
would anyway have to be reconstructed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]