ryan-johnson-databricks commented on code in PR #40677:
URL: https://github.com/apache/spark/pull/40677#discussion_r1161713166
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileIndex.scala:
##########
@@ -23,11 +23,30 @@ import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.StructType
+/**
+ * A file status augmented with optional metadata. File formats can use the extra metadata to expose
+ * custom file-constant metadata columns, but in general tasks and readers can use the per-file
+ * metadata however they see fit.
+ */
+case class FileStatusWithMetadata(fileStatus: FileStatus, metadata: Map[String, Any] = Map.empty) {
Review Comment:
See my
[TODO](https://github.com/apache/spark/pull/40677/files/6d35127b60f77475bf4b158b762468f30ec3dd9a#diff-4445cc3828e35092eb261467b499b8b0ef69ae694ea8ce25abf16b8ef4b72fbaR282)
above. We may need to consider supporting value-producing functions, to
allow full pruning in cases where the value is expensive to compute.
Requiring `Literal` would block that (and AFAIK only `Any` can capture both
`Literal` and `() => Literal`).
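
To make the `Literal` vs. `() => Literal` point concrete, here is a rough sketch of how a `Map[String, Any]` could hold both eager and lazily computed values. It is not code from the PR: `Literal` below is a hypothetical stand-in for the catalyst class so the snippet runs without Spark, and `LazyMetadataSketch`, `resolve`, `project`, and the `size`/`checksum` keys are made up for illustration.

```scala
// Hypothetical stand-in for Spark's catalyst Literal, so the sketch runs without Spark.
final case class Literal(value: Any)

object LazyMetadataSketch {
  // Counts how often the "expensive" value is actually computed.
  var expensiveCalls = 0

  // Values typed as Any can be an eager Literal or a thunk that produces one.
  val metadata: Map[String, Any] = Map(
    "size" -> Literal(1024L),
    "checksum" -> (() => { expensiveCalls += 1; Literal("abc123") })
  )

  // Turn a metadata entry into a Literal, invoking the thunk only if there is one.
  def resolve(entry: Any): Literal = entry match {
    case lit: Literal => lit
    case thunk: Function0[_] =>
      thunk() match {
        case lit: Literal => lit
        case other =>
          throw new IllegalArgumentException(s"thunk returned non-Literal: $other")
      }
    case other =>
      throw new IllegalArgumentException(s"unsupported metadata value: $other")
  }

  // Only the requested keys are resolved; thunks for pruned keys never run.
  def project(requested: Seq[String]): Map[String, Literal] =
    requested.map(key => key -> resolve(metadata(key))).toMap
}
```

With this shape, `LazyMetadataSketch.project(Seq("size"))` never touches the `checksum` thunk, which is the kind of pruning a `Literal`-only signature would rule out. The cost is that type errors surface at runtime in `resolve` rather than at compile time.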
The `FILE_PATH` case that calls `Path.toString`, and the call sites of
`PartitionedFile`, are a small example of that possibility, which got me
thinking: what if, instead of passing length, path, etc. as arguments, we just
passed the actual file status and used the extractors on it? It probably
doesn't make sense to _actually_ do that for the hard-wired cases, though.
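
For illustration only, the "pass the file status and use extractors" idea might look roughly like the sketch below. Everything here is an assumption: `SimpleFileStatus` is a simplified stand-in for Hadoop's `FileStatus`, and `ExtractorSketch`, `extractors`, and `extract` are hypothetical names, not anything in the PR.

```scala
// Hypothetical, simplified stand-in for Hadoop's FileStatus.
final case class SimpleFileStatus(path: String, length: Long, modificationTime: Long)

object ExtractorSketch {
  // Each file-constant field is described by a function over the file status,
  // instead of being passed around as a separate argument.
  val extractors: Map[String, SimpleFileStatus => Any] = Map(
    "file_path" -> ((s: SimpleFileStatus) => s.path),
    "file_size" -> ((s: SimpleFileStatus) => s.length),
    "file_modification_time" -> ((s: SimpleFileStatus) => s.modificationTime)
  )

  // Apply only the extractors for the requested fields.
  def extract(status: SimpleFileStatus, requested: Seq[String]): Map[String, Any] =
    requested.map(name => name -> extractors(name)(status)).toMap
}
```

A call such as `ExtractorSketch.extract(status, Seq("file_path", "file_size"))` only evaluates the two requested fields, so a costly conversion like a path-to-string call would be skipped whenever that column is pruned.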
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]