[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

via GitHub Mon, 10 Apr 2023 06:18:56 -0700


ryan-johnson-databricks commented on code in PR #40677:
URL: https://github.com/apache/spark/pull/40677#discussion_r1161713166



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileIndex.scala:
##########
@@ -23,11 +23,30 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.types.StructType
 
+/**
+ * A file status augmented with optional metadata. File formats can use the extra metadata to expose
+ * custom file-constant metadata columns, but in general tasks and readers can use the per-file
+ * metadata however they see fit.
+ */
+case class FileStatusWithMetadata(fileStatus: FileStatus, metadata: Map[String, Any] = Map.empty) {

Review Comment:
   See my [TODO](https://github.com/apache/spark/pull/40677/files/6d35127b60f77475bf4b158b762468f30ec3dd9a#diff-4445cc3828e35092eb261467b499b8b0ef69ae694ea8ce25abf16b8ef4b72fbaR282) above... we may need to consider supporting value-producing functions, to allow full pruning in cases where the value is somehow expensive to compute. Requiring `Literal` would block that (and AFAIK only `Any` could capture both `Literal` and `() => Literal`).
   
   The `FILE_PATH` case that calls `Path.toString`, and the call sites of `PartitionedFile`, are a small example of that possibility, and they got me thinking: what if, instead of passing length, path, etc. as arguments, we just passed the actual file status and used the extractors on it? It probably doesn't make sense to _actually_ do that for the hard-wired cases, though.
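
   To make the `Any` point concrete, here is a minimal sketch, not code from this PR. `Lit`, `MetadataSketch`, `resolve`, and `project` are hypothetical names, and `Lit` is only a stand-in for Catalyst's `Literal` so the snippet runs without Spark on the classpath. It assumes a metadata value is either an eager literal or a zero-argument function producing one, and that only the requested fields get resolved.

```scala
// Hypothetical stand-in for Catalyst's Literal; NOT the real Spark class.
final case class Lit(value: Any)

object MetadataSketch {
  // Resolve one metadata entry. The entry may be an eager Lit, a deferred
  // `() => Lit` (or `() => rawValue`), or a raw value that still needs wrapping.
  def resolve(entry: Any): Lit = entry match {
    case lit: Lit => lit
    case thunk: Function0[_] =>
      thunk() match {
        case lit: Lit => lit
        case other    => Lit(other)
      }
    case other => Lit(other)
  }

  // Resolve only the fields the query asked for, so thunks behind
  // pruned-away fields are never invoked.
  def project(metadata: Map[String, Any], requested: Seq[String]): Map[String, Lit] =
    requested.flatMap(name => metadata.get(name).map(v => name -> resolve(v))).toMap

  // Usage: returns how many times the (pretend) expensive thunk ran
  // when only the cheap field was requested.
  def demo(): Int = {
    var calls = 0
    val metadata: Map[String, Any] = Map(
      "size"     -> Lit(42L),
      "checksum" -> (() => { calls += 1; Lit("abc") }) // assumed to be expensive
    )
    project(metadata, Seq("size"))
    calls
  }
}
```

   With `Map[String, Literal]` the `checksum` entry above would have to be computed up front for every file, whether or not the query selects it. The cost of `Any` is that type checking moves into the pattern match at resolution time.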


