[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

via GitHub Mon, 10 Apr 2023 16:03:37 -0700


ryan-johnson-databricks commented on code in PR #40677:
URL: https://github.com/apache/spark/pull/40677#discussion_r1162154097



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileIndex.scala:
##########
@@ -23,11 +23,30 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.types.StructType
 
+/**
+ * A file status augmented with optional metadata. File formats can use the extra metadata to expose
+ * custom file-constant metadata columns, but in general tasks and readers can use the per-file
+ * metadata however they see fit.
+ */
+case class FileStatusWithMetadata(fileStatus: FileStatus, metadata: Map[String, Any] = Map.empty) {

Review Comment:
   Updated the doc comment here to explain that file-source metadata fields are only one possible use of the extra file metadata (which conceptually sits at a deeper layer than catalyst and `Literal`).
   
   Also updated the `isSupportedType` doc comment to explain why not all types are supported.
   
   Relevant implementation details: 
   1. It would take a lot of work to support all data types, regardless of 
whether we use `Literal` vs. `Any`.
   2. We end up wrapping the provided value in a call to `Literal(_)` anyway, because doing so simplifies null handling by making null-because-missing equivalent to null-because-null. At that point, we get wrapping of primitive values "for free" if we pass `Any` instead of `Literal`.
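
   A minimal sketch of the null-handling point in item 2, not the PR's actual code. `Lit` is a hypothetical stand-in for catalyst's `Literal`, and `MetadataSketch.metadataLiteral` is an illustrative helper name; both are assumptions made so the example runs without Spark on the classpath.

```scala
// Hypothetical stand-in for catalyst's Literal; NOT the Spark class.
final case class Lit(value: Any)

object MetadataSketch {
  // Hypothetical helper: look up a per-file metadata value and wrap it.
  // A missing key and a key explicitly mapped to null both become Lit(null),
  // so callers do not need to tell the two cases apart.
  def metadataLiteral(metadata: Map[String, Any], key: String): Lit =
    metadata.getOrElse(key, null) match {
      case alreadyWrapped: Lit => alreadyWrapped // caller passed a wrapped value
      case raw                 => Lit(raw)       // primitives and null get wrapped here
    }
}
```

   Because the wrapping step runs on every lookup regardless, accepting `Any` in the map means callers can pass plain values such as `42L` without wrapping them first.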



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

