[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40677: [SPARK-43039][SQL] Support custom fields in the file source _metadata column.

via GitHub Mon, 10 Apr 2023 07:41:34 -0700


ryan-johnson-databricks commented on code in PR #40677:
URL: https://github.com/apache/spark/pull/40677#discussion_r1161778302



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileIndex.scala:
##########
@@ -23,11 +23,30 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.types.StructType
 
+/**
+ * A file status augmented with optional metadata. File formats can use the extra metadata to expose
+ * custom file-constant metadata columns, but in general tasks and readers can use the per-file
+ * metadata however they see fit.
+ */
+case class FileStatusWithMetadata(fileStatus: FileStatus, metadata: Map[String, Any] = Map.empty) {

Review Comment:
   Update: I now remember another reason I added `isSupportedDataType`:
`ConstantColumnVector` (needed by 
[FileScanRDD...createMetadataColumnVector](https://github.com/apache/spark/pull/40677/files/6d35127b60f77475bf4b158b762468f30ec3dd9a#diff-b7b097f8cec6a7ae6640f9ecd6d4ac14ed304ad7c8db802a3ec2c0983535e157R200)
 below) supports only a limited subset of types and relies on type-specific
getters and setters. Even if I wrote the (complex, recursive) code to handle
structs, maps, and arrays, we still wouldn't have complete coverage for all types.
   
   Do we know for certain that `ConstantColumnVector` supports all types that 
can ever be encountered during vectorized execution? If not, we must keep the 
`isSupportedDataType` method I introduced.
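
   To make the concern concrete, here is a minimal sketch of what such an allow-list check could look like. It is not the code from this PR: the object name `MetadataTypeCheck` and the exact set of accepted types are assumptions chosen for illustration, and the snippet assumes `spark-sql` is on the classpath. The idea is only that atomic types with a dedicated setter are accepted, while nested types and anything unrecognized are rejected up front.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not the PR's implementation: a conservative allow-list
// of types for which a constant metadata column could be populated through a
// single type-specific setter. The chosen set is an assumption.
object MetadataTypeCheck {
  def isSupportedDataType(dataType: DataType): Boolean = dataType match {
    case BooleanType | ByteType | ShortType | IntegerType | LongType => true
    case FloatType | DoubleType => true
    case StringType | BinaryType => true
    // Structs, maps, and arrays would need recursive handling; reject them,
    // along with every type this sketch does not know about.
    case _ => false
  }
}
```

   With a check like this, a custom metadata field of type `LongType` passes, while `ArrayType(IntegerType)` is rejected before any column vector is created.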




