mzhang opened a new pull request, #55962:
URL: https://github.com/apache/spark/pull/55962
### What changes were proposed in this pull request?
Follow-up to SPARK-56844, which allowed `ArrayType` / `MapType` /
`StructType` in `FileSourceMetadataAttribute` and added the matching
branches to `ColumnVectorUtils.populate` for the columnar metadata path.
That covered file scans returning `ColumnarBatch`. For scans that produce
row-form output (text, JSON, CSV, or any reader with `Batched=false`), the
metadata row is filled via
`FileFormat.updateMetadataInternalRow` ->
`FileFormat.getFileConstantMetadataColumnValue` ->
`Literal(extractor.apply(file))`.
`Literal.apply(Any)` dispatches on the value class and has no case for
`ArrayData`, `MapData`, or `InternalRow`, so a complex constant metadata
column trips `UNSUPPORTED_FEATURE.LITERAL_TYPE` before the row is even
populated. Separately, `SchemaPruning.sortLeftFieldsByRight` recurses
through the metadata schema and prunes nested struct fields inside an
array/map/struct subfield. That is correct for data files (the reader
projects the requested columns) but wrong for constant metadata, where
each subfield's value is produced whole by a single extractor: pruning
removes fields from the schema that the extracted value still contains,
so Catalyst row ordinals no longer line up with the value.
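The first failure mode can be illustrated with a minimal Python model. This is not Spark code: `ArrayDataModel`, `LiteralModel`, `literal_from_value`, and `literal_with_type` are hypothetical stand-ins for `ArrayData`, `Literal`, `Literal.apply(Any)`, and `Literal.create(value, dataType)`, and the type names are plain strings chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ArrayDataModel:
    """Hypothetical stand-in for a catalyst-form array value."""
    elements: tuple


@dataclass(frozen=True)
class LiteralModel:
    """Hypothetical stand-in for a literal: a value plus its declared type."""
    value: Any
    data_type: str


def literal_from_value(value: Any) -> LiteralModel:
    """Infer the type from the value's runtime class only."""
    if isinstance(value, bool):  # bool before int: bool is a subclass of int
        return LiteralModel(value, "boolean")
    if isinstance(value, int):
        return LiteralModel(value, "bigint")
    if isinstance(value, str):
        return LiteralModel(value, "string")
    # No branch for catalyst-form complex values, so they are rejected here.
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


def literal_with_type(value: Any, data_type: str) -> LiteralModel:
    """Use the caller-supplied type instead of inferring it from the class."""
    return LiteralModel(value, data_type)
```

In this model, `literal_from_value(ArrayDataModel((1, 2)))` raises `TypeError`, while `literal_with_type(ArrayDataModel((1, 2)), "array<int>")` succeeds because the type no longer has to be guessed from the value.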
This PR:
- Threads the column's `DataType` through
`FileFormat.getFileConstantMetadataColumnValue` and
`updateMetadataInternalRow`. When provided, the value goes through
`Literal.create(value, dataType)`, which accepts catalyst-form values
directly. The parameter is optional, so existing call sites that pass
primitives keep working unchanged.
- Teaches `SchemaPruning.sortLeftFieldsByRight` to preserve subfield
data types when recursing inside a `FileSourceMetadataAttribute`. The
metadata attribute's top struct can still have unused sibling
sub-attributes pruned (each is a separate extractor), but anything
below that level is preserved verbatim. Non-metadata data file
pruning behavior is unchanged.
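
The preservation rule in the second bullet can be sketched with a toy schema model. This is a simplification under stated assumptions, not the actual `SchemaPruning.sortLeftFieldsByRight` logic: schemas are nested Python dicts (a dict is a struct, a string is a leaf type), and `prune`, its `in_metadata` flag, and the field names `tags`, `key`, `value`, and `source` are all hypothetical.

```python
def prune(full: dict, requested: dict, in_metadata: bool = False) -> dict:
    """Keep only the requested fields of `full`, in the order of `full`.

    Assumed metadata rule: unused sub-attributes of the top struct may be
    dropped, but every kept sub-attribute retains its complete type.
    """
    pruned = {}
    for name, full_type in full.items():
        if name not in requested:
            continue  # unused sibling: safe to drop in both modes
        if in_metadata:
            # Below the top struct, preserve the type verbatim.
            pruned[name] = full_type
        elif isinstance(full_type, dict) and isinstance(requested[name], dict):
            # Data-file behavior: recurse and prune nested struct fields.
            pruned[name] = prune(full_type, requested[name])
        else:
            pruned[name] = full_type
    return pruned


# Hypothetical metadata struct with one complex sub-attribute.
metadata_schema = {
    "file_path": "string",
    "tags": {"key": "string", "value": "string", "source": "string"},
    "file_size": "bigint",
}
requested_schema = {"tags": {"value": "string"}}

data_style = prune(metadata_schema, requested_schema)
metadata_style = prune(metadata_schema, requested_schema, in_metadata=True)
```

Here `data_style` keeps only `tags.value`, whereas `metadata_style` drops the unused siblings `file_path` and `file_size` but keeps all three fields of `tags`.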
### Why are the changes needed?
Without this, a file format that registers a constant metadata column
with a complex type (e.g. `array<struct<...>>`) can be read on the
columnar path but fails at runtime on the row path, and even on the
columnar path the schema-pruning rewriter can shift element struct
ordinals.
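
A small Python sketch of the ordinal shift, assuming a hypothetical extractor that always returns a struct element positionally in its full layout. `FULL_FIELDS`, `extract_element`, and `read_field` are illustrative names, not Spark APIs.

```python
# Layout the hypothetical extractor always produces, regardless of pruning.
FULL_FIELDS = ["key", "value", "source"]


def extract_element() -> tuple:
    """Hypothetical extractor: returns the whole struct, one slot per field."""
    return ("k1", "v1", "s1")


def read_field(row: tuple, schema_fields: list, name: str):
    """Read a field by its ordinal in the schema the reader was given."""
    return row[schema_fields.index(name)]


row = extract_element()

# Reader given the full schema: ordinal 1 holds "value".
correct = read_field(row, FULL_FIELDS, "value")

# Reader given a pruned schema: "value" is now ordinal 0, but the row was
# not pruned, so ordinal 0 still holds the "key" slot.
shifted = read_field(row, ["value"], "value")
```

With the full schema the read returns `"v1"`; with the pruned schema it returns `"k1"`, because the schema changed while the extracted value did not.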
### Does this PR introduce _any_ user-facing change?
No. No current OSS code path exposes a complex constant metadata column.
### How was this patch tested?
A new `SchemaPruningSuite` case covers the metadata-attribute preservation
rule. Existing `SchemaPruningSuite` and `FileMetadataStructSuite` tests
verify that the non-metadata and sibling-pruning behavior is unchanged.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude (Anthropic)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]