Matt Zhang created SPARK-56931:
----------------------------------
Summary: Support ArrayType/MapType/StructType constant metadata in
row materialization path
Key: SPARK-56931
URL: https://issues.apache.org/jira/browse/SPARK-56931
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Matt Zhang
Fix For: 4.2.0, 5.0.0
Follow-up to SPARK-56844, which enabled ArrayType/MapType/StructType in
FileSourceMetadataAttribute and added the populate() branches for
ConstantColumnVector. That covered the columnar metadata path (ColumnarBatch
output).
For file scans that produce row output (Batched=false: text, JSON, CSV, or any
reader that does not vectorize), the metadata row is filled via
FileFormat.updateMetadataInternalRow -> getFileConstantMetadataColumnValue ->
Literal(extractor.apply(file)).
Literal.apply(Any) dispatches on the value class and has no case for ArrayData
/ MapData / InternalRow, so a complex constant metadata column trips
UNSUPPORTED_FEATURE.LITERAL_TYPE before the row is populated.
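
To illustrate why inferring the type from the value class fails for complex values, here is a minimal, dependency-free sketch. It is a simplified stand-in, not Spark's code: SketchType, SketchLiteral, SketchArrayData, and Dispatch are hypothetical names introduced only for this example, and the typed constructor is one possible remedy, not necessarily the one the fix adopts.

```scala
// Simplified stand-in for class-based literal dispatch. Every type here is
// hypothetical and is NOT Spark's Literal, ArrayData, or DataType.
sealed trait SketchType
case object IntT extends SketchType
case object StringT extends SketchType
final case class ArrayT(element: SketchType) extends SketchType

// A literal is a value paired with its declared type.
final case class SketchLiteral(value: Any, dataType: SketchType)

// Stand-in for an internal (already converted) array representation.
final case class SketchArrayData(values: Vector[Any])

object Dispatch {
  // Infers the type from the runtime class of the value, one case per class.
  def infer(v: Any): SketchLiteral = v match {
    case i: Int    => SketchLiteral(i, IntT)
    case s: String => SketchLiteral(s, StringT)
    // No case for SketchArrayData: the element type cannot be recovered
    // from the value's class alone, so dispatch falls through to the error.
    case other =>
      throw new UnsupportedOperationException(
        s"literal type not supported: ${other.getClass.getSimpleName}")
  }

  // Assumed alternative: the caller already knows the column's declared
  // type, so it passes that type along instead of asking for inference.
  def typed(v: Any, t: SketchType): SketchLiteral = SketchLiteral(v, t)
}
```

In this model, Dispatch.infer(SketchArrayData(Vector(1, 2))) throws, while Dispatch.typed(SketchArrayData(Vector(1, 2)), ArrayT(IntT)) succeeds because the type comes from the metadata attribute rather than from the value.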
Separately, SchemaPruning.sortLeftFieldsByRight recurses through the metadata
schema and prunes nested struct fields inside an array/map/struct subfield.
That is correct for data files (the reader projects the requested columns) but
wrong for constant metadata, where each subfield is produced whole by a single
extractor: after pruning, the schema's field ordinals no longer match the
positions in the row the extractor returns.
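
The ordinal mismatch can be sketched with a small model. This is an illustration under stated assumptions, not Spark's SchemaPruning or InternalRow: Field, StructSketch, and PruningSketch are hypothetical, and the three-field struct with its values is made up for the example.

```scala
// Hypothetical model of ordinal-based struct access; not Spark's InternalRow.
final case class Field(name: String)

final case class StructSketch(fields: Vector[Field]) {
  // Position of a field within this schema, or -1 if it is absent.
  def ordinalOf(name: String): Int = fields.indexWhere(_.name == name)
}

object PruningSketch {
  // Full schema the extractor was written against.
  val full: StructSketch =
    StructSketch(Vector(Field("a"), Field("b"), Field("c")))

  // The extractor always emits every field, positioned per the full schema.
  def extract(): Vector[Any] = Vector(1, "two", 3.0)

  // Pruning keeps only the requested names, preserving full-schema order.
  def prune(schema: StructSketch, keep: Set[String]): StructSketch =
    StructSketch(schema.fields.filter(f => keep(f.name)))

  // Reads a field by looking up its ordinal in the given schema.
  def read(row: Vector[Any], schema: StructSketch, name: String): Any =
    row(schema.ordinalOf(name))
}
```

With the full schema, reading "c" from the extracted row returns 3.0. After prune(full, Set("c")), "c" sits at ordinal 0 in the pruned schema, so the same read returns 1, the value of "a". Leaving the constant metadata schema unpruned keeps the schema and the extractor's row aligned.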