Matt Zhang created SPARK-56931:
----------------------------------

             Summary: Support ArrayType/MapType/StructType constant metadata in 
row materialization path
                 Key: SPARK-56931
                 URL: https://issues.apache.org/jira/browse/SPARK-56931
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Matt Zhang
             Fix For: 4.2.0, 5.0.0


Follow-up to SPARK-56844, which enabled ArrayType/MapType/StructType in 
FileSourceMetadataAttribute and added the populate() branches for 
ConstantColumnVector. That covered the columnar metadata path (ColumnarBatch 
output).

For file scans that produce row output (Batched=false: text, JSON, CSV, or any 
reader that does not vectorize), the metadata row is instead filled through 
the call chain 
FileFormat.updateMetadataInternalRow -> getFileConstantMetadataColumnValue -> 
Literal(extractor.apply(file)).

Literal.apply(Any) dispatches on the runtime class of the value and has no 
case for ArrayData / MapData / InternalRow, so a complex constant metadata 
column fails with UNSUPPORTED_FEATURE.LITERAL_TYPE before the row is populated.
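
The failure mode can be illustrated with a toy Python model. This is not 
Spark code: the classes below are hypothetical stand-ins for the Catalyst 
value classes, and literal_with_type only shows one possible direction 
(supplying the declared type instead of inferring it from the value class). 
The issue does not say which fix is taken.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for Catalyst's internal value classes.
class ArrayData(list):
    pass


class MapData(dict):
    pass


class InternalRow(tuple):
    pass


@dataclass(frozen=True)
class Literal:
    value: object
    data_type: str


class UnsupportedLiteralType(Exception):
    pass


def literal_by_class(value):
    """Infer the type from the runtime class only; complex values have no case."""
    if isinstance(value, bool):
        return Literal(value, "boolean")
    if isinstance(value, int):
        return Literal(value, "bigint")
    if isinstance(value, str):
        return Literal(value, "string")
    # ArrayData / MapData / InternalRow fall through to here.
    raise UnsupportedLiteralType(type(value).__name__)


def literal_with_type(value, data_type):
    """Assumed alternative: the caller passes the declared type, so nothing is inferred."""
    return Literal(value, data_type)
```

In this model a string constant goes through literal_by_class, while an 
ArrayData value raises before any row could be populated; 
literal_with_type accepts both because the metadata attribute already 
carries its declared type.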

Separately, SchemaPruning.sortLeftFieldsByRight recurses through the metadata 
schema and prunes nested struct fields inside an array/map/struct subfield. 
That is correct for data files, where the reader projects only the requested 
columns, but wrong for constant metadata, where each subfield is produced 
whole by a single extractor: the pruned schema then has fewer fields than the 
value the extractor returns, so Catalyst row ordinals no longer line up with 
it.
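
A toy Python model of the ordinal mismatch follows. It is not Spark code: 
the field names, the extractor, and the prune helper are all hypothetical, 
and a struct is reduced to an ordered list of field names plus a tuple of 
values.

```python
# Hypothetical struct subfields of one constant metadata column.
FULL_SCHEMA = ["path", "size", "tags"]


def extractor(file):
    """Constant-metadata extractor: always emits the whole struct, never a projection."""
    return (file["path"], file["size"], file["tags"])


def prune(schema, requested):
    """Keep only the requested fields, preserving the original order."""
    return [name for name in schema if name in requested]


def read_field(row, schema, name):
    """Read a field by the ordinal the given schema assigns to it."""
    return row[schema.index(name)]


file = {"path": "a.json", "size": 42, "tags": ("x",)}
row = extractor(file)                      # still three values wide
pruned_schema = prune(FULL_SCHEMA, {"size"})

# With the pruned schema, "size" is ordinal 0, but the row holds "path" there.
wrong = read_field(row, pruned_schema, "size")
# With the unpruned schema, the ordinal matches what the extractor produced.
right = read_field(row, FULL_SCHEMA, "size")
```

Here the pruned schema and the extractor output disagree on width, so 
reading by ordinal returns the wrong field; leaving the constant metadata 
schema unpruned keeps the two aligned.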



