[ https://issues.apache.org/jira/browse/SPARK-55897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Willis updated SPARK-55897:
---------------------------------
    Description: 
{{ColumnarRow.get()}} and {{ColumnarBatchRow.get()}} do not handle {{UserDefinedType}}, throwing {{SparkUnsupportedOperationException("_LEGACY_ERROR_TEMP_3155")}} when a UDT field is accessed via the interpreted eval path (e.g., {{GetStructField.nullSafeEval}} on a nested struct from the vectorized Parquet reader).
{code:java}
org.apache.spark.SparkException: [INTERNAL_ERROR] Undefined error message parameter for error class:
'_LEGACY_ERROR_TEMP_3155', MessageTemplate: Datatype not supported <dataType>, Parameters: Map()
    at org.apache.spark.sql.vectorized.ColumnarRow.get(ColumnarRow.java:221)
    at org.apache.spark.sql.catalyst.expressions.GetStructField.nullSafeEval(complexTypeExtractors.scala:207){code}
*This happens when:*
 # The vectorized Parquet reader produces a {{ColumnarBatch}}
 # {{ColumnarToRowExec}} (in whole-stage codegen mode) reads top-level columns via typed accessors ({{getArray}}, {{getBinary}}, etc.), which works fine
 # For *nested* structures (e.g., {{array[0].field}}), however, the top-level {{getArray()}} returns a {{ColumnarArray}}, and indexing into it returns a {{ColumnarRow}}; these remain columnar objects, not copied to {{UnsafeRow}}
 # A downstream expression that *can't be codegenned* (like {{InferredExpression}} in PR #611) falls back to {{eval()}}, which calls {{GetStructField.nullSafeEval()}} on the {{ColumnarRow}}
 # {{GetStructField}} passes the UDT type (from the expression's schema) to {{ColumnarRow.get()}}, which doesn't handle UDT, so it crashes

*The codegen path never hits this* because {{CodeGenerator.getValue()}} (line 1683) resolves {{UserDefinedType}} to {{sqlType}} before generating code, so it generates {{getBinary()}} instead of {{get(ordinal, UDT)}}.
h3. Root Cause

{{ColumnarRow.get()}} and {{ColumnarBatchRow.get()}} dispatch on {{dataType}} via {{instanceof}} checks for all concrete Spark types but have no branch for {{UserDefinedType}}. When {{GetStructField.nullSafeEval()}} passes a UDT type to {{get()}}, it falls through to the default error branch.
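The missing branch and the usual remedy can be sketched with a minimal stand-in model. The classes below are hypothetical simplifications, not Spark's real ones; they only mirror the dispatch shape, showing how a UDT falls through the {{instanceof}} chain and how unwrapping it to its {{sqlType}} first (as the codegen path effectively does) makes the same lookup succeed.
{code:java}
// Hypothetical stand-in model -- NOT Spark's real classes.
abstract class DataType {}

class BinaryType extends DataType {}

class UserDefinedType extends DataType {
  private final DataType sqlType;
  UserDefinedType(DataType sqlType) { this.sqlType = sqlType; }
  DataType sqlType() { return sqlType; }
}

class ColumnarRowModel {
  private final byte[] binaryValue = {1, 2, 3};

  // Buggy dispatch: a UDT matches no branch and falls through to the
  // error, analogous to _LEGACY_ERROR_TEMP_3155.
  Object getBuggy(DataType dataType) {
    if (dataType instanceof BinaryType) return binaryValue;
    throw new UnsupportedOperationException("Datatype not supported " + dataType);
  }

  // Fixed dispatch: unwrap the UDT to its underlying sqlType before the
  // instanceof chain -- the same trick CodeGenerator.getValue() applies
  // on the codegen path.
  Object getFixed(DataType dataType) {
    if (dataType instanceof UserDefinedType) {
      return getFixed(((UserDefinedType) dataType).sqlType());
    }
    if (dataType instanceof BinaryType) return binaryValue;
    throw new UnsupportedOperationException("Datatype not supported " + dataType);
  }
}

public class UdtDispatchSketch {
  public static void main(String[] args) {
    ColumnarRowModel row = new ColumnarRowModel();
    DataType udt = new UserDefinedType(new BinaryType());
    try {
      row.getBuggy(udt);
    } catch (UnsupportedOperationException e) {
      System.out.println("buggy path throws: " + e.getMessage());
    }
    byte[] bytes = (byte[]) row.getFixed(udt);
    System.out.println("fixed path returns " + bytes.length + " bytes");
  }
}
{code}
A fix along these lines would add the unwrapping branch inside the real {{ColumnarRow.get()}} and {{ColumnarBatchRow.get()}} before the existing type checks.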

The codegen path is unaffected because {{CodeGenerator.getValue()}} unwraps {{udt.sqlType()}} before generating type-specific accessor calls ({{getInt}}, {{getStruct}}, etc.), bypassing {{get()}} entirely. This is why the existing SPARK-39086 tests pass: they run through whole-stage codegen.

The bug surfaces when the interpreted path is used (codegen disabled, codegen fallback, or exceeding {{spark.sql.codegen.maxFields}}).
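For anyone trying to reproduce this, one way to force the interpreted path is to turn off whole-stage codegen; a sketch (note {{spark.sql.codegen.wholeStage}} is an internal config, and any of the conditions above works just as well):
{code}
# spark-defaults.conf sketch: disable whole-stage codegen so expressions
# run through the interpreted eval() path, surfacing the missing-UDT-branch error
spark.sql.codegen.wholeStage    false
{code}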
h3. Affected Code
 * {{ColumnarRow.java:184-223}} — {{get(int ordinal, DataType dataType)}}
 * {{ColumnarBatchRow.java:179-222}} — {{get(int ordinal, DataType dataType)}}


> ColumnarRow.get() and ColumnarBatchRow.get() throw on UserDefinedType
> ---------------------------------------------------------------------
>
>                 Key: SPARK-55897
>                 URL: https://issues.apache.org/jira/browse/SPARK-55897
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.1.1
>         Environment: I don't think this is hardware-dependent but I 
> discovered this on an M3 Macbook pro.
>            Reporter: James Willis
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
