[
https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Enver Osmanov updated SPARK-34435:
----------------------------------
Description:
h5. Actual behavior:
Selecting a column using a different case after remapping fails with
ArrayIndexOutOfBoundsException.
h5. Expected behavior:
Spark shouldn't fail with ArrayIndexOutOfBoundsException.
Spark is case insensitive by default, so the select should return the selected column.
h5. Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")
val ds = Seq(user).toDS().map(identity)
ds.select("aa").show(false)
{code}
h5. Additional notes:
The test case is reproducible with Spark 3.0.1. There are no errors with Spark 2.4.7.
I believe the problem could be solved by changing the filter in the pruneDataSchema method
of the SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f =>
    dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}
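The difference between the two filters can be illustrated without Spark. The following is a minimal sketch that models fields by name only (plain strings standing in for StructField.name); the object name, the helper names, and the assumption that the merged schema carries the requested name "aa" are hypothetical and not taken from the Spark code base.

```scala
// Simplified model of the two filters: field names only, no Spark dependency.
object CaseInsensitiveFilterSketch {

  // Exact-match filter, mirroring the current pruneDataSchema logic.
  def filterExact(merged: Seq[String], data: Seq[String]): Seq[String] = {
    val dataNames = data.toSet
    merged.filter(n => dataNames.contains(n))
  }

  // Lower-cased filter, mirroring the proposed change.
  def filterIgnoreCase(merged: Seq[String], data: Seq[String]): Seq[String] = {
    val dataNames = data.map(_.toLowerCase).toSet
    merged.filter(n => dataNames.contains(n.toLowerCase))
  }

  def main(args: Array[String]): Unit = {
    val dataFields = Seq("aA", "bb")   // field names of case class User
    val mergedFields = Seq("aa", "bb") // assumption: requested name differs in case

    println(filterExact(mergedFields, dataFields))      // List(bb)
    println(filterIgnoreCase(mergedFields, dataFields)) // List(aa, bb)
  }
}
```

Under these assumptions the exact-match filter silently drops the "aa" field, which would leave the pruned schema shorter than expected; the lower-cased filter keeps both fields.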
was:
h5. Actual behavior:
Selecting a column using a different case after remapping fails with
ArrayIndexOutOfBoundsException.
h5. Expected behavior:
Spark shouldn't fail with ArrayIndexOutOfBoundsException.
Spark is case insensitive by default, so the select should return the selected column.
h5. Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")
val ds = Seq(user).toDS().map(identity)
ds.select("aa").show(false)
{code}
h5. Additional notes:
The test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7.
I believe the problem could be solved by changing the filter in the pruneDataSchema method
of the SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f =>
    dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}
> ArrayIndexOutOfBoundsException when select in different case
> ------------------------------------------------------------
>
> Key: SPARK-34435
> URL: https://issues.apache.org/jira/browse/SPARK-34435
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, SQL
> Affects Versions: 3.0.1
> Reporter: Enver Osmanov
> Priority: Trivial
>