[ https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856592#comment-16856592 ]

Andrey Zinovyev commented on SPARK-27913:
-----------------------------------------

A simple way to reproduce it:


{code:sql}
-- write ORC data with a single-field struct
create external table test_broken_orc(a struct<f1:int>) stored as orc;
insert into table test_broken_orc select named_struct("f1", 1);
-- recreate the external table over the same files with an extra nested field
drop table test_broken_orc;
create external table test_broken_orc(a struct<f1:int, f2: int>) stored as orc;
-- fails in the native ORC reader
select * from test_broken_orc;
{code}

The last statement fails with the following exception:

{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
        at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
        at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
        at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
        at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
        at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
        at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
        at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
{noformat}

You can also trigger this by removing a column or adding a column in the middle of the struct field. As far as I understand the current implementation, it supports by-name field resolution only at the top level of the ORC structure. Anything nested deeper is resolved by index and is expected to match the reader schema exactly.
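To make that concrete, here is a hypothetical top-level variant of the repro (the table name test_toplevel_orc and the expected NULL result are my reading of the behaviour described above, not a verified run): the added top-level column should be resolved by name, whereas the nested change in the original repro is resolved by index and fails.

{code:sql}
-- hypothetical variant: the schema change is at the top level, not inside the struct
create external table test_toplevel_orc(a int) stored as orc;
insert into table test_toplevel_orc select 1;
drop table test_toplevel_orc;
-- recreate over the same files with an extra top-level column
create external table test_toplevel_orc(a int, b int) stored as orc;
-- expected, per the by-name resolution described above: (1, NULL) rather than an exception
select * from test_toplevel_orc;
{code}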


> Spark SQL's native ORC reader implements its own schema evolution
> -----------------------------------------------------------------
>
>                 Key: SPARK-27913
>                 URL: https://issues.apache.org/jira/browse/SPARK-27913
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.3
>            Reporter: Owen O'Malley
>            Priority: Major
>
> ORC's reader handles a wide range of schema evolution, but the Spark SQL 
> native ORC bindings do not provide the desired schema to the ORC reader. This 
> causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.
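A possible mitigation, suggested by the description's note that the regression comes from moving spark.sql.orc.impl from 'hive' to 'native' (a sketch only, not verified against the repro above):

{code:sql}
-- fall back to the Hive-based ORC reader for this session
SET spark.sql.orc.impl=hive;
select * from test_broken_orc;
{code}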


