[ https://issues.apache.org/jira/browse/SPARK-19430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Agarwal updated SPARK-19430:
-----------------------------------
    Description: 
Spark throws an exception when reading external tables with VARCHAR columns if they are backed by ORC files written by Hive 1.2.1 (and possibly other versions of Hive).

Steps to reproduce (credits to [~lian cheng]):

# Write an ORC table using Hive 1.2.1 with
   {noformat}
CREATE TABLE orc_varchar_test STORED AS ORC
AS SELECT CAST('a' AS VARCHAR(10)) AS c0{noformat}
# Get the raw path of the written ORC file
# Create an external table pointing to this file and read the table using Spark
  {noformat}
val path = "/tmp/orc_varchar_test"
sql(s"create external table if not exists test (c0 varchar(10)) stored as orc 
location '$path'")
spark.table("test").show(){noformat}

The problem here is that the metadata in ORC files written by Hive differs from that in files written by Spark. We can inspect the ORC file written above:
{noformat}
$ hive --orcfiledump file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
Structure for file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
File Version: 0.12 with HIVE_8732
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:varchar(10)>       <----
...
{noformat}
On the other hand, if you create an ORC table from Spark using the same DDL and inspect the written ORC file, you'll see:
{noformat}
...
Type: struct<c0:string>
...
{noformat}
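The same schemas can be dumped programmatically. Below is a sketch that assumes the standalone ORC reader API ({{org.apache.orc}}) is on the classpath; the file name under the repro path is hypothetical:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Print the schema embedded in an ORC file. For the Hive-written file this
// should show struct<_col0:varchar(10)>; for the Spark-written one,
// struct<c0:string>, matching the orcfiledump output above.
val reader = OrcFile.createReader(
  new Path("/tmp/orc_varchar_test/000000_0"),
  OrcFile.readerOptions(new Configuration()))
println(reader.getSchema)
{code}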

Note that all tests were done with {{spark.sql.hive.convertMetastoreOrc}} set to {{false}}, which is the default.
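For reference, the conversion can be switched on from the same session to route the read through Spark's native ORC code path rather than the Hive SerDe; whether that avoids the error for the Hive-written file is not verified here:
{code}
// Sketch (unverified): enable the metastore ORC conversion so that
// spark.table("test") goes through Spark's native ORC reader rather than the
// Hive SerDe path used with the default setting of false.
sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.table("test").show()
{code}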

I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with instances of the 
following error:

{code}
java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
    at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}
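The cast in the top frame can be illustrated in isolation: the ORC record reader hands back a {{HiveVarcharWritable}} for the varchar(10) column, while the {{WritableStringObjectInspector}} used to unwrap it casts the value to {{Text}}. A minimal sketch of that mismatch (not the actual read path):
{code}
import org.apache.hadoop.hive.common.`type`.HiveVarchar
import org.apache.hadoop.hive.serde2.io.HiveVarcharWritable
import org.apache.hadoop.io.Text

// The value the ORC reader produces for the varchar(10) column...
val value: AnyRef = new HiveVarcharWritable(new HiveVarchar("a", 10))
// ...fails when a string ObjectInspector casts it to Text:
val text = value.asInstanceOf[Text] // throws java.lang.ClassCastException
{code}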

> Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19430
>                 URL: https://issues.apache.org/jira/browse/SPARK-19430
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.3, 2.0.2, 2.1.0
>            Reporter: Sameer Agarwal
>



