Sameer Agarwal created SPARK-19430:
--------------------------------------
Summary: Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1
Key: SPARK-19430
URL: https://issues.apache.org/jira/browse/SPARK-19430
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0, 2.0.2, 1.6.3
Reporter: Sameer Agarwal
Spark throws an exception when trying to read external tables with VARCHAR
columns if they're backed by ORC files that were written by Hive 1.2.1.
Steps to reproduce (credits to [~lian cheng]):
# Write an ORC table using Hive 1.2.1 with
{noformat}
CREATE TABLE orc_varchar_test STORED AS ORC
AS SELECT CAST('a' AS VARCHAR(10)) AS c0{noformat}
# Get the raw path of the written ORC file
# Create an external table pointing to this file and read the table using Spark
{noformat}
val path = "/tmp/orc_varchar_test"
sql(s"create external table if not exists test (c0 varchar(10)) stored as orc
location '$path'")
spark.table("test").show(){noformat}
The problem here is that the metadata in the ORC file written by Hive differs
from the metadata written by Spark. We can inspect the ORC file written
above:
{noformat}
$ hive --orcfiledump file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
Structure for file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
File Version: 0.12 with HIVE_8732
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:varchar(10)> <----
...
{noformat}
On the other hand, if you create an ORC table with Spark using the same DDL and
inspect the written ORC file, you'll see:
{noformat}
...
Type: struct<c0:string>
...
{noformat}
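(For completeness, a rough sketch of how such a Spark-written file can be produced for comparison; the table name and the file path in the comment are illustrative, not taken from the original run.)
{code}
// Run the same DDL through Spark SQL instead of Hive 1.2.1 (illustrative names).
spark.sql(
  """CREATE TABLE orc_varchar_spark STORED AS ORC
    |AS SELECT CAST('a' AS VARCHAR(10)) AS c0""".stripMargin)
// Then dump one of the resulting data files, e.g.
//   hive --orcfiledump <warehouse-dir>/orc_varchar_spark/part-00000-...
// which reports Type: struct<c0:string> rather than struct<_col0:varchar(10)>.
{code}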
Note that all tests were done with {{spark.sql.hive.convertMetastoreOrc}} set to
{{false}}, which is the default.
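(A small sketch for confirming the flag in a session before running the steps above, assuming a Hive-enabled SparkSession:)
{code}
// Show the current value of the flag ("false" is the default in these versions).
spark.sql("SET spark.sql.hive.convertMetastoreOrc").show(false)
// To pin it explicitly for the session:
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
{code}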
I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with instances of the
following error:
{code}
java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text
  at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
  at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}