[
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Armbrust updated SPARK-8501:
------------------------------------
Target Version/s: 1.5.0, 1.4.2 (was: 1.4.1, 1.5.0)
> ORC data source may give empty schema if an ORC file containing zero rows is
> picked for schema discovery
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0
> Environment: Hive 0.13.1
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Critical
>
> Not sure whether this should be considered a bug of the ORC version bundled
> with Hive 0.13.1: for an ORC file containing zero rows, the schema written in
> its footer contains zero fields (i.e. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file. Copy the data
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} from the Spark code
> repo to {{/tmp/kv1.txt}} (just a random simple test data file), then run the
> following lines in the Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/000000_0
> Structure for /user/hive/warehouse_hive13/bar/000000_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from
> /user/hive/warehouse_hive13/bar/000000_0 with {include: null, offset: 0,
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
> Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which keeps table schemas in a central
> metastore. But for users who read raw data files with Spark SQL 1.4.0 and no
> Hive metastore, it causes problems, because the ORC data source currently
> performs schema discovery on whichever part-file happens to come first.
> Expected behavior could be:
> # Try the files one by one until we find a part-file with a non-empty schema.
> # Throw {{AnalysisException}} if no such part-file can be found.
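For illustration, the proposed fallback could be sketched roughly as below (plain Scala, with schemas modeled as simple field lists; {{SchemaDiscovery}}, {{discoverSchema}}, and the {{IllegalArgumentException}} standing in for {{AnalysisException}} are all hypothetical names, not the actual data source code):

```scala
object SchemaDiscovery {
  // A schema is modeled as a list of (fieldName, fieldType) pairs; an empty
  // list corresponds to the struct<> footer of a zero-row ORC file.
  type Schema = Seq[(String, String)]

  // Sketch of the proposed discovery logic: scan part-file schemas in order
  // and return the first non-empty one, instead of trusting whichever
  // part-file happens to come first.
  def discoverSchema(partFileSchemas: Seq[Schema]): Schema =
    partFileSchemas
      .find(_.nonEmpty)
      .getOrElse(throw new IllegalArgumentException(
        "No ORC part-file with a non-empty schema found"))
}
```

With this sketch, {{discoverSchema(Seq(Seq.empty, Seq("key" -> "int", "value" -> "string")))}} skips the empty footer and returns the second schema, while an input containing only empty schemas fails instead of silently yielding {{struct<>}}.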
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)