[
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Armbrust updated SPARK-8501:
------------------------------------
Target Version/s: 1.5.0, 1.4.2 (was: 1.4.1, 1.5.0)
> ORC data source may give empty schema if an ORC file containing zero rows is
> picked for schema discovery
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0
> Environment: Hive 0.13.1
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Critical
>
> Not sure whether this should be considered a bug of the ORC version bundled
> with Hive 0.13.1: for an ORC file containing zero rows, the schema written in
> its footer contains zero fields (i.e. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file. Copy the data
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} from the Spark code
> repo to {{/tmp/kv1.txt}} (just a random simple test data file), then run the
> following lines in the Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/000000_0
> Structure for /user/hive/warehouse_hive13/bar/000000_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from
> /user/hive/warehouse_hive13/bar/000000_0 with {include: null, offset: 0,
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
> Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which keeps table schemas in a central
> metastore. But for users who read raw data files with Spark SQL 1.4.0 and no
> Hive metastore, it causes problems, because the ORC data source currently
> performs schema discovery on whichever part-file happens to come first.
> Expected behavior could be:
> # Try the files one by one until we find a part-file with a non-empty schema.
> # Throw {{AnalysisException}} if no such part-file can be found.
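For illustration, the proposed fallback could be sketched roughly as below (plain Scala, with schemas modeled as simple field lists; {{SchemaDiscovery}}, {{discoverSchema}}, and the {{IllegalArgumentException}} standing in for {{AnalysisException}} are all hypothetical names, not the actual data source code):

```scala
object SchemaDiscovery {
  // A schema is modeled as a list of (fieldName, fieldType) pairs; an empty
  // list corresponds to the struct<> footer of a zero-row ORC file.
  type Schema = Seq[(String, String)]

  // Sketch of the proposed discovery logic: scan part-file schemas in order
  // and return the first non-empty one, instead of trusting whichever
  // part-file happens to come first.
  def discoverSchema(partFileSchemas: Seq[Schema]): Schema =
    partFileSchemas
      .find(_.nonEmpty)
      .getOrElse(throw new IllegalArgumentException(
        "No ORC part-file with a non-empty schema found"))
}
```

With this sketch, {{discoverSchema(Seq(Seq.empty, Seq("key" -> "int", "value" -> "string")))}} skips the empty footer and returns the second schema, while an input containing only empty schemas fails instead of silently yielding {{struct<>}}.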
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)