[ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612643#comment-14612643 ]
Zhan Zhang commented on SPARK-8501:
-----------------------------------

Because in Spark, we will not create the ORC file if the record set is empty. It only happens with ORC files created by Hive, right?

> ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-8501
>                 URL: https://issues.apache.org/jira/browse/SPARK-8501
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>        Environment: Hive 0.13.1
>           Reporter: Cheng Lian
>           Assignee: Cheng Lian
>           Priority: Critical
>
> Not sure whether this should be considered a bug of the ORC writer bundled with Hive 0.13.1: for an ORC file containing zero rows, the schema written in its footer contains zero fields (e.g. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file. Copy data file {{sql/hive/src/test/resources/data/files/kv1.txt}} in the Spark code repo to {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the following lines in the Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/000000_0
> Structure for /user/hive/warehouse_hive13/bar/000000_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse_hive13/bar/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to save table schemas. But for users who read raw data files with Spark SQL 1.4.0 and without a Hive metastore, it causes problems, because currently the ORC data source just picks a random part-file, whichever comes first, for schema discovery.
> Expected behavior can be:
> # Try all files one by one until we find a part-file with a non-empty schema.
> # Throw {{AnalysisException}} if no such part-file can be found.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
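The expected behavior proposed above (try part-files in order, fail only when none carries a non-empty schema) can be sketched as follows. This is a minimal, language-agnostic illustration, not Spark's actual implementation; the `discover_schema` helper, the schema representation, and the `AnalysisException` stand-in are all hypothetical:

```python
class AnalysisException(Exception):
    """Stand-in for Spark SQL's AnalysisException (hypothetical here)."""


def discover_schema(part_file_schemas):
    """Return the first non-empty footer schema among the part-files.

    part_file_schemas maps a part-file path to the list of fields read
    from its ORC footer; an empty list models the struct<> case above.
    """
    for path, fields in part_file_schemas.items():
        if fields:  # first part-file with a non-empty schema wins
            return fields
    raise AnalysisException(
        "Failed to discover schema: all ORC part-files have empty schemas")


# Example: the first part-file was written empty by Hive (struct<>),
# so discovery falls through to the second one.
schemas = {
    "part-00000": [],  # empty ORC file, footer says struct<>
    "part-00001": [("key", "int"), ("value", "string")],
}
print(discover_schema(schemas))  # [('key', 'int'), ('value', 'string')]
```

With the current behavior (pick whichever part-file comes first, unconditionally), reading `part-00000` here would yield an empty schema even though `part-00001` has the real one.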