Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/7520#issuecomment-135392980
One thing to note is that case sensitivity in Spark SQL is configurable
([see here][1]), so I don't think we should make `StructType` completely case
insensitive (even if it stays case preserving).
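Just to make "configurable" concrete, here is a rough spark-shell sketch of
flipping that flag (assuming the `spark.sql.caseSensitive` key linked above;
the table and column names are made up):
```
// Throwaway DataFrame with a mixed-case column name.
val df = sqlContext.range(1).selectExpr("CAST(id AS INT) AS CoL")
df.registerTempTable("case_test")

// With case sensitivity off, `col`, `CoL`, and `COL` all resolve to the same column.
sqlContext.setConf("spark.sql.caseSensitive", "false")
sqlContext.sql("SELECT col FROM case_test").show()

// With it on, only the spelling recorded in the schema resolves;
// `SELECT col FROM case_test` would now fail analysis.
sqlContext.setConf("spark.sql.caseSensitive", "true")
sqlContext.sql("SELECT CoL FROM case_test").show()
```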
If I understand this issue correctly, the root problem is that our current
approach isn't case preserving when writing schema information to physical ORC
files. As @chenghao-intel pointed out, when saving a DataFrame as a Hive
metastore table using ORC, Spark SQL 1.5 now writes it in a Hive-compatible
format so that the data can be read back with Hive. This implies that the
changes made in this PR should also be compatible with Hive.
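For concreteness, the write path I'm referring to looks roughly like this in
the shell (the table name is hypothetical):
```
// Persist a DataFrame as an ORC-backed Hive metastore table; in Spark SQL 1.5
// this should produce a table layout that Hive itself can read.
val df = sqlContext.range(1).toDF("CoL")
df.write.format("orc").saveAsTable("orc_compat_test")
```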
After investigating Hive's behavior for a while, I came across some interesting
findings.
The snippets below were run against Hive 1.2.1 (with a PostgreSQL metastore)
and Spark SQL 1.5-SNAPSHOT ([revision 0eeee5c][2]). First, let's prepare a Hive
ORC table:
```
hive> CREATE TABLE orc_test STORED AS ORC AS SELECT 1 AS CoL;
...
hive> SELECT col FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> SELECT COL FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> DESC orc_test;
OK
col int
Time taken: 0.047 seconds, Fetched: 1 row(s)
```
So Hive is neither case sensitive nor case preserving. We can further confirm
this by checking the metastore table `COLUMNS_V2`:
```
metastore_hive121> SELECT * FROM "COLUMNS_V2"
+---------+-----------+---------------+-------------+---------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
|---------+-----------+---------------+-------------+---------------|
| 22 | <null> | col | int | 0 |
+---------+-----------+---------------+-------------+---------------+
```
(I cleared my local Hive warehouse, so the only column record here is the
one created above.)
Now let's read the physical ORC files directly using Spark:
```
scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").printSchema()
root
 |-- _col0: integer (nullable = true)

scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").show()
+-----+
|_col0|
+-----+
|    1|
+-----+
```
Huh? Why is it `_col0` instead of `col`? Let's inspect the physical ORC file
written by Hive:
```
$ hive --orcfiledump /user/hive/warehouse_hive121/orc_test/000000_0
Structure for /user/hive/warehouse_hive121/orc_test/000000_0
File Version: 0.12 with HIVE_8732
15/08/27 19:07:15 INFO orc.ReaderImpl: Reading ORC rows from
/user/hive/warehouse_hive121/orc_test/000000_0 with {include: null, offset: 0,
length: 9223372036854775807}
15/08/27 19:07:15 INFO orc.RecordReaderFactory: Schema is not specified on
read. Using file schema.
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int> <---- !!!
...
```
Surprise! So, when writing ORC files, *Hive doesn't even preserve the
column names*.
Conclusions:
1. Making `StructType` completely case insensitive is unacceptable.
1. Concrete column names written into ORC files by Spark SQL don't affect
interoperability with Hive.
1. It would be good for Spark SQL to be case preserving when writing ORC
files.
And I think that is the goal this PR should aim for. A rough round-trip check
of the desired behavior is sketched below.
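Something along these lines (the HDFS path is hypothetical, and I'm using the
generic `format("orc")` writer):
```
// Write a DataFrame with a mixed-case column name to ORC, then read the
// physical files back directly.
val path = "hdfs://localhost:9000/tmp/orc_case_preserving_test"
sqlContext.range(1).selectExpr("CAST(id AS INT) AS CoL")
  .write.format("orc").save(path)

// With a case preserving writer we'd expect the original spelling back:
// root
//  |-- CoL: integer (nullable = true)
sqlContext.read.orc(path).printSchema()
```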
[1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L247-L249
[2]: https://github.com/apache/spark/commit/bb1640529725c6c38103b95af004f8bd90eeee5c