Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/7520#issuecomment-135392980
One thing to note is that case sensitivity in Spark SQL is configurable
([see here][1]), so I don't think we should make `StructType` completely case
insensitive (even if it stays case preserving).
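Just to make "configurable" concrete, here is a rough spark-shell sketch of
flipping that flag (assuming the `spark.sql.caseSensitive` key linked above;
the table and column names are made up):
```
// Throwaway DataFrame with a mixed-case column name.
val df = sqlContext.range(1).selectExpr("CAST(id AS INT) AS CoL")
df.registerTempTable("case_test")

// With case sensitivity off, `col`, `CoL`, and `COL` all resolve to the same column.
sqlContext.setConf("spark.sql.caseSensitive", "false")
sqlContext.sql("SELECT col FROM case_test").show()

// With it on, only the spelling recorded in the schema resolves;
// `SELECT col FROM case_test` would now fail analysis.
sqlContext.setConf("spark.sql.caseSensitive", "true")
sqlContext.sql("SELECT CoL FROM case_test").show()
```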
If I understand this issue correctly, the root problem is that our current
approach isn't case preserving when writing schema information to physical ORC
files. As @chenghao-intel pointed out, when saving a DataFrame as a Hive
metastore table using ORC, Spark SQL 1.5 now writes it in a Hive-compatible
format so that the data can be read back with Hive. This implies that the
changes made in this PR should also be compatible with Hive.
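For concreteness, the write path I'm referring to looks roughly like this in
the shell (the table name is hypothetical):
```
// Persist a DataFrame as an ORC-backed Hive metastore table; in Spark SQL 1.5
// this should produce a table layout that Hive itself can read.
val df = sqlContext.range(1).toDF("CoL")
df.write.format("orc").saveAsTable("orc_compat_test")
```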
After investigating Hive's behavior for a while, I came across some interesting
findings.
The snippets below were run against Hive 1.2.1 (with a PostgreSQL metastore)
and Spark SQL 1.5-SNAPSHOT ([revision 0eeee5c][2]). First, let's prepare a Hive
ORC table:
```
hive> CREATE TABLE orc_test STORED AS ORC AS SELECT 1 AS CoL;
...
hive> SELECT col FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> SELECT COL FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> DESC orc_test;
OK
col int
Time taken: 0.047 seconds, Fetched: 1 row(s)
```
So Hive is neither case sensitive nor case preserving. We can further confirm
this by checking the metastore table `COLUMNS_V2`:
```
metastore_hive121> SELECT * FROM "COLUMNS_V2"
+---------+-----------+---------------+-------------+---------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
|---------+-----------+---------------+-------------+---------------|
| 22 | <null> | col | int | 0 |
+---------+-----------+---------------+-------------+---------------+
```
(I cleared my local Hive warehouse, so the only column record here is the
one created above.)
Now let's read the physical ORC files directly using Spark:
```
scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").printSchema()
root
 |-- _col0: integer (nullable = true)

scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").show()
+-----+
|_col0|
+-----+
|    1|
+-----+
```
Huh? Why is it `_col0` instead of `col`? Let's inspect the physical ORC file
written by Hive:
```
$ hive --orcfiledump /user/hive/warehouse_hive121/orc_test/000000_0
Structure for /user/hive/warehouse_hive121/orc_test/000000_0
File Version: 0.12 with HIVE_8732
15/08/27 19:07:15 INFO orc.ReaderImpl: Reading ORC rows from
/user/hive/warehouse_hive121/orc_test/000000_0 with {include: null, offset: 0,
length: 9223372036854775807}
15/08/27 19:07:15 INFO orc.RecordReaderFactory: Schema is not specified on
read. Using file schema.
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int> <---- !!!
...
```
Surprise! So, when writing ORC files, *Hive doesn't even preserve the
column names*.
Conclusions:
1. Making `StructType` completely case insensitive is unacceptable.
1. Concrete column names written into ORC files by Spark SQL don't affect
interoperability with Hive.
1. It would be good for Spark SQL to be case preserving when writing ORC
files.
And I think that is the goal this PR should aim for. A rough round-trip check
of the desired behavior is sketched below.
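Something along these lines (the HDFS path is hypothetical, and I'm using the
generic `format("orc")` writer):
```
// Write a DataFrame with a mixed-case column name to ORC, then read the
// physical files back directly.
val path = "hdfs://localhost:9000/tmp/orc_case_preserving_test"
sqlContext.range(1).selectExpr("CAST(id AS INT) AS CoL")
  .write.format("orc").save(path)

// With a case preserving writer we'd expect the original spelling back:
// root
//  |-- CoL: integer (nullable = true)
sqlContext.read.orc(path).printSchema()
```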
[1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L247-L249
[2]: https://github.com/apache/spark/commit/bb1640529725c6c38103b95af004f8bd90eeee5c