Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/23108#discussion_r236866087

--- Diff: docs/sql-migration-guide-upgrade.md ---

@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide

- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.

+ - In Spark 2.3 and earlier, `spark.sql.hive.convertMetastoreOrc` defaults to `false`, so a `CREATE EXTERNAL TABLE ... STORED AS ORC LOCATION ...` statement is handled by the Hive ORC reader. If the `LOCATION` clause names a directory, the Hive ORC reader reads the data into the table whenever the directory or a sub-directory contains matching data; if the clause contains a wildcard (`*`), the Hive ORC reader cannot read the data, because it treats the wildcard as a directory name. For example, with ORC data stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` reads the data into the table, while `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` does not. Since Spark 2.4, `spark.sql.hive.convertMetastoreOrc` defaults to `true` and Spark uses its native ORC reader. With a wildcard, the native reader reads matching data from the current directory and its sub-directories; with a directory that does not itself contain matching data, the native reader finds nothing, even if the data is in a sub-directory. For example, with ORC data stored at `/tmp/orctab1/dir1/`, `create external table tab3(...) stored as orc location '/tmp/orctab1/'` will not read the data from the sub-directory into the table. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.

--- End diff --

I've read this again. In fact, this is not new behavior for Spark users, because Apache Spark has used Parquet as its default format since 2.0 and the default behavior of `STORED AS PARQUET` works like this. To give users richer context and to avoid irrelevant confusion, we had better merge this part into the line above (line 112). For example, I'd like to update line 112 like the following.

> applied. **In addition, this makes Spark's Hive table behavior more consistent across different formats. For example, for both ORC/Parquet Hive tables, `LOCATION '/table/*'` is required instead of `LOCATION '/table/'` to create an external table reading its direct sub-directories.** Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
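The behavioral difference under discussion boils down to how the reader interprets the `LOCATION` path: the native reader lists only the path's direct children, so data in sub-directories is reached via a `*` wildcard rather than the bare directory. A minimal sketch of these listing semantics, using Python's `glob` as a stand-in for the file listing (the layout `dir1/` mirrors the `/tmp/orctab1/dir1/` example from the diff; this is an illustrative analogy, not Spark's actual listing code, and the file name `part-00000.orc` is invented):

```python
import glob
import os
import tempfile

# Recreate the example layout: the ORC data file lives in a sub-directory.
root = tempfile.mkdtemp()                       # plays the role of /tmp/orctab1
os.makedirs(os.path.join(root, "dir1"))
data_file = os.path.join(root, "dir1", "part-00000.orc")
open(data_file, "w").close()

# LOCATION '<root>/': a direct listing sees only immediate children,
# so the file inside dir1/ is missed (the tab3 case in the diff).
direct = glob.glob(os.path.join(root, "*.orc"))
assert direct == []

# LOCATION '<root>/*': the wildcard expands one level down, so files
# in the direct sub-directories become visible.
via_wildcard = glob.glob(os.path.join(root, "*", "*.orc"))
assert via_wildcard == [data_file]
```

This matches the suggested wording above: with the native reader, `LOCATION '/table/*'` is what reads data sitting in direct sub-directories, while `LOCATION '/table/'` only reads files placed directly in the table directory.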