Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158133877

    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     *    warehouse location will be used to store Hive table Data.
    +     *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     *    You don't have to explicitly give location for each table, every tables under specified schema will be
    +     *    located at location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    +     * Since we are not explicitly providing hive database location, it automatically takes default warehouse
    +     * location given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
    +     * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema
    +     * directories under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and
    +     * '/user/hive/warehouse/database_name'.
    +     */
    +
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +    /*
    +     * If Data volume is very huge, then every partitions would have many small-small files which may harm
    --- End diff --

This is more stuff that should go in docs, not comments in an example. It kind of duplicates existing documentation. Is this commentary really needed to illustrate usage of the API? That's the only goal right here.

What are small-small files? You have some inconsistent capitalization; Parquet should be capitalized but not file, bandwidth, etc.
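To illustrate the review suggestion, a condensed form of the diff above might keep only the API calls and drop the long commentary (a sketch, not the final example; it assumes the `spark` session with `enableHiveSupport()` and the Hive `records` table from the surrounding `SparkHiveExample`, and the warehouse path is the placeholder used in the diff):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumed setup, mirroring the surrounding example.
val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .enableHiveSupport()
  .getOrCreate()
import spark.sql

val hiveTableDF = sql("SELECT * FROM records")

// Save as a Hive-managed Parquet table.
hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")

// Write Parquet in the legacy layout so older Hive readers can consume it.
spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")

// Write dynamically partitioned Parquet data to an external table location
// (placeholder path from the diff).
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hiveTableDF.write
  .mode(SaveMode.Overwrite)
  .partitionBy("key")
  .parquet("/user/hive/warehouse/database_name.db/records")
```

The surrounding prose about warehouse directories, external-table DDL, and small-file hazards would then move to the SQL programming guide rather than living in the example.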