Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/20018#discussion_r158134032
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
// | 4| val_4| 4| val_4|
// | 5| val_5| 5| val_5|
// ...
- // $example off:spark_hive$
+ /*
+ * Save DataFrame to a Hive managed table in Parquet format.
+ * 1. Create a Hive database/schema, with an explicit HDFS location if you want to set one; otherwise the
+ * default warehouse location will be used to store the Hive table data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to give an explicit location for each table; every table under the schema will be
+ * stored at the location given when the schema was created.
+ * 2. Create a Hive managed table with 'Parquet' as the storage format.
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+ val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+ /*
+ * Save DataFrame to a Hive external table in a compatible Parquet format.
+ * 1. Create a Hive external table with Parquet as the storage format.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we do not explicitly provide a Hive database location, it automatically uses the default
+ * warehouse location given to 'spark.sql.warehouse.dir' when the SparkSession was created with
+ * enableHiveSupport().
+ * For example, if '/user/hive/warehouse/' is given as the Hive warehouse location, schema directories
+ * are created under '/user/hive/warehouse/', such as '/user/hive/warehouse/database_name.db' and
+ * '/user/hive/warehouse/database_name'.
+ */
+
+ // Make the Hive Parquet format compatible with the Spark Parquet format
+ spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+ // Multiple Parquet files may be created under the given directory, depending on the data volume.
+ val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+ // Turn on the flag for dynamic partitioning
+ spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+ spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+ // You can create partitions in the Hive table, so downstream queries run much faster.
+ hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+ .parquet(hiveExternalTableLocation)
+ /*
+ If the data volume is very large, every partition can end up with many small files, which may hurt
+ downstream query performance due to file I/O, bandwidth I/O, network I/O, and disk I/O.
+ To improve performance you can create a single Parquet file under each partition directory by using
+ 'repartition' on the partition key of the Hive table. When the table is partitioned, the table DDL
+ changes accordingly.
+ Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+ hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+ .partitionBy("key").parquet(hiveExternalTableLocation)
+
+ /*
+ You can also use coalesce to control the number of files under each partition; repartition does a full
+ shuffle and distributes data evenly across all partitions, while coalesce can reduce the number of files
+ to the given 'Int' argument without
--- End diff --
Sentences need some cleanup here. What do you mean by 'Int' argument? Maybe
it's best to point people to the API docs rather than incompletely repeat it.
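
For context, here is a minimal sketch of the distinction (the object name and output path below are
illustrative only, not from this PR): repartition on the partition column shuffles rows so that all rows
with the same key land in one partition, while coalesce(n) merges the existing partitions down to at
most n without a full shuffle, where n is that Int argument.

import org.apache.spark.sql.{SaveMode, SparkSession}

object CoalesceVsRepartitionSketch {
  def main(args: Array[String]): Unit = {
    // Assumes Hive support and an existing 'records' table, as in the example above.
    val spark = SparkSession.builder()
      .appName("Coalesce vs repartition sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val hiveTableDF = spark.sql("SELECT * FROM records")
    val outputPath = "/tmp/records_parquet" // illustrative path, not from the PR

    // repartition($"key"): full shuffle; rows with the same 'key' end up in the same
    // partition, so each key directory typically gets a single file.
    hiveTableDF.repartition($"key")
      .write.mode(SaveMode.Overwrite).partitionBy("key").parquet(outputPath)

    // coalesce(1): merges the existing partitions down to at most 1 without a full shuffle,
    // so each key directory again ends up with at most one file. The Int argument is just
    // the target number of partitions; Dataset.coalesce in the API docs covers the caveats.
    hiveTableDF.coalesce(1)
      .write.mode(SaveMode.Overwrite).partitionBy("key").parquet(outputPath)

    spark.stop()
  }
}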
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]