Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20018#discussion_r158133877

    --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---
    @@ -102,8 +101,63 @@ object SparkHiveExample {
         // |  4| val_4|  4| val_4|
         // |  5| val_5|  5| val_5|
         // ...
    -    // $example off:spark_hive$
    +    /*
    +     * Save DataFrame to Hive Managed table as Parquet format
    +     * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
    +     *    warehouse location will be used to store Hive table Data.
    +     *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
    +     *    You don't have to explicitly give location for each table, every tables under specified schema will be
    +     *    located at location given while creating schema.
    +     * 2. Create Hive Managed table with storage format as 'Parquet'
    +     *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
    +     */
    +    val hiveTableDF = sql("SELECT * FROM records").toDF()
    +    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
    +
    +    /*
    +     * Save DataFrame to Hive External table as compatible parquet format.
    +     * 1. Create Hive External table with storage format as parquet.
    +     *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
    +     * Since we are not explicitly providing hive database location, it automatically takes default warehouse
    +     * location given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
    +     * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema
    +     * directories under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and
    +     * '/user/hive/warehouse/database_name'.
    +     */
    +
    +    // to make Hive parquet format compatible with spark parquet format
    +    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    +    // Multiple parquet files could be created accordingly to volume of data under directory given.
    +    val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
    +    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
    +
    +    // turn on flag for Dynamic Partitioning
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    +    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    +    // You can create partitions in Hive table, so downstream queries run much faster.
    +    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
    +      .parquet(hiveExternalTableLocation)
    +    /*
    +     * If Data volume is very huge, then every partitions would have many small-small files which may harm
    --- End diff --

This is more stuff that should go in docs, not comments in an example. It kind of duplicates existing documentation. Is this commentary really needed to illustrate usage of the API? That's the only goal right here.

What are small-small files? You have some inconsistent capitalization; Parquet should be capitalized but not file, bandwidth, etc.
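To illustrate the review suggestion, a condensed form of the diff above might keep only the API calls and drop the long commentary (a sketch, not the final example; it assumes the `spark` session with `enableHiveSupport()` and the Hive `records` table from the surrounding `SparkHiveExample`, and the warehouse path is the placeholder used in the diff):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumed setup, mirroring the surrounding example.
val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .enableHiveSupport()
  .getOrCreate()
import spark.sql

val hiveTableDF = sql("SELECT * FROM records")

// Save as a Hive-managed Parquet table.
hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")

// Write Parquet in the legacy layout so older Hive readers can consume it.
spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")

// Write dynamically partitioned Parquet data to an external table location
// (placeholder path from the diff).
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hiveTableDF.write
  .mode(SaveMode.Overwrite)
  .partitionBy("key")
  .parquet("/user/hive/warehouse/database_name.db/records")
```

The surrounding prose about warehouse directories, external-table DDL, and small-file hazards would then move to the SQL programming guide rather than living in the example.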