vinothchandar commented on a change in pull request #937: [HUDI-291] Simplify 
quickstart documentation
URL: https://github.com/apache/incubator-hudi/pull/937#discussion_r331276441
 
 

 ##########
 File path: docs/quickstart.md
 ##########
 @@ -3,196 +3,186 @@ title: Quickstart
 keywords: hudi, quickstart
 tags: [quickstart]
 sidebar: mydoc_sidebar
-toc: false
+toc: true
 permalink: quickstart.html
 ---
 <br/>
-To get a quick peek at Hudi's capabilities, we have put together a [demo 
video](https://www.youtube.com/watch?v=VhNgUsxdrD0) 
-that showcases this on a docker based setup with all dependent systems running 
locally. We recommend you replicate the same setup 
-and run the demo yourself, by following steps [here](docker_demo.html). Also, 
if you are looking for ways to migrate your existing data to Hudi, 
-refer to [migration guide](migration_guide.html).
 
-If you have Hive, Hadoop, Spark installed already & prefer to do it on your 
own setup, read on.
+This guide provides a quick peek at Hudi's capabilities using spark-shell. Using the Spark datasource, this guide
+walks through code snippets that allow you to insert and update a Hudi table of the default storage type:
+[Copy on Write](https://hudi.apache.org/concepts.html#copy-on-write-storage).
+After each write operation we show how to read the data. We will also look at how to query a Hudi table incrementally.
 
-## Download Hudi
+We have put together a [demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) that showcases this on a docker-based
+setup with all dependent systems running locally. We recommend you replicate the same setup and run the demo yourself,
+by following the steps [here](docker_demo.html). Also, if you are looking for ways to migrate your existing data to Hudi,
+refer to the [migration guide](migration_guide.html).
 
-Check out [code](https://github.com/apache/incubator-hudi) and normally build 
the maven project, from command line
+For this quickstart, you need to build the Hudi spark bundle jar and provide it to spark-shell as shown below.
 
-```
-$ mvn clean install -DskipTests -DskipITs
-```
-
-Hudi works with Hive 2.3.x or higher versions. As long as Hive 2.x protocol 
can talk to Hive 1.x, you can use Hudi to 
-talk to older hive versions.
-
-For IDE, you can pull in the code into IntelliJ as a normal maven project. 
-You might want to add your spark jars folder to project dependencies under 
'Module Setttings', to be able to run from IDE.
-
-
-### Version Compatibility
+## Build Hudi spark bundle jar
 
-Hudi requires Java 8 to be installed on a *nix system. Hudi works with 
Spark-2.x versions. 
-Further, we have verified that Hudi works with the following combination of 
Hadoop/Hive/Spark.
-
-| Hadoop | Hive  | Spark | Instructions to Build Hudi |
-| ---- | ----- | ---- | ---- |
-| Apache hadoop-2.[7-8].x | Apache hive-2.3.[1-3] | spark-2.[1-3].x | Use "mvn 
clean install -DskipTests" |
-
-If your environment has other versions of hadoop/hive/spark, please try out 
Hudi 
-and let us know if there are any issues. 
-
-## Generate Sample Dataset
-
-### Environment Variables
-
-Please set the following environment variables according to your setup. We 
have given an example setup with CDH version
+Hudi requires Java 8 to be installed on a *nix system.
+Check out the [code](https://github.com/apache/incubator-hudi) and build the maven project normally from the command line:
 
 ```
-cd incubator-hudi 
-export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
-export HIVE_HOME=/var/hadoop/setup/apache-hive-1.1.0-cdh5.7.2-bin
-export HADOOP_HOME=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2
-export HADOOP_INSTALL=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2
-export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
-export SPARK_HOME=/var/hadoop/setup/spark-2.3.1-bin-hadoop2.7
-export SPARK_INSTALL=$SPARK_HOME
-export SPARK_CONF_DIR=$SPARK_HOME/conf
-export 
PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$PATH
+$ mvn clean install -DskipTests -DskipITs
+
+$ # Export the location of hudi-spark-bundle for later reference
+$ mkdir -p /var/tmp/hudi && cp packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar /var/tmp/hudi/hudi-spark-bundle.jar
+$ export HUDI_SPARK_BUNDLE_PATH=/var/tmp/hudi/hudi-spark-bundle.jar
 ```
 
-### Run HoodieJavaApp
+## Setup spark-shell
+Hudi works with Spark-2.x versions. You can follow the instructions [here](https://spark.apache.org/downloads.html) for
+setting up Spark.
 
-Run __hudi-spark/src/test/java/HoodieJavaApp.java__ class, to place a two 
commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously 
inserted 100 records) onto your DFS/local filesystem. Use the wrapper script
-to run from command-line
+From the extracted directory run spark-shell with Hudi as:
 
 ```
-cd hudi-spark
-./run_hoodie_app.sh --help
-Usage: <main class> [options]
-  Options:
-    --help, -h
-       Default: false
-    --table-name, -n
-       table name for Hudi sample table
-       Default: hoodie_rt
-    --table-path, -p
-       path for Hudi sample table
-       Default: file:///tmp/hoodie/sample-table
-    --table-type, -t
-       One of COPY_ON_WRITE or MERGE_ON_READ
-       Default: COPY_ON_WRITE
+bin/spark-shell --jars $HUDI_SPARK_BUNDLE_PATH --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 ```
 
-The class lets you choose table names, output paths and one of the storage 
types. In your own applications, be sure to include the `hudi-spark` module as 
dependency
-and follow a similar pattern to write/read datasets via the datasource. 
-
-## Query a Hudi dataset
-
-Next, we will register the sample dataset into Hive metastore and try to query 
using [Hive](#hive), [Spark](#spark) & [Presto](#presto)
+Set up the table name, base path and a data generator to generate records for this guide.
 
-### Start Hive Server locally
+```
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.DataSourceReadOptions._
 
+val tableName = "hudi_cow_table1"
+val basePath = "/tmp/hudi_cow_table1"
+val dataGen = new DataGenerator
 ```
-hdfs namenode # start name node
-hdfs datanode # start data node
 
-bin/hive --service metastore  # start metastore
-bin/hiveserver2 \
-  --hiveconf hive.root.logger=INFO,console \
-  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
-  --hiveconf hive.stats.autogather=false \
-  --hiveconf 
hive.aux.jars.path=/path/to/packaging/hudi-hive-bundle/target/hudi-hive-bundle-0.4.6-SNAPSHOT.jar
+The [DataGenerator](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java)
+can generate sample inserts and updates based on the sample trip schema [here](#sample-schema).
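As a quick illustration (an editor's sketch, not part of the patch), you can peek at a generated record in spark-shell to see those trip fields. This assumes the imports and helpers from the snippet above; a throwaway generator is used so the state of `dataGen` is untouched:

```
// Sketch only: inspect one generated record to see the sample trip schema fields.
// A separate DataGenerator instance is used so dataGen's key state is not affected.
val sampleJson = convertToStringList(new DataGenerator().generateInserts(1))
val sampleDF = spark.read.json(spark.sparkContext.parallelize(sampleJson, 1))
sampleDF.printSchema()
sampleDF.select("uuid", "partitionpath", "ts", "fare").show(false)
```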
 
-```
 
-### Run Hive Sync Tool
-Hive Sync Tool will update/create the necessary metadata(schema and 
partitions) in hive metastore. This allows for schema evolution and incremental 
addition of new partitions written to.
-It uses an incremental approach by storing the last commit time synced in the 
TBLPROPERTIES and only syncing the commits from the last sync commit time 
stored.
-Both [Spark Datasource](writing_data.html#datasource-writer) & 
[DeltaStreamer](writing_data.html#deltastreamer) have capability to do this, 
after each write.
+## Write data {#inserts}
+Generate sample records and load them into a DataFrame. Write the DataFrame into the Hudi table as below.
 
 ```
-cd hudi-hive
-./run_sync_tool.sh
-  --user hive
-  --pass hive
-  --database default
-  --jdbc-url "jdbc:hive2://localhost:10010/"
-  --base-path tmp/hoodie/sample-table/
-  --table hoodie_test
-  --partitioned-by field1,field2
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2));
+df.write.format("org.apache.hudi").options(getQuickstartWriteConfigs).option(PRECOMBINE_FIELD_OPT_KEY, "ts").option(RECORDKEY_FIELD_OPT_KEY, "uuid").option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").option(TABLE_NAME, tableName).mode(Overwrite).save(basePath);
+``` 
+You can check the data generated under `/tmp/hudi_cow_table1/<region>/<country>/<city>/`.
 
-```
-For some reason, if you want to do this by hand. Please 
-follow 
[this](https://cwiki.apache.org/confluence/display/HUDI/Registering+sample+dataset+to+Hive+via+beeline).
+**Modelling a record in Hudi:**
 
+Hudi depends on a record key (`uuid` in the [schema](#sample-schema)), a partition field (`region/country/city`) and
+combine logic (`ts` in the [schema](#sample-schema)) to handle duplicates. For more info, refer to
+[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-HowdoImodelthedatastoredinHudi?).
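To make the de-duplication behaviour concrete, here is a small sketch (editor's illustration, not part of the patch); it assumes the insert above has completed and uses the same glob-style load shown in the read section below:

```
// Sketch only: the record key keeps the table free of duplicates. If the same
// batch were written again with mode(Append), Hudi would upsert on `uuid`
// instead of adding rows, so this query should return no rows.
val snapshotDF = spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*")
snapshotDF.groupBy("uuid").count().toDF("uuid", "dups").filter("dups > 1").show()
```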
 
-### HiveQL {#hive}
 
-Let's first perform a query on the latest committed snapshot of the table
+**Ways to ingest data into Hudi:**
+- ***DeltaStreamer:*** For ingesting changes from external sources like Kafka, tailing DFS, or even other Hudi datasets.
+- ***Hudi Datasource:*** Capture data from a custom source using the Spark datasource API and write it into Hudi.
+
+For more info on using these, refer to [Writing Hudi Datasets](https://hudi.apache.org/writing_data.html).
 
+ 
+## Read data {#query}
+Load the data files into a DataFrame.
 ```
-hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
-hive> set hive.stats.autogather=false;
-hive> add jar file:///path/to/hudi-hive-bundle-0.4.6-SNAPSHOT.jar;
-hive> select count(*) from hoodie_test;
-...
-OK
-100
-Time taken: 18.05 seconds, Fetched: 1 row(s)
-hive>
+val roViewDF = spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*")
+roViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select fare, begin_lon, begin_lat, ts from hudi_ro_table where fare > 20.0").show()
+spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_ro_table").show()
 ```
+This query provides a read optimized view of the ingested data. The glob pattern passed to `load()` points at the data files.
+Since our partition path (`region/country/city`) is three levels deep under the base path, we've used `basePath + "/*/*/*/*"`.
+Refer to [Storage Types and Views](https://hudi.apache.org/concepts.html#storage-types--views) for more info on all storage types and views supported.
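If you only want a single partition rather than the whole table, you can point `load()` at that partition directly. A minimal sketch (editor's illustration, not part of the patch); the concrete `region/country/city` values are only examples of what the data generator may produce, so check the directories under the base path first:

```
// Sketch only: load one partition instead of the whole table. The partition values
// below are illustrative; list /tmp/hudi_cow_table1 to see what was actually written.
val onePartitionDF = spark.read.format("org.apache.hudi").load(basePath + "/americas/brazil/sao_paulo/*")
onePartitionDF.select("uuid", "fare", "ts").show(false)
```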
 
-### SparkSQL {#spark}
-
-Spark is super easy, once you get Hive working as above. Just spin up a Spark 
Shell as below
-
+## Update data {#updates}
+This is similar to inserting new data. Generate updates using the data generator, load them into a DataFrame and
+write the DataFrame into the Hudi table.
 ```
-$ cd $SPARK_INSTALL
-$ spark-shell --jars 
$HUDI_SRC/packaging/hudi-spark-bundle/target/hudi-spark-bundle-0.4.6-SNAPSHOT.jar
 --driver-class-path $HADOOP_CONF_DIR  --conf 
spark.sql.hive.convertMetastoreParquet=false --packages 
com.databricks:spark-avro_2.11:4.0.0
-
-scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
-scala> sqlContext.sql("show tables").show(10000)
-scala> sqlContext.sql("describe hoodie_test").show(10000)
-scala> sqlContext.sql("describe hoodie_test_rt").show(10000)
-scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)
+val updates = convertToStringList(dataGen.generateUpdates(10))
+val df = spark.read.json(spark.sparkContext.parallelize(updates, 2));
+df.write.format("org.apache.hudi").options(getQuickstartWriteConfigs).option(PRECOMBINE_FIELD_OPT_KEY, "ts").option(RECORDKEY_FIELD_OPT_KEY, "uuid").option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").option(TABLE_NAME, tableName).mode(Append).save(basePath);
 ```
+We are using the default write operation, upsert, here. For other operation types and when to use them, refer to
+[write operations](https://hudi.apache.org/writing_data.html#write-operations).
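If you prefer to be explicit rather than rely on the default, the operation can be set as a datasource option. A minimal sketch (editor's illustration, not part of the patch), assuming `OPERATION_OPT_KEY` from `DataSourceWriteOptions` (already imported above):

```
// Sketch only: name the write operation explicitly instead of relying on the default.
// Other values include "insert" and "bulk_insert".
df.write.format("org.apache.hudi").options(getQuickstartWriteConfigs).
  option(OPERATION_OPT_KEY, "upsert").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).save(basePath)
```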
 
 Review comment:
   "and how to choose them, refer to"

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
