[ https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007908#comment-17007908 ]

lamber-ken commented on HUDI-486:
---------------------------------

Let me summarize the reproduction steps.

1, check out commit e1e5fe33249bf511486073dd9cf48e5b7ea14816
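For reference, as commands (the clone URL assumes the ASF GitHub mirror; skip the clone if you already have a checkout):
{code:java}
git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
git checkout e1e5fe33249bf511486073dd9cf48e5b7ea14816
{code}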

2, build the source
{code:java}
mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true 
{code}
3, set up the Docker demo environment
{code:java}
cd docker && ./setup_demo.sh
{code}
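Once the script finishes, a quick sanity check that the demo containers are up:
{code:java}
docker ps --format '{{.Names}}'
{code}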
4, publish the sample data to Kafka
{code:java}
cat demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
{code}
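Optionally, verify the messages landed on the topic before moving on (kafkacat consumer mode; -c caps how many messages to read):
{code:java}
kafkacat -b kafkabroker -t stock_ticks -C -o beginning -c 5
{code}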
5, open a shell inside the adhoc-2 container
{code:java}
docker exec -it adhoc-2 /bin/bash 
{code}
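Step 6 references $HUDI_UTILITIES_BUNDLE; it should already be set inside the container, which you can confirm before proceeding:
{code:java}
echo $HUDI_UTILITIES_BUNDLE && ls -l $HUDI_UTILITIES_BUNDLE
{code}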
6, consume the Kafka data with DeltaStreamer and sync the table to Hive
{code:java}
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
$HUDI_UTILITIES_BUNDLE \
--storage-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path /user/hive/warehouse/stock_ticks_cow \
--target-table stock_ticks_cow \
--props /var/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

/var/hoodie/ws/hudi-hive/run_sync_tool.sh \
--jdbc-url jdbc:hive2://hiveserver:10000 \
--user hive \
--pass hive \
--partitioned-by dt \
--base-path /user/hive/warehouse/stock_ticks_cow \
--database default \
--table stock_ticks_cow
{code}
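After both commands succeed, the commit should be on the Hudi timeline. Listing the .hoodie directory (the demo stores the table on HDFS) shows a <commit time>.commit file; that commit time is what the incremental query below filters on:
{code:java}
hdfs dfs -ls /user/hive/warehouse/stock_ticks_cow/.hoodie
{code}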
7, create incr_pull.txt containing the incremental query
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}
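Step 8 reads the file from /var/hoodie/ws/docker/demo/config/incr_pull.txt, so one way to create it inside the container (any editor works just as well):
{code:java}
cat > /var/hoodie/ws/docker/demo/config/incr_pull.txt <<'EOF'
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
EOF
{code}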
8, execute org.apache.hudi.utilities.HiveIncrementalPuller
{code:java}
java -cp \
./spark/jars/commons-cli-1.2.jar:./spark/jars/htrace-core-3.1.0-incubating.jar:./spark/jars/hadoop-hdfs-2.7.3.jar:./hive/lib/hive-exec-2.3.3.jar:./hive/lib/hive-common-2.3.3.jar:./hive/lib/hive-jdbc-2.3.3.jar:./hive/lib/hive-service-2.3.3.jar:./hive/lib/hive-service-rpc-2.3.3.jar:./spark/jars/httpcore-4.4.10.jar:./spark/jars/slf4j-api-1.7.16.jar:./spark/jars/hadoop-auth-2.7.3.jar:./hive/lib/commons-lang-2.6.jar:./spark/jars/commons-configuration-1.6.jar:./spark/jars/commons-collections-3.2.2.jar:./spark/jars/hadoop-common-2.7.3.jar:./hive/lib/antlr-runtime-3.5.2.jar:./spark/jars/log4j-1.2.17.jar:./hive/lib/commons-logging-1.2.jar:./hive/lib/commons-io-2.4.jar:$HUDI_UTILITIES_BUNDLE \
org.apache.hudi.utilities.HiveIncrementalPuller \
--hiveUrl jdbc:hive2://hiveserver:10000 \
--hiveUser hive \
--hivePass hive \
--extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt \
--sourceDb default \
--sourceTable stock_ticks_cow \
--targetDb default \
--tmpdb default \
--targetTable tempTable \
--fromCommitTime 0 \
--maxCommits 1
{code}
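A note on --fromCommitTime: 0 pulls everything from the beginning. To pick a real threshold (for this flag, or for the query in incr_pull.txt), you can list the commit times present in the table; beeline is available in the demo containers:
{code:java}
beeline -u jdbc:hive2://hiveserver:10000 -n hive -p hive \
-e "select distinct \`_hoodie_commit_time\` from default.stock_ticks_cow;"
{code}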

> Improve documentation for using HiveIncrementalPuller
> -----------------------------------------------------
>
>                 Key: HUDI-486
>                 URL: https://issues.apache.org/jira/browse/HUDI-486
>             Project: Apache Hudi (incubating)
>          Issue Type: Sub-task
>          Components: Incremental Pull
>            Reporter: Pratyaksh Sharma
>            Assignee: Pratyaksh Sharma
>            Priority: Major
>
> For using HiveIncrementalPuller, one needs a lot of jars on the classpath, 
> and these jars are not listed anywhere. As a result, one has to keep adding 
> jars to the classpath for every NoClassDefFoundError that comes up during 
> execution.
> We should list the required jars so that it becomes easy for a first-time 
> user to use this tool.


