[
https://issues.apache.org/jira/browse/HUDI-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raymond Xu updated HUDI-4211:
-----------------------------
Fix Version/s: (was: 0.11.0)
> The hudi docker demo failed to execute the delta-streamer and ingest to
> stock_ticks_cow table in HDFS
> -----------------------------------------------------------------------------------------------------
>
> Key: HUDI-4211
> URL: https://issues.apache.org/jira/browse/HUDI-4211
> Project: Apache Hudi
> Issue Type: Bug
> Components: cli, meta-sync, spark
> Environment: [root@hudi hive_base]# docker images
> REPOSITORY                                                            TAG          IMAGE ID       CREATED        SIZE
> docker.io/graphiteapp/graphite-statsd                                 latest       5742c9c6f1db   2 weeks ago    850 MB
> docker.io/apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_2.4.4    latest       07880b8f5978   3 months ago   2.01 GB
> docker.io/apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkworker_2.4.4   latest       d5344418db27   3 months ago   1.59 GB
> docker.io/apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkmaster_2.4.4   latest       6903d097f47b   3 months ago   1.59 GB
> docker.io/apachehudi/hudi-hadoop_2.8.4-hive_2.3.3                     latest       678d033ee64c   3 months ago   1.29 GB
> docker.io/apachehudi/hudi-hadoop_2.8.4-history                        latest       995dc55f7fbc   3 months ago   964 MB
> docker.io/apachehudi/hudi-hadoop_2.8.4-datanode                       latest       156ea075fb0e   3 months ago   964 MB
> docker.io/apachehudi/hudi-hadoop_2.8.4-namenode                       latest       550cfdc43cc8   3 months ago   964 MB
> docker.io/apachehudi/hudi-hadoop_2.8.4-prestobase_0.271               latest       7d1a076fa27b   3 months ago   2.69 GB
> docker.io/graphiteapp/graphite-statsd                                 <none>       d49e5c8fe07a   3 months ago   847 MB
> docker.io/apachehudi/hudi-hadoop_2.8.4-trinoworker_368                latest       d4020d02727a   4 months ago   2.93 GB
> docker.io/apachehudi/hudi-hadoop_2.8.4-trinocoordinator_368           latest       9ed7e8f84f5b   4 months ago   2.93 GB
> docker.io/bde2020/hive-metastore-postgresql                           2.3.0        7ab9e8f93813   2 years ago    275 MB
> docker.io/apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkmaster_2.3.1   latest       70dc18c432a0   3 years ago    1.64 GB
> docker.io/bitnami/kafka                                               2.0.0        6ff9736c1996   3 years ago    423 MB
> docker.io/bitnami/zookeeper                                           3.4.12-r68   50b53cf5fcad   3 years ago    414 MB
> Reporter: chenyunliang
> Priority: Blocker
> Labels: docke
>
> When I execute the following command in the adhoc-2 container:
> {code:java}
> spark-submit \
>   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
>   --table-type COPY_ON_WRITE \
>   --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
>   --source-ordering-field ts \
>   --target-base-path /user/hive/warehouse/stock_ticks_cow \
>   --target-table stock_ticks_cow \
>   --props /var/demo/config/kafka-source.properties \
>   --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider {code}
> It fails with the following error:
> {code:java}
> root@adhoc-2:/opt# spark-submit \
> > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
> > --table-type COPY_ON_WRITE \
> > --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
> > --source-ordering-field ts \
> > --target-base-path /user/hive/warehouse/stock_ticks_cow \
> > --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
> > --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
> Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/opt/%C2%A0
>     at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
>     at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:221)
>     at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
>     at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:907)
>     at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:907)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
>     at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code}
> When I checked the environment variable $HUDI_UTILITIES_BUNDLE, I got this:
> {code:java}
> root@adhoc-2:/opt# echo $HUDI_UTILITIES_BUNDLE
> /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar
> {code}
> But I can't find the jar file at that path:
> {code:java}
> root@adhoc-2:/opt# ls -ltr /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar
> ls: cannot access '/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar': No such file or directory {code}
> When I search for the bundle jar instead, I find these:
> {code:java}
> root@adhoc-2:/opt# find /var/hoodie/ws -name "hudi-utilities-bundle*.0.jar" | xargs ls -ltr
> -rw-r--r-- 1 root root 60631874 Jun  8 07:41 /var/hoodie/ws/hudi-examples/hudi-examples-spark/target/lib/hudi-utilities-bundle_2.11-0.11.0.jar
> -rw-r--r-- 1 root root 60631874 Jun  8 07:41 /var/hoodie/ws/hudi-cli/target/lib/hudi-utilities-bundle_2.11-0.11.0.jar
> -rw-r--r-- 1 root root 60631874 Jun  8 07:41 /var/hoodie/ws/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0.jar
> {code}
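> So the bundle jar exists, just not at the path the variable points to. As an illustration only (not the documented fix), the variable can be repointed at one of the jars that `find` located; the path and version suffix below are copied from the output above and will differ across builds:

```shell
# Hypothetical workaround: repoint HUDI_UTILITIES_BUNDLE at a bundle jar
# that actually exists. The path is taken from the `find` output above;
# the Scala/Hudi version suffix depends on how the images were built.
BUNDLE=/var/hoodie/ws/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0.jar
if [ -f "$BUNDLE" ]; then
    export HUDI_UTILITIES_BUNDLE="$BUNDLE"
else
    # Warn but continue, so the caller can see which path was expected.
    echo "bundle jar not found at $BUNDLE" >&2
fi
```

> With the variable pointing at an existing jar, the same spark-submit command can be rerun unchanged.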
> So I modified the environment variable $HUDI_UTILITIES_BUNDLE, resubmitted the command, and it worked:
> {code:java}
> root@adhoc-2:/opt# spark-submit \
> > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
> > --table-type COPY_ON_WRITE \
> > --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
> > --source-ordering-field ts \
> > --target-base-path /user/hive/warehouse/stock_ticks_cow \
> > --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
> > --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
> 22/06/09 01:43:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 22/06/09 01:43:35 WARN SchedulerConfGenerator: Job Scheduling Configs will not be in effect as spark.scheduler.mode is not set to FAIR at instantiation time. Continuing without scheduling configs
> 22/06/09 01:43:36 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
> 22/06/09 01:43:36 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
> 22/06/09 01:43:36 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
> 22/06/09 01:43:37 WARN KafkaUtils: overriding enable.auto.commit to false for executor
> 22/06/09 01:43:37 WARN KafkaUtils: overriding auto.offset.reset to none for executor
> 22/06/09 01:43:37 ERROR KafkaUtils: group.id is null, you should probably set it
> 22/06/09 01:43:37 WARN KafkaUtils: overriding executor group.id to spark-executor-null
> 22/06/09 01:43:37 WARN KafkaUtils: overriding receive.buffer.bytes to 65536 see KAFKA-3135
> 22/06/09 01:43:38 WARN HoodieBackedTableMetadata: Metadata table was not found at path /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata
> 00:05 WARN: Timeline-server-based markers are not supported for HDFS: base path /user/hive/warehouse/stock_ticks_cow. Falling back to direct markers.
> 00:06 WARN: Timeline-server-based markers are not supported for HDFS: base path /user/hive/warehouse/stock_ticks_cow. Falling back to direct markers.
> 00:08 WARN: Timeline-server-based markers are not supported for HDFS: base path /user/hive/warehouse/stock_ticks_cow. Falling back to direct markers.
> {code}
> I could see that the data had been written to HDFS:
> {code:java}
> root@adhoc-2:/opt# hdfs dfs -ls /user/hive/warehouse/stock_ticks_cow/*/*/*/*
> Found 1 items
> drwxr-xr-x   - root supergroup        0 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/.hoodie/.aux/.bootstrap
> -rw-r--r--   1 root supergroup     8056 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/.hoodie/00000000000000.deltacommit
> -rw-r--r--   1 root supergroup     3035 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/.hoodie/00000000000000.deltacommit.inflight
> -rw-r--r--   1 root supergroup        0 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/.hoodie/00000000000000.deltacommit.requested
> -rw-r--r--   1 root supergroup     8139 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/.hoodie/20220609014338711.deltacommit
> -rw-r--r--   1 root supergroup     3035 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/.hoodie/20220609014338711.deltacommit.inflight
> -rw-r--r--   1 root supergroup        0 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/.hoodie/20220609014338711.deltacommit.requested
> -rw-r--r--   1 root supergroup      599 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/.hoodie/hoodie.properties
> -rw-r--r--   1 root supergroup      124 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/files/.files-0000_00000000000000.log.1_0-0-0
> -rw-r--r--   1 root supergroup    21951 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/files/.files-0000_00000000000000.log.1_0-10-10
> -rw-r--r--   1 root supergroup       93 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/.hoodie/metadata/files/.hoodie_partition_metadata
> -rw-r--r--   1 root supergroup       96 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/2018/08/31/.hoodie_partition_metadata
> -rw-r--r--   1 root supergroup   436884 2022-06-09 01:43 /user/hive/warehouse/stock_ticks_cow/2018/08/31/7610b058-8df2-484a-ba70-881feef7195e-0_0-36-35_20220609014338711.parquet
> {code}
> So my question is: do I need to modify $HUDI_UTILITIES_BUNDLE myself, or should the demo images already set it to a valid path?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)