xushiyan commented on code in PR #7005:
URL: https://github.com/apache/hudi/pull/7005#discussion_r1001331925


##########
packaging/bundle-validation/utilities/newSchema.avsc:
##########
@@ -0,0 +1,54 @@
+{
+    "type" : "record",
+    "name" : "test_struct",

Review Comment:
   the schema is here docker/demo/config/schema.avsc



##########
packaging/bundle-validation/ci_run.sh:
##########
@@ -59,7 +54,34 @@ elif [[ ${SPARK_PROFILE} == 'spark3.3' ]]; then
   IMAGE_TAG=spark330hive313
 fi
 
-cd packaging/bundle-validation/spark-write-hive-sync || exit 1
+# Copy bundle jars
+BUNDLE_VALIDATION_DIR=${GITHUB_WORKSPACE}/bundle-validation
+mkdir $BUNDLE_VALIDATION_DIR
+JARS_DIR=${BUNDLE_VALIDATION_DIR}/jars
+mkdir $JARS_DIR
+cp ${GITHUB_WORKSPACE}/packaging/hudi-spark-bundle/target/hudi-${SPARK_PROFILE}-bundle_${SCALA_PROFILE#'scala-'}-$HUDI_VERSION.jar $JARS_DIR/spark.jar
+cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_${SCALA_PROFILE#'scala-'}-$HUDI_VERSION.jar $JARS_DIR/utilities.jar
+cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_${SCALA_PROFILE#'scala-'}-$HUDI_VERSION.jar $JARS_DIR/utilities-slim.jar

Review Comment:
   this pattern `packaging/<bundle dir>/target/hudi-*-$HUDI_VERSION.jar` should be guaranteed to copy the intended jar, as long as we run `mvn clean package`, which cleans out the jars from the last build. That way we can avoid some of the interpolation with env vars, which is kind of error-prone.
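   A quick hypothetical sketch of what I mean (the workspace layout, bundle name, and version below are made up for illustration; the point is that after `mvn clean package` each `target/` dir holds exactly one bundle jar, so the glob is unambiguous):

```shell
#!/bin/bash
# Hypothetical demo: glob-copy a bundle jar without interpolating profile env vars.
HUDI_VERSION=0.13.0-SNAPSHOT                     # made-up version
WORKSPACE=$(mktemp -d)                           # stands in for $GITHUB_WORKSPACE
mkdir -p "$WORKSPACE/packaging/hudi-spark-bundle/target" "$WORKSPACE/jars"
# pretend mvn clean package left exactly one bundle jar in target/
touch "$WORKSPACE/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-$HUDI_VERSION.jar"
# one glob per bundle dir -- no SPARK_PROFILE/SCALA_PROFILE interpolation needed
cp "$WORKSPACE"/packaging/hudi-spark-bundle/target/hudi-*-"$HUDI_VERSION".jar "$WORKSPACE/jars/spark.jar"
```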



##########
packaging/bundle-validation/validate.sh:
##########
@@ -0,0 +1,118 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# NOTE: this script runs inside hudi-ci-bundle-validation container
+# $WORKDIR/jars/ is supposed to be mounted to a host directory where bundle jars are placed
+# TODO: $JAR_COMBINATIONS should have different orders for different jars to detect class loading issues
+
+WORKDIR=/opt/bundle-validation
+HIVE_DATA=${WORKDIR}/data/hive
+JAR_DATA=${WORKDIR}/data/jars
+UTILITIES_DATA=${WORKDIR}/data/utilities
+
+test_spark_bundle () {
+    echo "::warning::validate.sh setting up hive sync"
+    # put config files in correct place
+    cp $HIVE_DATA/spark-defaults.conf $SPARK_HOME/conf/
+    cp $HIVE_DATA/hive-site.xml $HIVE_HOME/conf/
+    ln -sf $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/hive-site.xml
+    cp $DERBY_HOME/lib/derbyclient.jar $SPARK_HOME/jars/
+
+    $DERBY_HOME/bin/startNetworkServer -h 0.0.0.0 &
+    $HIVE_HOME/bin/hiveserver2 &
+    echo "::warning::validate.sh hive setup complete. Testing"
+    $SPARK_HOME/bin/spark-shell --jars $JAR_DATA/spark.jar < $HIVE_DATA/validate.scala
+    if [ "$?" -ne 0 ]; then
+        echo "::error::validate.sh failed hive testing"
+        exit 1
+    fi
+    echo "::warning::validate.sh hive testing successful. Cleaning up hive sync"
+    # remove config files
+    rm -f $SPARK_HOME/jars/derbyclient.jar
+    unlink $SPARK_HOME/conf/hive-site.xml
+    rm -f $HIVE_HOME/conf/hive-site.xml
+    rm -f $SPARK_HOME/conf/spark-defaults.conf
+}
+
+test_utilities_bundle () {
+    OPT_JARS=""
+    if [[ -n $ADDITIONAL_JARS ]]; then
+        OPT_JARS="--jars $ADDITIONAL_JARS"
+    fi
+    echo "::warning::validate.sh running deltastreamer"
+    $SPARK_HOME/bin/spark-submit --driver-memory 8g --executor-memory 8g \
+    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+    $OPT_JARS $MAIN_JAR \
+    --props $UTILITIES_DATA/newProps.props \
+    --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+    --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
+    --source-ordering-field ts --table-type MERGE_ON_READ \
+    --target-base-path file://${OUTPUT_DIR} \
+    --target-table utilities_tbl  --op UPSERT
+    echo "::warning::validate.sh done with deltastreamer"
+
+    OUTPUT_SIZE=$(du -s ${OUTPUT_DIR} | awk '{print $1}')
+    if [[ -z $OUTPUT_SIZE || "$OUTPUT_SIZE" -lt "550" ]]; then
+        echo "::error::validate.sh deltastreamer output folder ($OUTPUT_SIZE) is smaller than expected (550)"
+        exit 1
+    fi
+
+    echo "::warning::validate.sh validating deltastreamer in spark shell"
+    SHELL_COMMAND="$SPARK_HOME/bin/spark-shell --jars $ADDITIONAL_JARS $MAIN_JAR $SHELL_ARGS -i $COMMANDS_FILE"
+    echo "::debug::this is the shell command: $SHELL_COMMAND"
+    LOGFILE="$WORKDIR/submit.log"
+    $SHELL_COMMAND >> $LOGFILE
+    if [ "$?" -ne 0 ]; then
+        SHELL_RESULT=$(cat $LOGFILE | grep "Counts don't match")
+        echo "::error::validate.sh $SHELL_RESULT"
+        exit 1
+    fi
+    echo "::warning::validate.sh done validating deltastreamer in spark shell"
+}
+
+
+# test_spark_bundle
+# if [ "$?" -ne 0 ]; then
+#     exit 1
+# fi

Review Comment:
   clean up?



##########
packaging/bundle-validation/ci_run.sh:
##########
@@ -59,7 +54,34 @@ elif [[ ${SPARK_PROFILE} == 'spark3.3' ]]; then
   IMAGE_TAG=spark330hive313
 fi
 
-cd packaging/bundle-validation/spark-write-hive-sync || exit 1
+# Copy bundle jars
+BUNDLE_VALIDATION_DIR=${GITHUB_WORKSPACE}/bundle-validation
+mkdir $BUNDLE_VALIDATION_DIR
+JARS_DIR=${BUNDLE_VALIDATION_DIR}/jars
+mkdir $JARS_DIR
+cp ${GITHUB_WORKSPACE}/packaging/hudi-spark-bundle/target/hudi-${SPARK_PROFILE}-bundle_${SCALA_PROFILE#'scala-'}-$HUDI_VERSION.jar $JARS_DIR/spark.jar
+cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_${SCALA_PROFILE#'scala-'}-$HUDI_VERSION.jar $JARS_DIR/utilities.jar
+cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_${SCALA_PROFILE#'scala-'}-$HUDI_VERSION.jar $JARS_DIR/utilities-slim.jar

Review Comment:
   it's helpful to keep the original jar artifact name while copying, so we can `ls` the dir and verify which versions are actually used; it's also helpful when troubleshooting. You can make a symlink $JARS_DIR/spark.jar -> $JARS_DIR/<full bundle name>.jar and use the fixed symlink paths for the application later.
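   Something along these lines (the bundle name below is made up; in the real script it would come from the actual artifact copied out of `target/`):

```shell
#!/bin/bash
# Hypothetical demo: keep the versioned artifact name, expose a fixed-name symlink.
JARS_DIR=$(mktemp -d)
BUNDLE=hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar   # made-up bundle name
touch "$JARS_DIR/$BUNDLE"                              # stands in for the cp from target/
# applications use the stable path; `ls $JARS_DIR` still shows the real version
ln -sf "$JARS_DIR/$BUNDLE" "$JARS_DIR/spark.jar"
```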



##########
packaging/bundle-validation/utilities/newProps.props:
##########
@@ -0,0 +1,11 @@
+hoodie.datasource.write.recordkey.field=key
+hoodie.datasource.write.partitionpath.field=date
+hoodie.datasource.write.precombine.field=ts
+hoodie.metadata.enable=true
+hoodie.upsert.shuffle.parallelism=8
+hoodie.insert.shuffle.parallelism=8
+hoodie.delete.shuffle.parallelism=8
+hoodie.bulkinsert.shuffle.parallelism=8
+hoodie.deltastreamer.source.dfs.root=/opt/bundle-validation/data/utilities/data
+hoodie.deltastreamer.schemaprovider.target.schema.file=file:/opt/bundle-validation/data/utilities/newSchema.avsc
+hoodie.deltastreamer.schemaprovider.source.schema.file=file:/opt/bundle-validation/data/utilities/newSchema.avsc

Review Comment:
   not sure what "newSchema" and "newProps" imply.. better to make the names meaningful, e.g. stocks_schema.avsc and <name for this job>.properties? btw `.properties` should be the right suffix



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to