jonvex commented on issue #7494: URL: https://github.com/apache/hudi/issues/7494#issuecomment-1370067142
Here are the steps that I tried:

1. Download [spark-3.3.1-bin-hadoop3.tgz](https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz)
2. Set the environment variables
   ```
   export SPARK_HOME=/Users/jon/Documents/spark-3.3.1-bin-hadoop3
   export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
   export PYSPARK_SUBMIT_ARGS="--master local[*]"
   export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
   export PYTHONPATH=$SPARK_HOME/python/lib/*.zip:$PYTHONPATH
   export PYSPARK_PYTHON=$(which python3)
   ```
3. Ran the command to start the shell
   ```
   $SPARK_HOME/bin/pyspark \
   --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.2 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   ```
4. Then in pyspark I ran the following
   ```python
   tableName = "hudi_trips_cow"
   basePath = "file:///tmp/hudi_trips_cow"
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
   df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   hudi_options = {
       'hoodie.table.name': tableName,
       'hoodie.datasource.write.recordkey.field': 'uuid',
       'hoodie.datasource.write.partitionpath.field': 'partitionpath',
       'hoodie.datasource.write.table.name': tableName,
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': 'ts',
       'hoodie.upsert.shuffle.parallelism': 2,
       'hoodie.insert.shuffle.parallelism': 2
   }
   df.write.format("hudi"). \
       options(**hudi_options). \
       mode("overwrite"). \
       save(basePath)
   ```

Following those steps, I was unable to reproduce the issue described.
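One side note on the environment setup in step 2 (an observation of mine, not part of the reported behavior): the shell does not perform pathname expansion inside a variable assignment, so the `*.zip` pattern is stored in `PYTHONPATH` literally rather than being expanded to the py4j zip. This should be harmless when launching through `bin/pyspark`, since the launcher configures the Python path itself, but it can matter if you `import pyspark` from a plain `python3` session. A minimal demonstration, using a hypothetical path:

```shell
# Globs are not expanded in a plain variable assignment, so the
# pattern ends up in the variable verbatim.
PYTHONPATH=/tmp/no-such-dir/*.zip
echo "$PYTHONPATH"
# prints: /tmp/no-such-dir/*.zip

# If you need the real zip on the path, expand the glob outside the
# assignment, e.g. (hypothetical, assumes $SPARK_HOME is set):
# export PYTHONPATH="$(echo "$SPARK_HOME"/python/lib/py4j-*.zip):$PYTHONPATH"
```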
