jonvex commented on issue #7494:
URL: https://github.com/apache/hudi/issues/7494#issuecomment-1370067142

   Here are the steps that I tried:
   
   1. Download [spark-3.3.1-bin-hadoop3.tgz](https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz)
   2. Set the environment variables 
   ```
   export SPARK_HOME=/Users/jon/Documents/spark-3.3.1-bin-hadoop3
   export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
   export PYSPARK_SUBMIT_ARGS="--master local[*]"
   export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
   export PYTHONPATH=$SPARK_HOME/python/lib/*.zip:$PYTHONPATH
   export PYSPARK_PYTHON=$(which python3)
   ```
   3. Ran the command to start pyspark
   ```
   $SPARK_HOME/bin/pyspark \
   --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.2 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   ```
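   As an aside, the same settings could also be applied programmatically instead of through launcher flags. This is a sketch, assuming pyspark 3.3.x is installed as a Python package (the app name `hudi-quickstart` is an arbitrary choice); note that `getOrCreate()` will trigger the Hudi package download on first run:
   ```
   from pyspark.sql import SparkSession

   # Sketch: mirrors the pyspark launch flags above.
   spark = (
       SparkSession.builder
       .appName("hudi-quickstart")
       .config("spark.jars.packages",
               "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.2")
       .config("spark.serializer",
               "org.apache.spark.serializer.KryoSerializer")
       .config("spark.sql.catalog.spark_catalog",
               "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
       .config("spark.sql.extensions",
               "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
       .getOrCreate()
   )
   ```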
   4. Then, in pyspark, I ran the following
   ```
   tableName = "hudi_trips_cow"
   basePath = "file:///tmp/hudi_trips_cow"
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
   df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   
   hudi_options = {
       'hoodie.table.name': tableName,
       'hoodie.datasource.write.recordkey.field': 'uuid',
       'hoodie.datasource.write.partitionpath.field': 'partitionpath',
       'hoodie.datasource.write.table.name': tableName,
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': 'ts',
       'hoodie.upsert.shuffle.parallelism': 2,
       'hoodie.insert.shuffle.parallelism': 2
   }
   
   df.write.format("hudi"). \
       options(**hudi_options). \
       mode("overwrite"). \
       save(basePath)
   ```
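   As a follow-up check in the same pyspark session, the table can be read back with a snapshot query to confirm the write landed. This is a sketch along the lines of the Hudi quickstart read step; `spark` and `basePath` come from the session above, and the selected columns are the ones the quickstart data generator produces:
   ```
   # Snapshot-read the table that was just written.
   tripsSnapshotDF = spark.read.format("hudi").load(basePath)
   tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
   spark.sql(
       "select uuid, partitionpath, rider, driver, fare "
       "from hudi_trips_snapshot"
   ).show()
   ```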
   Following those steps I was unable to reproduce the issue presented.

