lewisrobbins opened a new issue, #6171:
URL: https://github.com/apache/hudi/issues/6171

   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create an EMR Cluster with EMR release 6.6.0
   2. Create a simple Spark/Hudi job (script below)
   3. Submit to Spark with `spark-submit hudi.py --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar` (see the note just after these steps)
   4. Observe the stack trace error shown below
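
   Note that, as written, `spark-submit` treats everything after `hudi.py` as arguments to the script itself, so the `--conf` and `--jars` flags may never reach Spark at all. For reference, the conventional ordering puts options before the application file (same flags and paths as above):

   ```
   spark-submit \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
     --conf "spark.sql.hive.convertMetastoreParquet=false" \
     --jars /usr/lib/hudi/hudi-spark-bundle.jar \
     hudi.py
   ```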
   
   
   ```python
   import sys
   from pyspark.sql import SparkSession
   
   
   spark = (SparkSession
        .builder
        .appName("PythonMnMCount")
        .getOrCreate())
   
   # Create a DataFrame
   inputDF = spark.createDataFrame(
       [
           ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
           ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
           ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
           ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
           ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
           ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
       ],
       ["id", "creation_date", "last_update_time"]
   )
   
   # Specify common DataSourceWriteOptions in the single hudiOptions variable
   hudiOptions = {
       'hoodie.table.name': 'my_hudi_table',
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.partitionpath.field': 'creation_date',
       'hoodie.datasource.write.precombine.field': 'last_update_time',
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.table': 'my_hudi_table',
       'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
   }
   
   # Write a DataFrame as a Hudi dataset
   inputDF.write \
       .format('org.apache.hudi') \
       .option('hoodie.datasource.write.operation', 'insert') \
       .options(**hudiOptions) \
       .mode('overwrite') \
       .save('s3://my-bucket/myhudidataset/')
   ```
   
   I have also tried saving with `inputDF.write.format('hudi')`, but I get the same error.
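
   For completeness, that alternate attempt looked like this (same `hudiOptions` dict as above; sketched from memory):

   ```python
   inputDF.write \
       .format('hudi') \
       .option('hoodie.datasource.write.operation', 'insert') \
       .options(**hudiOptions) \
       .mode('overwrite') \
       .save('s3://my-bucket/myhudidataset/')
   ```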
   
   **Expected behavior**
   
   Spark saves the data as a Hudi dataset in S3.
   
   **Environment Description**
   
   * Hudi version : 0.10.1-amzn-0
   
   * Spark version : 3.2.0
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) :
   
   
   
   **Stacktrace**
   
   ```
   Traceback (most recent call last):
     File "/home/hadoop/hudi.py", line 42, in <module>
       .save('s3://us-flights-data/myhudidataset/')
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 740, in save
     File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
     File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o88.save.
   : java.lang.ClassNotFoundException: Failed to find data source: org.apache.hudi. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:443)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:670)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
        at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
        at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:656)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:656)
        at scala.util.Failure.orElse(Try.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:656)
        ... 16 more
   ```
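
   The `Caused by` line shows the driver JVM never loaded `org.apache.hudi.DefaultSource`, which suggests the Hudi bundle jar was not registered with the application. As a rough check (this relies on the non-public `_jsc` handle, so treat it as a debugging sketch only), the jars Spark actually picked up can be listed from the same session:

   ```python
   # Sketch: list the jars registered with this Spark application.
   # spark.sparkContext._jsc is the underlying JavaSparkContext (non-public API);
   # SparkContext.listJars() returns jars added via --jars / sc.addJar.
   # If /usr/lib/hudi/hudi-spark-bundle.jar is missing here, the
   # ClassNotFoundException above follows.
   print(spark.sparkContext._jsc.sc().listJars())
   ```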
   
   

