WaterKnight1998 opened a new issue #1776:
URL: https://github.com/apache/hudi/issues/1776


   
   
   **To Reproduce**
   
   I was trying to use PySpark with Hudi to create a table in Google Cloud 
   Storage. Earlier code in the same job reads data from Google Cloud Storage 
   without issues, so the GCS connector is not the problem.
   
   I can see that Hudi writes some files to Google Cloud Storage, but the code 
   below produces the error shown after it.
   
   The code was as follows:
   ```
   tableName = "forecasts"
   basePath = "gs://hudi-datalake/" + tableName
   
   hudi_options = {
     'hoodie.table.name': tableName,
     'hoodie.datasource.write.recordkey.field': 'uuid',
     'hoodie.datasource.write.partitionpath.field': 'partitionpath',
     'hoodie.datasource.write.table.name': tableName,
     'hoodie.datasource.write.operation': 'insert',
     'hoodie.datasource.write.precombine.field': 'ts',
     'hoodie.upsert.shuffle.parallelism': 2, 
     'hoodie.insert.shuffle.parallelism': 2
   }
   
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
   df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   
   df.write.format("hudi"). \
     options(**hudi_options). \
     mode("overwrite"). \
     save(basePath)
   ```
   
   However, this code produces the following error:
   ```
   Py4JJavaError: An error occurred while calling o346.save.
   : java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
        at io.javalin.core.util.JettyServerUtil.defaultSessionHandler(JettyServerUtil.kt:50)
        at io.javalin.Javalin.<init>(Javalin.java:94)
        at io.javalin.Javalin.create(Javalin.java:107)
        at org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:102)
        at org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:74)
        at org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:102)
        at org.apache.hudi.client.AbstractHoodieClient.<init>(AbstractHoodieClient.java:69)
        at org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:83)
        at org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:137)
        at org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:124)
        at org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:120)
        at org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:195)
        at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:135)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
   
   (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError('An error occurred while calling o346.save.\n', JavaObject id=o347), <traceback object at 0x7f1a1ce00b48>)
   ```
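   
   The `NoSuchMethodError` on `SessionHandler.setHttpOnly(Z)V` looks like a Jetty version conflict on the classpath: Javalin (used by Hudi's embedded timeline server) is compiled against a newer Jetty 9.4 API than the Jetty version that Spark/Hadoop 3.2 puts first. As a possible workaround (a sketch, not a confirmed fix), the embedded timeline server can be switched off with the Hudi config key `hoodie.embed.timeline.server`, so the Javalin/Jetty endpoint is never started; everything else in the options dict below is unchanged from the reproduction code above:
   
   ```python
   tableName = "forecasts"
   
   # Same options as above, plus one key that disables the embedded
   # timeline server (assumption: the write only fails while starting
   # that server's Jetty endpoint, so skipping it avoids the conflict).
   hudi_options = {
       'hoodie.table.name': tableName,
       'hoodie.datasource.write.recordkey.field': 'uuid',
       'hoodie.datasource.write.partitionpath.field': 'partitionpath',
       'hoodie.datasource.write.table.name': tableName,
       'hoodie.datasource.write.operation': 'insert',
       'hoodie.datasource.write.precombine.field': 'ts',
       'hoodie.upsert.shuffle.parallelism': 2,
       'hoodie.insert.shuffle.parallelism': 2,
       # Turn off the embedded timeline service (Javalin/Jetty)
       'hoodie.embed.timeline.server': 'false',
   }
   ```
   
   Alternatively, pinning a single Jetty 9.4.x version across the Spark classpath (or shading Jetty in the Hudi bundle) should address the root cause rather than the symptom.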
   
   Trying to read the data as follows:
   ```
   tableName = "forecasts"
   basePath = "gs://hudi-datalake/" + tableName
   
   tripsSnapshotDF = spark. \
     read. \
     format("hudi"). \
     load(basePath)
   # load(basePath) uses the "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
   
   tripsSnapshotDF.createOrReplaceTempView("forecasts")
   ```
   
   gives another error:
   ```
   Fail to execute line 7:   load(basePath)
   Traceback (most recent call last):
     File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
       return f(*a, **kw)
     File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   py4j.protocol.Py4JJavaError: An error occurred while calling o399.load.
   : org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:78)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:47)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
   
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/tmp/1593550879377-0/zeppelin_python.py", line 153, in <module>
       exec(code, _zcUserQueryNameSpace)
     File "<stdin>", line 7, in <module>
     File "/opt/spark/python/pyspark/sql/readwriter.py", line 166, in load
       return self._df(self._jreader.load(path))
     File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/opt/spark/python/pyspark/sql/utils.py", line 69, in deco
       raise AnalysisException(s.split(': ', 1)[1], stackTrace)
   pyspark.sql.utils.AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
   ```
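   
   The read failure may simply be a consequence of the failed write: "Unable to infer schema for Parquet" is what Spark reports when no parquet files (and no completed Hudi commits) exist under the path. Separately, the Hudi 0.5.x quickstart reads snapshot data through a glob under the base path rather than the bare base path. A sketch of how that path would be built (assuming the quickstart `DataGenerator`'s three-level partition paths, e.g. `americas/united_states/san_francisco`):
   
   ```python
   tableName = "forecasts"
   basePath = "gs://hudi-datalake/" + tableName
   
   # Glob three partition levels down to the parquet files, as in the
   # 0.5.x quickstart (assumption: data was generated by DataGenerator,
   # whose partitionpath has the form region/country/city).
   readPath = basePath + "/*/*/*/*"
   
   # tripsSnapshotDF = spark.read.format("hudi").load(readPath)
   # tripsSnapshotDF.createOrReplaceTempView("forecasts")
   ```
   
   It would be worth checking that the write succeeds first (e.g. after disabling the embedded timeline server), then retrying the read with the globbed path.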
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4.5
   
   * Hive version :
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : yes
   
   
   
   

