WilliamWhispell opened a new issue #1789:
URL: https://github.com/apache/hudi/issues/1789
**Describe the problem you faced**
I'm trying to run a Hudi write inside a Glue job. My understanding is that
Glue 1.0 uses Spark 2.4.3 and Hadoop 2.8.5.
I've added `hudi-spark-bundle_2.11-0.5.3.jar` and `spark-avro_2.11-2.4.3.jar` as
dependent jars on the Glue job.
However, the job often fails with:

```
class threw exception: java.lang.NoSuchMethodError: org.eclipse.jetty.util.thread.QueuedThreadPool.<init>(III)V
	at io.javalin.core.util.JettyServerUtil.defaultServer(JettyServerUtil.kt:43)
	at io.javalin.Javalin.<init>(Javalin.java:94)
	at io.javalin.Javalin.create(Javalin.java:107)
	at org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:102)
	at org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:74)
	at org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:102)
	at org.apache.hudi.client.AbstractHoodieClient.<init>(AbstractHoodieClient.java:69)
	at org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:83)
	at org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:137)
	at org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:124)
	at org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:120)
	at org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:195)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:135)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at GlueApp$.main(script_2020-07-03-14-45-41.scala:84)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.services.glue.util.GlueExceptionWrapper$$anonfun$1.apply$mcV$sp(GlueExceptionWrapper.scala:35)
	at com.amazonaws.
```
This makes me think I have some type of dependency issue.
Reading over the release notes
(https://hudi.apache.org/releases.html#migration-guide-for-this-release-2), the
only Spark requirement I could find was: "IMPORTANT: This version requires
your runtime spark version to be upgraded to 2.4+."
So I would expect this to work on Spark 2.4.3, but I'm not sure whether the two
jars I added are all that is needed.
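To narrow down where the conflicting Jetty class comes from, a small diagnostic can be run inside the job before the Hudi write. This is only a sketch for checking the classpath; it assumes nothing beyond the standard JVM reflection API:

```scala
// Diagnostic sketch: print which jar actually supplies QueuedThreadPool
// on the Glue classpath. A NoSuchMethodError on <init>(III)V usually means
// an older/newer Jetty than the one Javalin (used by Hudi's timeline
// service) was compiled against is winning classloading.
val clazz = Class.forName("org.eclipse.jetty.util.thread.QueuedThreadPool")
println(clazz.getProtectionDomain.getCodeSource.getLocation)
```

If the printed location is a Glue- or Spark-provided jar rather than the Hudi bundle, that would confirm a Jetty version conflict.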
Here is what my code looks like (Scala 2.11):

```scala
object GlueApp {
  def main(sysArgs: Array[String]) {
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)
    val spark: SparkSession = glueContext.getSparkSession
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME",
      "input_file", "schema_file", "target_table", "target_s3_path",
      "save_mode").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    ...
    df.write
      .format("org.apache.hudi")
      .options(hudiOptions)
      .option("hoodie.consistency.check.enabled", "true")
      .mode(saveMode)
      .save(s3SaveLocation)
  }
}
```
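Since every frame above the Spark frames in the stack trace goes through Hudi's embedded timeline server (which starts a Javalin/Jetty instance), one workaround worth trying is to disable that server so Jetty is never started. This is a sketch under the assumption that the conflict is confined to the timeline service; `hoodie.embed.timeline.server` is a standard Hudi write config, while `hudiOptions`, `saveMode`, and `s3SaveLocation` are the names from the snippet above:

```scala
// Workaround sketch: skip starting the embedded timeline server,
// so the conflicting Jetty constructor is never invoked.
val hudiOptionsNoTimeline =
  hudiOptions + ("hoodie.embed.timeline.server" -> "false")

df.write
  .format("org.apache.hudi")
  .options(hudiOptionsNoTimeline)
  .option("hoodie.consistency.check.enabled", "true")
  .mode(saveMode)
  .save(s3SaveLocation)
```

Disabling the timeline server may cost some file-listing performance on large tables, but it sidesteps the Jetty dependency entirely.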
**Environment Description**
* Hudi version : 0.5.3
* Spark version : 2.4.3
* Hive version : ?
* Hadoop version : 2.8.5
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]