worf0815 opened a new issue #5053:
URL: https://github.com/apache/hudi/issues/5053
**Describe the problem you faced**
Running pyspark on an AWS EMR 6.5.0 cluster with Hudi enabled results in an
exception when trying to access the AWS Glue catalog.
**To Reproduce**
Steps to reproduce the behavior:
1. Start the pyspark shell: `pyspark --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar`
2. Configure the session to use Hive support (`conf` is a SparkConf built beforehand; a sketch follows this list)
```
from pyspark.sql import SparkSession  # already available in the pyspark shell

spark = SparkSession.builder \
    .appName("job_name") \
    .config(conf=conf) \
    .enableHiveSupport() \
    .getOrCreate()
```
3. Connect to a database or set it as the current database:
`spark.catalog.setCurrentDatabase("mydatabase")`
4. The exception `java.lang.NoSuchMethodError: com.amazonaws.transform.JsonUnmarshallerContext.getCurrentToken()Lcom/amazonaws/thirdparty/jackson/core/JsonToken;` is thrown.
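For completeness, a minimal sketch of the `conf` object referenced in step 2. Its exact contents were not part of the original report; the two settings below simply mirror the `--conf` flags from step 1:
```
from pyspark import SparkConf

# Hypothetical SparkConf mirroring the --conf flags passed to the pyspark shell.
conf = SparkConf() \
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .set("spark.sql.hive.convertMetastoreParquet", "false")
```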
**Expected behavior**
Without specifying any of the Hudi jars or options, pyspark is able to
connect to the Glue catalog. This should also be possible with Hudi enabled.
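A `NoSuchMethodError` like this usually means two incompatible copies of the same class ended up on the classpath. As a diagnostic (an assumption on my part, not from the original report), the driver JVM can be asked which jar the conflicting class was actually loaded from:
```
# Ask the JVM where the conflicting class was loaded from; if it resolves to
# the Hudi bundle rather than the EMR-provided AWS SDK, the bundle's shaded
# classes are shadowing the Glue client's expected dependencies.
src = spark.sparkContext._jvm.java.lang.Class \
    .forName("com.amazonaws.transform.JsonUnmarshallerContext") \
    .getProtectionDomain().getCodeSource()
print(src.getLocation() if src is not None else "bootstrap classpath")
```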
**Environment Description**
* Hudi version : 0.9.0 (included in EMR 6.5.0)
* Spark version : 3.1.2
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
Used EMR 6.5.0 and started the pyspark shell as described in
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
**Stacktrace**
`>>> spark.catalog.setCurrentDatabase("customer_sdwh_src")`
```
22/03/16 13:12:14 WARN CredentialsLegacyConfigLocationProvider: Found the legacy config profiles file at [/home/hadoop/.aws/config]. Please move it to the latest default location [~/.aws/credentials].
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/catalog.py", line 53, in setCurrentDatabase
    return self._jcatalog.setCurrentDatabase(dbName)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o230.setCurrentDatabase.
: java.lang.NoSuchMethodError: com.amazonaws.transform.JsonUnmarshallerContext.getCurrentToken()Lcom/amazonaws/thirdparty/jackson/core/JsonToken;
    at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:39)
    at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:29)
    at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:118)
    at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
    at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1734)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1454)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1369)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
    at com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:10640)
    at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10607)
    at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10596)
    at com.amazonaws.services.glue.AWSGlueClient.executeGetDatabase(AWSGlueClient.java:4466)
    at com.amazonaws.services.glue.AWSGlueClient.getDatabase(AWSGlueClient.java:4435)
    at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.doesDefaultDBExist(AWSCatalogMetastoreClient.java:238)
    at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:151)
    at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:20)
    at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:507)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3746)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3726)
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3988)
    at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:251)
    at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:234)
    at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:402)
    at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:335)
    at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:315)
    at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:291)
    at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:257)
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:283)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
    at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:384)
    at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:249)
    at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:105)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
    at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:135)
    at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:125)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:44)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:51)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:98)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:98)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:266)
    at org.apache.spark.sql.internal.CatalogImpl.requireDatabaseExists(CatalogImpl.scala:44)
    at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:65)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
```
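The method descriptor in the error references `com/amazonaws/thirdparty/jackson/core/JsonToken`, i.e. a Jackson class as relocated inside the AWS SDK bundle, which suggests two differently packaged copies of the SDK are colliding. As a rough follow-up check (again an assumption, not part of the original report), the suspect jars on the driver classpath can be listed from the same pyspark session:
```
import os

# List driver classpath entries that look AWS- or Hudi-related, to spot
# duplicate or differently shaded copies of the AWS SDK.
cp = spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path")
for entry in cp.split(os.pathsep):
    if "aws" in entry.lower() or "hudi" in entry.lower():
        print(entry)
```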