risyomei opened a new issue, #4335: URL: https://github.com/apache/kyuubi/issues/4335
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

In Spark on YARN cluster mode, when a job is submitted through beeline, the Spark Driver runs inside the ApplicationMaster hosted on a NodeManager, and Kyuubi stores the driver's IP address in a ZNode in ZooKeeper (or etcd). By design, when an ApplicationMaster (and with it the Spark Driver) fails, YARN recreates it for resiliency. **The issue is that Kyuubi is not aware of the new Spark Driver, so the job fails immediately.** A typical trigger is a rolling restart of the NodeManagers: the job fails immediately, and retries fail as well.

**Procedure to reproduce this issue:**

1. Submit a long-running Spark SQL query through beeline (Spark on YARN, cluster mode).
2. In the YARN ResourceManager UI, identify the NodeManager where the Spark Driver is running.
3. Kill the Spark Driver, e.g. kill the process or restart the server. A new Spark Driver is created, but Kyuubi fails to find it.

Expected: Kyuubi becomes aware of the new Spark Driver IP, and the HA information is updated accordingly.
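As a rough sketch of one possible direction (this is not the Kyuubi implementation; `AmAttempt` and `latestEngineAddress` are hypothetical names invented here): instead of trusting the address cached in the ZNode at engine-creation time, the server could re-resolve the engine address from the most recent ApplicationMaster attempt reported by YARN (e.g. via `YarnClient#getApplicationAttempts`). The core selection logic might look like:

```scala
// Hypothetical sketch only: models re-resolving the engine address from the
// latest ApplicationMaster attempt instead of the address cached in ZooKeeper.
// In a real fix the attempt data would come from the YARN client API
// (e.g. YarnClient#getApplicationAttempts); here it is modeled in plain Scala.
final case class AmAttempt(attemptId: Int, host: String, port: Int)

object EngineAddressResolver {
  // Pick the host:port of the most recent attempt; None if the app has no attempts.
  def latestEngineAddress(attempts: Seq[AmAttempt]): Option[String] =
    attempts.sortBy(_.attemptId).lastOption.map(a => s"${a.host}:${a.port}")
}
```

After a NodeManager restart, attempt 2 supersedes attempt 1, so the resolver would return the new driver's address rather than the stale one stored in the ZNode.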
### Affects Version(s)

1.6.1

### Kyuubi Server Log Output

_No response_

### Kyuubi Engine Log Output

_No response_

### Kyuubi Server Configurations

```yaml
# kyuubi.authentication KERBEROS
kyuubi.authentication LDAP
kyuubi.authentication.ldap.base.dn ou=users,dc=mycompany,dc=com
kyuubi.authentication.ldap.url ldaps://ldapserver.mycompany.com:636
kyuubi.authentication.ldap.guidKey uid
kyuubi.kinit.principal hive/[email protected]
kyuubi.kinit.keytab /opt/kyuubi/conf/hive.keytab
kyuubi.frontend.thrift.binary.bind.port 10009
kyuubi.ha.enabled true
kyuubi.ha.addresses zookeeper1.mycompany.com,zookeeper2.mycompany.com,zookeeper3.mycompany.com
kyuubi.ha.zookeeper.client.port 2181
kyuubi.ha.zookeeper.session.timeout 600000
kyuubi.ha.namespace=kyuubi_zk_onpremise
kyuubi.session.engine.initialize.timeout=180000
```

### Kyuubi Engine Configurations

```yaml
spark.acls.enable=true
spark.submit.deployMode=cluster
spark.admin.acls=
spark.driver.extraClassPath=
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
spark.driver.memory=8g
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.initialExecutors=0
spark.dynamicAllocation.maxExecutors=100
spark.dynamicAllocation.minExecutors=0
spark.eventLog.dir=viewfs:///log/spark3
spark.eventLog.enabled=true
spark.executor.cores=4
spark.executor.extraJavaOptions=-XX:+UseNUMA
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
spark.executor.memory=20g
spark.extraListeners=
spark.hadoop.dfs.client.datanode-restart.timeout=30
spark.hadoop.yarn.timeline-service.enabled=false
spark.io.compression.lz4.blockSize=128kb
spark.master=yarn
spark.pyspark.python=/opt/spark-env/bin/python
spark.shuffle.file.buffer=1m
spark.shuffle.io.backLog=8192
spark.shuffle.io.serverThreads=128
spark.shuffle.service.enabled=true
spark.shuffle.service.name=spark3_2_shuffle
spark.shuffle.service.port=7338
spark.shuffle.unsafe.file.output.buffer=5m
spark.sql.autoBroadcastJoinThreshold=-1
spark.sql.adaptive.autoBroadcastJoinThreshold=100m
spark.sql.adaptive.advisoryPartitionSizeInBytes=256m
spark.sql.adaptive.coalescePartitions.minPartitionSize=16m
spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=512m
spark.sql.adaptive.coalescePartitions.parallelismFirst=false
spark.sql.cbo.enabled=true
spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER
spark.sql.hive.convertMetastoreOrc=true
spark.sql.hive.metastore.jars=builtin
spark.sql.orc.compression.codec=zlib
spark.sql.orc.filterPushdown=true
spark.sql.orc.impl=native
spark.sql.parquet.compression.codec=gzip
spark.sql.queryExecutionListeners=
spark.sql.sources.partitionOverwriteMode=dynamic
spark.sql.statistics.fallBackToHdfs=true
spark.sql.streaming.streamingQueryListeners=
spark.sql.warehouse.dir=/apps/hive/warehouse
spark.ui.filters=org.apache.spark.deploy.yarn.YarnProxyRedirectFilter
spark.unsafe.sorter.spill.reader.buffer.size=1m
spark.yarn.dist.files=
spark.yarn.historyServer.address=sparkhistory.mycompany.com:18080
spark.yarn.historyServer.allowTracking=true
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker
spark.files=/opt/spark/conf/atlas-application.properties
```

### Additional context

I am new to Scala, but I am willing to contribute.

### Are you willing to submit PR?

- [X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
- [ ] No. I cannot submit a PR at this time.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
