risyomei opened a new issue, #4335:
URL: https://github.com/apache/kyuubi/issues/4335

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Describe the bug
   
   In Spark on YARN cluster mode, when a job is submitted through beeline, the Spark driver runs inside the ApplicationMaster hosted on a NodeManager, and Kyuubi stores the driver's IP address in a ZNode in ZooKeeper (or in etcd).
   
   By design, when an ApplicationMaster (and with it the Spark driver) fails, YARN re-creates it for resiliency.
   
   **The issue is that Kyuubi is not aware of the new Spark driver, so the job fails immediately.**
   
   A typical trigger is a rolling restart of the NodeManagers: the job fails immediately, and retries fail as well.
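   For context, the stale registration can be observed directly in ZooKeeper. This is a hedged sketch: it assumes the `kyuubi_zk_onpremise` namespace from the server configuration in this report, and an engine child node named like `serviceUri=<host>:<port>;version=...;sequence=...` (the exact layout may differ by Kyuubi version). The cluster commands are shown as comments; only the parsing step at the end runs standalone.

   ```shell
   # Inspect the engine registration in ZooKeeper (illustrative paths):
   #   zkCli.sh -server zookeeper1.mycompany.com:2181
   #   ls /kyuubi_zk_onpremise                 # list engine spaces under the HA namespace
   #   ls /kyuubi_zk_onpremise/<engine-space>  # child node names carry the driver address

   # A child node name encodes the (now stale) driver endpoint, e.g.:
   node='serviceUri=nodemanager42.mycompany.com:40123;version=1.6.1;sequence=0000000000'

   # Extract the host:port Kyuubi cached -- after an AM restart this no longer
   # points at the live driver:
   host_port=$(printf '%s' "$node" | sed -n 's/^serviceUri=\([^;]*\);.*/\1/p')
   echo "$host_port"
   ```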
   
   **Steps to reproduce:**
   1. Submit a long-running Spark SQL query through beeline (with Spark on YARN, cluster mode).
   2. In the YARN ResourceManager UI, identify the NodeManager where the Spark driver is running.
   3. Kill the Spark driver, e.g. kill the process or restart the NodeManager.
   
   A new Spark driver is created, but Kyuubi fails to find it.
   Expected: Kyuubi becomes aware of the new Spark driver's IP, and the HA information is updated accordingly.
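   The steps above can be sketched as shell commands. The JDBC URL and application id are placeholders (not taken from this report), and only the final `grep`, run against a captured sample status line, works without a cluster.

   ```shell
   # 1. Submit a long-running query through Kyuubi (URL is a placeholder):
   #      beeline -u 'jdbc:hive2://kyuubi.mycompany.com:10009/' -e '<long-running SQL>'
   # 2. Find the NodeManager hosting the Spark driver (the YARN AM):
   #      yarn application -status <application-id> | grep 'AM Host'
   # 3. On that host, kill the driver JVM or restart the NodeManager service.

   # Step 2 can be exercised against a captured status line:
   status_line='	AM Host : nodemanager42.mycompany.com'
   echo "$status_line" | grep -o 'AM Host : .*'
   ```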
   
   
   
   ### Affects Version(s)
   
   1.6.1
   
   ### Kyuubi Server Log Output
   
   _No response_
   
   ### Kyuubi Engine Log Output
   
   _No response_
   
   ### Kyuubi Server Configurations
   
   ```yaml
   # kyuubi.authentication   KERBEROS
   kyuubi.authentication LDAP
   kyuubi.authentication.ldap.base.dn ou=users,dc=mycompany,dc=com
   kyuubi.authentication.ldap.url ldaps://ldapserver.mycompany.com:636
   kyuubi.authentication.ldap.guidKey uid
   
   kyuubi.kinit.principal hive/[email protected]
   kyuubi.kinit.keytab /opt/kyuubi/conf/hive.keytab
   
   kyuubi.frontend.thrift.binary.bind.port 10009
   
   kyuubi.ha.enabled true
   kyuubi.ha.addresses zookeeper1.mycompany.com,zookeeper2.mycompany.com,zookeeper3.mycompany.com
   kyuubi.ha.zookeeper.client.port 2181
   kyuubi.ha.zookeeper.session.timeout 600000
   
   kyuubi.ha.namespace=kyuubi_zk_onpremise
   kyuubi.session.engine.initialize.timeout=180000
   ```
   
   
   ### Kyuubi Engine Configurations
   
   ```yaml
   spark.acls.enable=true
   spark.submit.deployMode=cluster
   spark.admin.acls=
   spark.driver.extraClassPath=
   spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
   spark.driver.memory=8g
   spark.dynamicAllocation.enabled=true
   spark.dynamicAllocation.initialExecutors=0
   spark.dynamicAllocation.maxExecutors=100
   spark.dynamicAllocation.minExecutors=0
   spark.eventLog.dir=viewfs:///log/spark3
   spark.eventLog.enabled=true
   spark.executor.cores=4
   spark.executor.extraJavaOptions=-XX:+UseNUMA
   spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
   spark.executor.memory=20g
   spark.extraListeners=
   spark.hadoop.dfs.client.datanode-restart.timeout=30
   spark.hadoop.yarn.timeline-service.enabled=false
   spark.io.compression.lz4.blockSize=128kb
   spark.master=yarn
   spark.pyspark.python=/opt/spark-env/bin/python
   spark.shuffle.file.buffer=1m
   spark.shuffle.io.backLog=8192
   spark.shuffle.io.serverThreads=128
   spark.shuffle.service.enabled=true
   spark.shuffle.service.name=spark3_2_shuffle
   spark.shuffle.service.port=7338
   spark.shuffle.unsafe.file.output.buffer=5m
   spark.sql.autoBroadcastJoinThreshold=-1
   spark.sql.adaptive.autoBroadcastJoinThreshold=100m
   spark.sql.adaptive.advisoryPartitionSizeInBytes=256m
   spark.sql.adaptive.coalescePartitions.minPartitionSize=16m
   spark.sql.adaptive.enabled=true
   spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=512m
   spark.sql.adaptive.coalescePartitions.parallelismFirst=false
   spark.sql.cbo.enabled=true
   spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER
   spark.sql.hive.convertMetastoreOrc=true
   spark.sql.hive.metastore.jars=builtin
   spark.sql.orc.compression.codec=zlib
   spark.sql.orc.filterPushdown=true
   spark.sql.orc.impl=native
   spark.sql.parquet.compression.codec=gzip
   spark.sql.queryExecutionListeners=
   spark.sql.sources.partitionOverwriteMode=dynamic
   spark.sql.statistics.fallBackToHdfs=true
   spark.sql.streaming.streamingQueryListeners=
   spark.sql.warehouse.dir=/apps/hive/warehouse
   spark.ui.filters=org.apache.spark.deploy.yarn.YarnProxyRedirectFilter
   spark.unsafe.sorter.spill.reader.buffer.size=1m
   spark.yarn.dist.files=
   spark.yarn.historyServer.address=sparkhistory.mycompany.com:18080
   spark.yarn.historyServer.allowTracking=true
   spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
   spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
   spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
   spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker
   spark.files=/opt/spark/conf/atlas-application.properties
   ```
   
   
   ### Additional context
   
   I am new to Scala, but I am willing to contribute.
   
   ### Are you willing to submit PR?
   
   - [X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
   - [ ] No. I cannot submit a PR at this time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

