vasu2809 opened a new issue, #29258:
URL: https://github.com/apache/airflow/issues/29258

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   We use ComputeEngineSSHHook in some of our Airflow DAGs in Cloud Composer.
   
   Everything works fine when the DAGs run one at a time.
   
   When we run them in parallel, so that multiple tasks connect to our GCE instance through ComputeEngineSSHHook at the same time, we get intermittent errors like the one given below.
   
   Cloud Composer retries tasks 3 times by default, and the second or third attempt sometimes succeeds on its own. We would still like to understand why this happens in the first place when multiple operators generate keys and SSH into the GCE instance.
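
The concurrent access pattern described here can be exercised outside Airflow with a small harness. This is a minimal sketch, not the hook's code: `attempt_connection` is a hypothetical stand-in for whatever opens the SSH session (for example, a function that builds the hook and calls `get_conn()`), and `stub_connection` only simulates rejections so the harness can run on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_concurrently(attempt_connection: Callable[[int], None], workers: int) -> List[int]:
    """Call attempt_connection(i) from `workers` threads at once.

    Returns the indices of the calls that raised an exception, in ascending order.
    """

    def guarded(index: int) -> bool:
        try:
            attempt_connection(index)
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the order of the inputs.
        results = list(executor.map(guarded, range(workers)))
    return [index for index, ok in enumerate(results) if not ok]


def stub_connection(index: int) -> None:
    # Hypothetical stand-in: pretend every third caller is rejected.
    if index % 3 == 2:
        raise RuntimeError("Authentication failed.")


if __name__ == "__main__":
    print(run_concurrently(stub_connection, workers=6))
```

Replacing the stub with a real connection attempt and varying `workers` would show whether the failure rate depends on the number of simultaneous connections.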
   
   We have tried setting the banner_timeout and expire_timeout parameters on the DAG task, but we still see this issue.
   
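       from airflow.providers.google.cloud.hooks.compute_ssh import ComputeEngineSSHHook
       from airflow.providers.ssh.operators.ssh import SSHOperator

       # GCE_INSTANCE, GCE_ZONE, transfer_run_directory and dag are defined elsewhere in the DAG file.
       create_transfer_run_directory = SSHOperator(
           task_id="create_transfer_run_directory",
           ssh_hook=ComputeEngineSSHHook(
               instance_name=GCE_INSTANCE,
               zone=GCE_ZONE,
               use_oslogin=True,
               use_iap_tunnel=False,
               use_internal_ip=True,
           ),
           conn_timeout=120,
           cmd_timeout=120,
           banner_timeout=120.0,
           # The transfer ID is pulled from XCom when the command is rendered.
           command=f"sudo mkdir -p {transfer_run_directory}/"
           '{{ ti.xcom_pull(task_ids="load_config", key="transfer_id") }}',
           dag=dag,
       )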
   
   
   
   **[2023-01-31, 03:30:39 UTC] {compute_ssh.py:286} INFO - Importing SSH public key using OSLogin: [email protected]
   [2023-01-31, 03:30:39 UTC] {compute_ssh.py:236} INFO - Opening remote connection to host: username=sa_115585236623848451866, hostname=10.128.0.29
   [2023-01-31, 03:30:41 UTC] {transport.py:1874} ERROR - Exception (client): Error reading SSH protocol banner
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - Traceback (most recent call last):
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2271, in _check_banner
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     buf = self.packetizer.readline(timeout)
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/packet.py", line 380, in readline
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     buf += self._read_timeout(timeout)
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/packet.py", line 609, in _read_timeout
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     raise EOFError()
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - EOFError
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - 
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - During handling of the above exception, another exception occurred:
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - 
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - Traceback (most recent call last):
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2094, in run
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     self._check_banner()
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2275, in _check_banner
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     raise SSHException(
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
   [2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - 
   [2023-01-31, 03:30:41 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 0s to retry
   [2023-01-31, 03:30:43 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
   [2023-01-31, 03:30:43 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
   [2023-01-31, 03:30:43 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 1s to retry
   [2023-01-31, 03:30:47 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
   [2023-01-31, 03:30:50 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
   [2023-01-31, 03:30:50 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 6s to retry
   [2023-01-31, 03:30:58 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
   [2023-01-31, 03:30:58 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
   [2023-01-31, 03:30:58 UTC] {taskinstance.py:1904} ERROR - Task failed with exception
   Traceback (most recent call last):
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 157, in execute
       with self.get_ssh_client() as ssh_client:
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 124, in get_ssh_client
       return self.get_hook().get_conn()
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 232, in get_conn
       sshclient = self._connect_to_instance(user, hostname, privkey, proxy_command)
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 245, in _connect_to_instance
       client.connect(
     File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 50, in connect
       return super().connect(*args, **kwargs)
     File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 450, in connect
       self._auth(
     File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 781, in _auth
       raise saved_exception
     File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 681, in _auth
       self._transport.auth_publickey(username, pkey)
     File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 1635, in auth_publickey
       return self.auth_handler.wait_for_response(my_event)
     File "/opt/python3.8/lib/python3.8/site-packages/paramiko/auth_handler.py", line 259, in wait_for_response
       raise e
   paramiko.ssh_exception.AuthenticationException: Authentication failed.
   [2023-01-31, 03:30:58 UTC] {taskinstance.py:1408} INFO - Marking task as UP_FOR_RETRY. dag_id=run_data_transfer_configs_dag, task_id=create_transfer_run_directory, execution_date=20230131T033002, start_date=20230131T033035, end_date=20230131T033058
   [2023-01-31, 03:30:58 UTC] {standard_task_runner.py:92} ERROR - Failed to execute job 1418 for task create_transfer_run_directory (Authentication failed.; 21885)**
   
   ### What you think should happen instead
   
   SSHOperator with ComputeEngineSSHHook should connect to the GCE instance reliably, without intermittent authentication failures, even when several tasks connect at the same time.
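
Purely as an illustration of that expectation, and not as a proposed patch, the sketch below retries a connection a bounded number of times with caller-supplied delays. Everything in it is an assumption: `connect_with_backoff` is a hypothetical helper, the delays are arbitrary, and with paramiko the `retry_on` argument would presumably be `paramiko.ssh_exception.AuthenticationException`, the exception raised in the traceback above.

```python
import time
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    delays: Sequence[float],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() up to len(delays) + 1 times.

    After each failed attempt except the last, wait for the next delay.
    The error from the final attempt propagates to the caller.
    """
    for delay in delays:
        try:
            return connect()
        except retry_on:
            sleep(delay)
    # Final attempt: any exception raised here is not caught.
    return connect()


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_connect() -> str:
        # Hypothetical stand-in that is rejected twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("Authentication failed.")
        return "connected"

    print(connect_with_backoff(flaky_connect, delays=[1, 2, 4], sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in a task it would default to `time.sleep`.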
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Composer Kubernetes Cluster
   
   ### Versions of Apache Airflow Providers
   
   Composer Version - 2.1.3
   Airflow version - 2.3.4
   
   ### Deployment
   
   Composer
   
   ### Deployment details
   
   Kubernetes Cluster
   GCE Compute Engine VM (Ubuntu)
   
   ### Anything else
   
   The failure is random and intermittent, and we only see it when multiple tasks connect at the same time.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
