vasu2809 opened a new issue, #29258:
URL: https://github.com/apache/airflow/issues/29258
### Apache Airflow version
Other Airflow 2 version (please specify below)
### What happened
We use ComputeEngineSSHHook in some of our Airflow DAGs in Cloud Composer.
Everything works fine when the DAGs run one at a time.
But when we run tasks in parallel, so that multiple tasks try to connect to
our GCE instance through ComputeEngineSSHHook at the same time,
we get intermittent errors like the one given below.
Since Cloud Composer retries 3 times by default, the issue sometimes resolves
itself on the second or third attempt, but we would like to understand why it
occurs in the first place when multiple operators are generating keys and
opening SSH connections to the GCE instance.
We have tried setting the banner_timeout and expire_timeout parameters on the
DAG task, but we still see the issue:
from airflow.providers.google.cloud.hooks.compute_ssh import ComputeEngineSSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator

create_transfer_run_directory = SSHOperator(
    task_id="create_transfer_run_directory",
    ssh_hook=ComputeEngineSSHHook(
        instance_name=GCE_INSTANCE,
        zone=GCE_ZONE,
        use_oslogin=True,
        use_iap_tunnel=False,
        use_internal_ip=True,
    ),
    conn_timeout=120,
    cmd_timeout=120,
    banner_timeout=120.0,
    command=f"sudo mkdir -p {transfer_run_directory}/"
    '{{ ti.xcom_pull(task_ids="load_config", key="transfer_id") }}',
    dag=dag,
)
**[2023-01-31, 03:30:39 UTC] {compute_ssh.py:286} INFO - Importing SSH
public key using OSLogin: [email protected]
[2023-01-31, 03:30:39 UTC] {compute_ssh.py:236} INFO - Opening remote
connection to host: username=sa_115585236623848451866, hostname=10.128.0.29
[2023-01-31, 03:30:41 UTC] {transport.py:1874} ERROR - Exception (client):
Error reading SSH protocol banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - Traceback (most
recent call last):
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - File
"/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2271,
in _check_banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - buf =
self.packetizer.readline(timeout)
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - File
"/opt/python3.8/lib/python3.8/site-packages/paramiko/packet.py", line 380, in
readline
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - buf +=
self._read_timeout(timeout)
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - File
"/opt/python3.8/lib/python3.8/site-packages/paramiko/packet.py", line 609, in
_read_timeout
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - raise EOFError()
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - EOFError
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - During handling of
the above exception, another exception occurred:
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - Traceback (most
recent call last):
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - File
"/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2094,
in run
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
self._check_banner()
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - File
"/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2275,
in _check_banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - raise
SSHException(
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
[2023-01-31, 03:30:41 UTC] {compute_ssh.py:258} INFO - Failed to connect.
Waiting 0s to retry
[2023-01-31, 03:30:43 UTC] {transport.py:1874} INFO - Connected (version
2.0, client OpenSSH_8.9p1)
[2023-01-31, 03:30:43 UTC] {transport.py:1874} INFO - Authentication
(publickey) failed.
[2023-01-31, 03:30:43 UTC] {compute_ssh.py:258} INFO - Failed to connect.
Waiting 1s to retry
[2023-01-31, 03:30:47 UTC] {transport.py:1874} INFO - Connected (version
2.0, client OpenSSH_8.9p1)
[2023-01-31, 03:30:50 UTC] {transport.py:1874} INFO - Authentication
(publickey) failed.
[2023-01-31, 03:30:50 UTC] {compute_ssh.py:258} INFO - Failed to connect.
Waiting 6s to retry
[2023-01-31, 03:30:58 UTC] {transport.py:1874} INFO - Connected (version
2.0, client OpenSSH_8.9p1)
[2023-01-31, 03:30:58 UTC] {transport.py:1874} INFO - Authentication
(publickey) failed.
[2023-01-31, 03:30:58 UTC] {taskinstance.py:1904} ERROR - Task failed with
exception
Traceback (most recent call last):
File
"/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py",
line 157, in execute
with self.get_ssh_client() as ssh_client:
File
"/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py",
line 124, in get_ssh_client
return self.get_hook().get_conn()
File
"/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py",
line 232, in get_conn
sshclient = self._connect_to_instance(user, hostname, privkey,
proxy_command)
File
"/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py",
line 245, in _connect_to_instance
client.connect(
File
"/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py",
line 50, in connect
return super().connect(*args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line
450, in connect
self._auth(
File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line
781, in _auth
raise saved_exception
File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line
681, in _auth
self._transport.auth_publickey(username, pkey)
File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py",
line 1635, in auth_publickey
return self.auth_handler.wait_for_response(my_event)
File
"/opt/python3.8/lib/python3.8/site-packages/paramiko/auth_handler.py", line
259, in wait_for_response
raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed.
[2023-01-31, 03:30:58 UTC] {taskinstance.py:1408} INFO - Marking task as
UP_FOR_RETRY. dag_id=run_data_transfer_configs_dag,
task_id=create_transfer_run_directory, execution_date=20230131T033002,
start_date=20230131T033035, end_date=20230131T033058
[2023-01-31, 03:30:58 UTC] {standard_task_runner.py:92} ERROR - Failed to
execute job 1418 for task create_transfer_run_directory (Authentication
failed.; 21885)**
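
For readers following the log above: the hook retries with growing waits (0s, 1s, 6s) and then gives up once authentication keeps failing. The sketch below illustrates that general pattern, a retry loop with capped exponential backoff and random jitter. It is only an illustration under stated assumptions, not the provider's actual code: `connect_with_backoff`, `ConnectError`, and the default limits are hypothetical names and values chosen for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ConnectError(Exception):
    """Hypothetical stand-in for a transient SSH failure (not a paramiko class)."""


def connect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 5,
    initial_wait: float = 1.0,
    max_wait: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or max_attempts have been made."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    wait_cap = initial_wait
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectError:
            # Out of attempts: surface the last failure to the caller.
            if attempt == max_attempts:
                raise
        # Jitter: sleep a random time in [0, wait_cap], then double the cap
        # (up to max_wait) so that parallel tasks do not retry in lockstep.
        sleep(random.uniform(0, wait_cap))
        wait_cap = min(wait_cap * 2, max_wait)
    raise AssertionError("unreachable")
```

In a real task, `connect` would wrap whatever call opens the SSH session and would translate the transient exception into `ConnectError`. The `sleep` parameter exists only so the waiting can be replaced in tests.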
### What you think should happen instead
An SSHOperator using ComputeEngineSSHHook should be able to SSH into the GCE
instance reliably, without intermittent authentication failures.
### How to reproduce
_No response_
### Operating System
Composer Kubernetes Cluster
### Versions of Apache Airflow Providers
Composer Version - 2.1.3
Airflow version - 2.3.4
### Deployment
Composer
### Deployment details
Kubernetes Cluster
GCE Compute Engine VM (Ubuntu)
### Anything else
The failure is random and intermittent.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)