aagateuip opened a new issue, #32111:
URL: https://github.com/apache/airflow/issues/32111
### Apache Airflow version
Other Airflow 2 version (please specify below)
### What happened
We have seen that KubernetesPodOperator sometimes fails to retrieve the JSON
result from the xcom sidecar container because of network connectivity issues,
or in some cases retrieves incomplete JSON that cannot be parsed. The
KubernetesPodOperator task then fails with one of these stack traces,
e.g.
`File
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py",
line 398, in execute
result = self.extract_xcom(pod=self.pod)
File
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py",
line 372, in extract_xcom
result = self.pod_manager.extract_xcom(pod)
File
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py",
line 369, in extract_xcom
_preload_content=False,
File
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/stream/stream.py",
line 35, in _websocket_request
return api_method(*args, **kwargs)
File
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py",
line 994, in connect_get_namespaced_pod_exec
return self.connect_get_namespaced_pod_exec_with_http_info(name, namespace,
**kwargs) # noqa: E501
File
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py",
line 1115, in connect_get_namespaced_pod_exec_with_http_info
collection_formats=collection_formats)
File
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py",
line 353, in call_api
_preload_content, _request_timeout, _host)
File
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py",
line 184, in __call_api
_request_timeout=_request_timeout)
File
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/stream/ws_client.py",
line 518, in websocket_call
raise ApiException(status=0, reason=str(e))
kubernetes.client.exceptions.ApiException: (0)
Reason: Connection to remote host was lost.`
OR
`
File
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py",
line 398, in execute
result = self.extract_xcom(pod=self.pod)
File
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py",
line 374, in extract_xcom
return json.loads(result)
File "/usr/local/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python3.7/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column
4076 (char 4075)
`
We are using airflow 2.6.1 and
apache-airflow-providers-cncf-kubernetes==4.0.2
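
The second trace is consistent with a payload that was cut off mid-string. A minimal sketch of that failure mode, using only the standard library (the helper name `is_complete_json` and the sample payload are made up for illustration, they are not part of the provider):

```python
import json


def is_complete_json(text: str) -> bool:
    """Return True only if `text` parses as a full JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical payload standing in for /airflow/xcom/return.json.
full = json.dumps({"message": "x" * 50})
# Simulate a read that lost the connection part-way through the string value.
truncated = full[:30]

print(is_complete_json(full))       # complete document parses
print(is_complete_json(truncated))  # truncated document does not
```

Parsing `truncated` raises `json.decoder.JSONDecodeError` with an "Unterminated string starting at" message, the same kind of error as in the trace above.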
### What you think should happen instead
KubernetesPodOperator should not fail on these intermittent network
connectivity issues when pulling JSON from the xcom sidecar container. It should
retry, and verify that it retrieved valid JSON before it kills the xcom
sidecar container.
extract_xcom should
* Read the return JSON from /airflow/xcom/return.json and try to parse it, to
catch the case where network connectivity issues leave us with incomplete
(truncated) JSON.
* Retry the read of the JSON. Hopefully this also guards against other
network errors (with the kubernetes stream trying to talk to the airflow
container to get the return JSON).
* Kill the sidecar container only once the return JSON has been read and
parsed successfully (i.e. it is valid).
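
The three points above can be sketched roughly as follows. This is only an illustration of the proposed control flow, not the provider's code: `read_valid_json`, `extract_then_stop`, `read_raw` and `stop_sidecar` are hypothetical names, and in the real operator the retried exception types would presumably include `kubernetes.client.exceptions.ApiException`.

```python
import json
import time
from typing import Any, Callable, Tuple, Type


def read_valid_json(
    read_raw: Callable[[], str],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError,),
) -> Any:
    """Call `read_raw` until it returns text that parses as JSON.

    `read_raw` is a hypothetical callable standing in for the exec call
    that reads /airflow/xcom/return.json from the sidecar.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Parsing here catches truncated payloads before anything is killed.
            return json.loads(read_raw())
        except (json.JSONDecodeError,) + tuple(retry_on) as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(
        "no valid JSON after {} attempts".format(attempts)
    ) from last_error


def extract_then_stop(
    read_raw: Callable[[], str],
    stop_sidecar: Callable[[], None],
    **retry_kwargs: Any
) -> Any:
    """Stop the sidecar only after a valid result has been read and parsed."""
    result = read_valid_json(read_raw, **retry_kwargs)
    stop_sidecar()  # not reached if reading or parsing kept failing
    return result
```

With this ordering, a read that keeps failing leaves the sidecar running, so a later attempt can still fetch the result.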
### How to reproduce
This occurs intermittently, so it is hard to reproduce. It happens when the
Kubernetes cluster is under load. We see it once or twice in 7 days.
### Operating System
Debian GNU/Linux 11 (bullseye)
### Versions of Apache Airflow Providers
airflow 2.6.1 and apache-airflow-providers-cncf-kubernetes==4.0.2
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)