baryluk commented on a change in pull request #17649:
URL: https://github.com/apache/airflow/pull/17649#discussion_r690765281



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -217,7 +223,7 @@ def base_container_is_running(self, pod: V1Pod):
             return False
         return status.state.running is not None
 
-    @tenacity.retry(stop=tenacity.stop_after_attempt(3), wait=tenacity.wait_exponential(), reraise=True)
+    @tenacity.retry(stop=tenacity.stop_after_attempt(4), wait=tenacity.wait_exponential(), reraise=True)

Review comment:
       I can revert that part for now. It does help a little, but now that the 
outer loop retries when needed (for as long as the pod is alive), it indeed 
seems unnecessary.

##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -143,12 +143,21 @@ def monitor_pod(self, pod: V1Pod, get_logs: bool) -> Tuple[State, V1Pod, Optiona
             read_logs_since_sec = None
             last_log_time = None
             while True:
-                logs = self.read_pod_logs(pod, timestamps=True, since_seconds=read_logs_since_sec)
-                for line in logs:
-                    timestamp, message = self.parse_log_line(line.decode('utf-8'))
-                    self.log.info(message)
-                    if timestamp:
-                        last_log_time = timestamp
+                try:
+                    logs = self.read_pod_logs(pod, timestamps=True, since_seconds=read_logs_since_sec)
+                    for line in logs:
+                        timestamp, message = self.parse_log_line(line.decode('utf-8'))
+                        self.log.info(message)
+                        if timestamp:
+                            last_log_time = timestamp
+                except Exception as e:

Review comment:
       I wanted something broad, though `Exception` might be a bit too much. But 
there are more failure modes than `TimeoutError`, e.g. DNS issues, authorization 
issues, SSL errors, protocol errors, etc. Basically, all of these should be 
ignored: check whether the pod is still alive, and retry later.
   
   `TimeoutError` would definitely help in my use case, but I think it is a 
bit too narrow, and it also relies on the fact that the kube client uses urllib3 
internally, which might not be the case in the future.
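
To make the intent concrete, here is a minimal sketch of the loop described above: a failed log read is logged, the pod is checked, and the read is retried only while the pod is alive. `follow_logs`, `read_logs`, and `pod_is_alive` are hypothetical stand-ins, not names from `pod_launcher.py`, and the builtin `OSError` is only a placeholder for whichever exception class ends up being caught.

```python
import logging
import time
from typing import Callable, Iterable, List, Tuple, Type

log = logging.getLogger(__name__)


def follow_logs(
    read_logs: Callable[[], Iterable[bytes]],
    pod_is_alive: Callable[[], bool],
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    delay: float = 1.0,
) -> List[str]:
    """Collect log lines, retrying failed reads for as long as the pod is alive."""
    lines: List[str] = []
    while True:
        try:
            for raw in read_logs():
                lines.append(raw.decode("utf-8"))
        except retry_on as exc:
            # Do not fail the task here; the liveness check below decides
            # whether another read is attempted.
            log.warning("Reading pod logs failed: %s", exc)
        if not pod_is_alive():
            return lines
        # A real implementation would also pass something like since_seconds
        # to read_logs so that the next read does not repeat earlier lines.
        time.sleep(delay)
```

Anything not listed in `retry_on` still propagates, which is the trade-off between catching `Exception` and catching a narrower base class.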

##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -143,12 +143,21 @@ def monitor_pod(self, pod: V1Pod, get_logs: bool) -> Tuple[State, V1Pod, Optiona
             read_logs_since_sec = None
             last_log_time = None
             while True:
-                logs = self.read_pod_logs(pod, timestamps=True, since_seconds=read_logs_since_sec)
-                for line in logs:
-                    timestamp, message = self.parse_log_line(line.decode('utf-8'))
-                    self.log.info(message)
-                    if timestamp:
-                        last_log_time = timestamp
+                try:
+                    logs = self.read_pod_logs(pod, timestamps=True, since_seconds=read_logs_since_sec)
+                    for line in logs:
+                        timestamp, message = self.parse_log_line(line.decode('utf-8'))
+                        self.log.info(message)
+                        if timestamp:
+                            last_log_time = timestamp
+                except Exception as e:

Review comment:
       Ok. Used `urllib3.exceptions.HTTPError`, which is the base class for many 
socket-related errors in `urllib3`, including connection, protocol, timeout, and 
response parsing errors.
   
   Should be good.
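
For reference, a quick check of which failures that base class covers, assuming `urllib3` is importable in the environment. The selection of classes is mine, picked to match the failure modes discussed in this thread; it is not an exhaustive list.

```python
from urllib3 import exceptions as u3

# Exception classes matching the failure modes discussed in this thread.
candidates = [
    u3.ProtocolError,        # broken or reset connections
    u3.ReadTimeoutError,     # server stopped sending data
    u3.ConnectTimeoutError,  # connection not established in time
    u3.SSLError,             # TLS failures
    u3.DecodeError,          # response decoding problems
]

# Map each class name to whether `except urllib3.exceptions.HTTPError` catches it.
covered = {cls.__name__: issubclass(cls, u3.HTTPError) for cls in candidates}

# The builtin TimeoutError is a different class from urllib3's own
# TimeoutError and is not part of this hierarchy.
builtin_timeout_covered = issubclass(TimeoutError, u3.HTTPError)
```

So errors raised outside `urllib3` (for example a builtin `TimeoutError` that is not wrapped) would still propagate past this handler.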

##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -217,7 +223,7 @@ def base_container_is_running(self, pod: V1Pod):
             return False
         return status.state.running is not None
 
-    @tenacity.retry(stop=tenacity.stop_after_attempt(3), wait=tenacity.wait_exponential(), reraise=True)
+    @tenacity.retry(stop=tenacity.stop_after_attempt(4), wait=tenacity.wait_exponential(), reraise=True)

Review comment:
       Removed.





