dstandish commented on code in PR #28818:
URL: https://github.com/apache/airflow/pull/28818#discussion_r1066213230
##########
airflow/utils/log/log_reader.py:
##########
@@ -77,12 +78,16 @@ def read_log_stream(self, ti: TaskInstance, try_number: int | None, metadata: di
         metadata.pop("max_offset", None)
         metadata.pop("offset", None)
         metadata.pop("log_pos", None)
-        while "end_of_log" not in metadata or (
-            not metadata["end_of_log"] and ti.state not in [State.RUNNING, State.DEFERRED]
-        ):
+        while True:
             logs, metadata = self.read_log_chunks(ti, current_try_number, metadata)
             for host, log in logs[0]:
                 yield "\n".join([host or "", log]) + "\n"
+            if "end_of_log" not in metadata or (
+                not metadata["end_of_log"] and ti.state not in [State.RUNNING, State.DEFERRED]
+            ):
+                time.sleep(0.5)
Review Comment:
Yes, I should definitely add a comment; that's a good point.
Separately, do you think we should treat other handlers differently? Generally they are going to be network handlers, I think. If it's just FileTaskHandler, the cost is only unnecessary CPU; but for remote handlers (probably the most common case) it's also about throttling requests to external services. The logs reader has a timeout of 5 minutes, so if a user leaves the logs tab open while the task emits no logs for that long, it will keep hitting Elasticsearch as fast as it can. Now suppose many users are doing that at the same time? Admittedly it's probably unlikely to cause a serious problem in practice, but I do remember a report on Slack about cluster load, and I imagine this could be part of it.
It does not require long-sleeping tasks either: every time the reader is waiting for new logs, it's looping. So the sleep seems like a reasonable thing to do, but I'm sincerely interested in what you think (apart from the mechanics of making this mergeable).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]