Re: [PR] Add metrics about task CPU and memory usage [airflow]

via GitHub Wed, 15 May 2024 12:13:58 -0700


o-nikolas commented on code in PR #39650:
URL: https://github.com/apache/airflow/pull/39650#discussion_r1602123572



##########
airflow/task/task_runner/standard_task_runner.py:
##########
@@ -53,6 +55,11 @@ def start(self):
         else:
             self.process = self._start_by_exec()
 
+        if self.process:
+            log_reader = threading.Thread(target=self._read_task_utilization)

Review Comment:
   `log_reader` is a strange name for this; it isn't really reading any logs.
   
   Maybe `resource_monitor`?



##########
airflow/task/task_runner/standard_task_runner.py:
##########
@@ -53,6 +55,11 @@ def start(self):
         else:
             self.process = self._start_by_exec()
 
+        if self.process:

Review Comment:
   Do we want to add an option to enable/disable this feature? It's adding a 
new thread to every task execution which I could see some users not loving.
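
A minimal sketch of what such a toggle could look like, using only the standard library. `RunnerSketch` is a stand-in class and `monitor_enabled` is a hypothetical flag (for example, one read from configuration); neither exists in Airflow, and the sampling loop is left empty on purpose.

```python
import threading


class RunnerSketch:
    """Stand-in for the task runner; not Airflow's actual class."""

    def __init__(self, monitor_enabled: bool):
        # Hypothetical opt-in flag; the real option name and source are undecided.
        self.monitor_enabled = monitor_enabled
        self.process = object()  # placeholder for the started task process
        self.resource_monitor = None

    def _read_task_utilization(self):
        # The sampling loop from the PR would go here.
        pass

    def start(self):
        # Only spawn the extra thread when a process exists and the feature is on.
        if self.process and self.monitor_enabled:
            self.resource_monitor = threading.Thread(
                target=self._read_task_utilization, daemon=True
            )
            self.resource_monitor.start()
```

With the flag off, `start()` creates no thread at all, so users who don't want the overhead pay nothing for it.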



##########
airflow/task/task_runner/standard_task_runner.py:
##########
@@ -186,3 +193,12 @@ def get_process_pid(self) -> int:
         if self.process is None:
             raise RuntimeError("Process is not started yet")
         return self.process.pid
+
+    def _read_task_utilization(self):
+        while True:
+            dag_id = self._task_instance.dag_id
+            task_id = self._task_instance.task_id
+            mem_usage = self.process.memory_percent()
+            cpu_usage = self.process.cpu_percent(interval=1)

Review Comment:
   Interesting, `interval=1` is what keeps this `while` loop from running too
   frequently.
   
   I see this in the docs:
   
   > When interval is > 0.0 compares process times to system CPU times elapsed
   > before and after the interval (blocking). When interval is 0.0 or None
   > compares process times to system CPU times elapsed since last call,
   > returning immediately. That means the first time this is called it will
   > return a meaningless 0.0 value which you are supposed to ignore. **In this
   > case is recommended for accuracy that this function be called a second
   > time with at least 0.1 seconds between calls.**
   
   Should we not follow that recommendation? Also, is 1 second too tight of a
   loop? I think that's going to be quite spammy.
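
One possible shape for the loop, sketched with the standard library only: a priming sample whose result is discarded, then one sample per `period`, with a `threading.Event` so the thread can be stopped. The names `monitor`, `sample`, `emit`, `stop`, and `period` are hypothetical, and the 5-second default is a placeholder rather than a recommendation. In the PR, `sample` would presumably wrap the non-blocking `cpu_percent(interval=None)` and `memory_percent()` calls.

```python
import threading


def monitor(sample, emit, stop, period=5.0):
    """Call sample() once per `period` seconds until `stop` is set.

    sample: callable returning a reading (hypothetical wrapper around the
            process CPU/memory calls).
    emit:   callable that receives each reading, e.g. a metrics gauge.
    stop:   threading.Event used to end the loop.
    """
    # Priming call: in non-blocking mode the first CPU reading is meaningless,
    # so its result is discarded.
    sample()
    # Event.wait() returns False on timeout and True once `stop` is set, so it
    # serves as both the rate limiter and the exit condition.
    while not stop.wait(period):
        emit(sample())
```

Because the wait happens on the event rather than inside the CPU call, the sampling period can be tuned independently of the measurement window, and the thread exits promptly when the task finishes.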



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
