jason810496 opened a new pull request, #54552: URL: https://github.com/apache/airflow/pull/54552
related:
- https://github.com/apache/airflow/pull/54445
- https://github.com/apache/airflow/pull/49470

## Why

The `get_log` API with the `application/nd-json` header should stream logs to the end, but the `FileTaskHandler` can only read content **that has already been flushed to the file and cannot access content that is still being written**.

## What

There should be a polling mechanism that checks the file for new content so that the API connection can remain open and keep streaming content to the frontend.

- Add `stream_file_until_close` with `poll_interval = 0.1` and `idle_timeout = 10.0` seconds to keep streaming the file content until it remains unchanged for longer than the `idle_timeout` threshold.
- Add time-based flushing to `_interleave_logs` so that logs are flushed on a time interval even when the heap has not reached `HEAP_DUMP_SIZE`, preventing display delays on the frontend.
- Remove `log_pos` from the log metadata.
- Remove `LogStreamAccumulator` so that the API can emit log records as soon as possible.
  - Because `LogStreamAccumulator` flushes the log stream to a temp file to compute the total number of log lines for the frontend, it has to wait until the log stream ends before it can replay the stream.

## Example Dag for Testing the `get_log` API Streaming Results

```python
import time
from uuid import uuid4

import pendulum

# Imports added for completeness; adjust to your Airflow version
# (e.g. `from airflow.decorators import task` on Airflow 2).
from airflow.sdk import DAG, task

with DAG(
    dag_id="test_streaming_log",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
):

    @task()
    def produce_slowly():
        print("Large task logs DAG starting")
        for _ in range(800):
            print(uuid4())
            time.sleep(1)
        print("Large task logs DAG done")

    produce_slowly()
```

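The time-based flushing described in the change list can be sketched as a merge over timestamped records that drains the heap on a size threshold *or* an elapsed-time threshold. Everything here (`FLUSH_INTERVAL`, the record shape, the full drain) is an assumption for illustration; the real `_interleave_logs` in Airflow is more involved:

```python
import heapq
import time
from typing import Iterable, Iterator, List, Tuple

HEAP_DUMP_SIZE = 500_000  # assumed size threshold, as in the PR description
FLUSH_INTERVAL = 1.0      # hypothetical time-based threshold, in seconds


def interleave_with_time_flush(records: Iterable[Tuple[float, str]]) -> Iterator[str]:
    """Merge (timestamp, line) records into timestamp order, flushing the
    buffer when it grows past HEAP_DUMP_SIZE *or* FLUSH_INTERVAL elapses,
    so slow log producers no longer delay output to the frontend."""
    heap: List[Tuple[float, str]] = []
    last_flush = time.monotonic()
    for ts, line in records:
        heapq.heappush(heap, (ts, line))
        if len(heap) >= HEAP_DUMP_SIZE or time.monotonic() - last_flush >= FLUSH_INTERVAL:
            # Drain in timestamp order, then restart the flush clock.
            while heap:
                yield heapq.heappop(heap)[1]
            last_flush = time.monotonic()
    # Drain whatever remains once the input is exhausted.
    while heap:
        yield heapq.heappop(heap)[1]
```

The design trade-off is the usual one for streaming merges: flushing early improves latency but can emit a line ahead of a later-arriving record with a smaller timestamp, which is why the size threshold alone was used before this change.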