I am fine with the explanation - looks good, On Sat, Aug 16, 2025 at 4:59 AM Zhe-You Liu <zhu424....@gmail.com> wrote:
> Thanks Daniel, Jarek, > > Sorry, I should have described the background of this proposal more clearly > and also mentioned the back-compatibility plan. > > Additionally, the title should be **"Remove `log_pos` metadata and > `continuation_token` logic for `get_log` API"**, which better reflects the > proposal. > The JSON format will remain supported for compatibility. > > For Daniel's > > > Since if your change comes out in 3.2, then providers that are not pinned > >= 3.2 would still need to support the old ways, and the core > filetaskhandler might need to continue to support the old way until 4.0. > Right? > > This change will not break existing providers. The back-compatibility layer > for provider task log handlers was already completed in PR #49470 [1]. This > change will only affect the core FileTaskHandler. > > Providers are already planned to be refactored to support the new streaming > read interface (returning a generator of `str` or `StructuredLogMessage`) > instead of explicitly reading logs into a large list. > > For Jarek's > > > * 1st of all - was the "zip format" a mistake? I understand that we are > talking about switching to NDJSon only? (It can be .gzipped or whatever, > but that is more of the transport layer? > > Not a mistake, but a new feature. I should have separated the idea of "Zip > format" as an independent point. That is, supporting "download full logs in > Zip format" would be a new feature, unrelated to the streaming read API > change. > > > * 2ndly - do I understand that the proposal is to change this: > > Yes, this proposal is mainly focused on removing the `log_pos` logic, which > also implies removing the `continuation_token` logic. > > > And simply switch to reading the logs by the client line-by-line from the > API - just streaming the answer effectively and yielding line-by-line? Is > that the correct understanding? > > Yes, exactly. The TaskInstanceLog frontend page will only need to call the > `get_log` API once, and then it can stream logs line-by-line until > completion after the #54552 fix. > > Before the #54552 fix, the FileTaskHandler *could only read content already > flushed to the file and could not access content still being written.* > Therefore, a polling mechanism was needed to detect new file changes and > keep the API connection open for streaming content to the frontend. While > implementing this polling mechanism, I also realized that removing the > `log_pos` metadata is necessary so that the streaming API can return log > records immediately. Otherwise, the frontend would be blocked until the > entire log stream was finished. > ( Since we need `LogStreamAccumulator` to count the `log_pos`, but it will > flush the log stream to temp file then read them back. ) > > > * what would be the proposed timeline for that change ? > > Ideally released in Airflow 3.1, since removing the `log_pos` metadata and > `continuation_token` logic is a breaking change. > > > * what is the impact on current remote log handlers? > > No impact on current remote log handlers. Only the core FileTaskHandler and > `get_log` API will be affected. > > > Are all of them supporting direct streaming **today** or do we need to > complete any implementation there? > > Currently, only the CloudWatch log handler is in progress for supporting > the new streaming read interface [2]. Other remote log handlers are planned > to support the new interface, but are not yet implemented [3]. > > > * will that change work with older versions of providers (for those > handlers) or would we have to make Airflow (after it is implemented) only > work with newer versions of those? > > The back-compatibility layer for provider task log handlers was already > completed in PR #49470 [1]. > > > * I guess retrieving logs via the api is something our users might use > heavily and I understa that means incompatible API change, and necessity to > make our users modify their APIs (in case they use Json format? Or even if > they use ndjson?) Is there any way to provide any compatibility solution ? > > The JSON format will remain supported for compatibility, but the `log_pos` > metadata and `continuation_token` logic will be removed. > > > So my question is really - how much this change is on the scale of pain > for > > Not much for maintainers, since the back-compatibility layer for remote log > handlers is already implemented. We also have a plan to support the new > streaming read interface across remote log handlers. > > For users, this will be a breaking change for the `get_log` API in JSON > format due to the removal of `log_pos` metadata and `continuation_token` > logic. The `ndjson` format will remain unchanged, but with the fix to the > streaming issue, the API will now truly stream logs until the end and keep > the connection open until the log is fully read. > > Thanks again for your feedback. > I hope this clarifies the proposal and the back-compatibility plan. > > [1]: > > https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/utils/log/file_task_handler.py#L613-L630 > [2]: https://github.com/apache/airflow/pull/54054 > [3]: https://github.com/apache/airflow/issues/45079 ( under the "TODO > Tasks" section ) > > On Sat, Aug 16, 2025 at 12:49 AM Jarek Potiuk <ja...@potiuk.com> wrote: > > > Let me rephrase the proposal as I understand it (it is loaded with a few > > terms and code and PRs so I think everyone would benefit with a "lame" > > description of someone who tries to understand it. > > > > * 1st of all - was the "zip format" a mistake? I understand that we are > > talking about switching to NDJSon only? (It can be .gzipped or whatever, > > but that is more of the transport layer? > > * 2ndly - do I understand that the proposal is to change this: > > ``` > > def get_log( > > ... > > accept: HeaderAcceptJsonOrNdjson, > > token: .. > > ``` > > > > into this (also making it a streaming API): > > > > ``` > > def get_log( > > ... > > accept: HeaderAcceptNdjson, > > > no token > > ``` > > > > And simply switch to reading the logs by the client line-by-line from the > > API - just streaming the answer effectively and yielding line-by-line? > Is > > that the correct understanding? > > > > If that's a correct understanding (or maybe regardless) I have few > > questions: > > > > * what would be the proposed timeline for that change ? > > * what is the impact on current remote log handlers? Are all of them > > supporting direct streaming **today** or do we need to complete any > > implementation there? > > * will that change work with older versions of providers (for those > > handlers) or would we have to make Airflow (after it is implemented) only > > work with newer versions of those? > > * I guess retrieving logs via the api is something our users might use > > heavily and I understa that means incompatible API change, and necessity > to > > make our users modify their APIs (in case they use Json format? Or even > if > > they use ndjson?) Is there any way to provide any compatibility > solution ? > > Maybe we could have some compatibility shim / extra package (?) that > could > > provide the compatibility layer but would be removed out of the core? > > > > I think any of such change should be considered from two angles: > > > > * pain to maintainers > > * pain to users > > > > So my question is really - how much this change is on the scale of pain > for > > users / maintainers. > > > > J. > > > > > > > > On Fri, Aug 15, 2025 at 6:20 PM Zhe-You(Jason) Liu <jason...@apache.org> > > wrote: > > > > > Hi all, > > > > > > I would like to propose a change to the get_log API [1]: removing the > > > log_pos metadata and discontinuing the JSON response format. > > > > > > Following the update “Update useLog to support application/x-ndjson > > #54445” > > > [2], the frontend will adopt the application/x-ndjson format, which is > > more > > > efficient for streaming logs. Additionally, with the upcoming fix > > “Support > > > streaming log to the end for get_log API #54552” [3] (currently still a > > > draft PR), the get_log API will support streaming logs to completion. > > This > > > enhancement makes the log_pos metadata and the continuation_token logic > > [4] > > > unnecessary. > > > > > > In the previous update, “Resolve OOM When Reading Large Logs in > Webserver > > > #49470” [5], I introduced LogStreamAccumulator [6] to handle the > log_pos > > > metadata by flushing the log stream to temporary files and reading them > > > back. However, with the new streaming support, we can now yield the log > > > stream directly to the end in a single API call, improving performance > > and > > > reducing complexity. > > > > > > Benchmark results [7] show that, even after the #49470 refactor, using > > the > > > JSON format with get_log can still cause OOM issues when reading large > > > logs. Instead of continuing to support the JSON format, I suggest > > > *replacing > > > the JSON format with a Zip format*. This would allow users to > > conveniently > > > download full logs while maintaining memory efficiency for the API > > server. > > > > > > I appreciate your consideration of this proposal and look forward to > your > > > feedback. > > > > > > Thank you! > > > > > > Best regards, > > > Jason > > > > > > [1]: > > > > > > > > > https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/api_fastapi/core_api/routes/public/log.py#L75 > > > [2]: https://github.com/apache/airflow/pull/54445 > > > [3]: https://github.com/apache/airflow/pull/54552 > > > [4]: > > > > > > > > > https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/api_fastapi/core_api/routes/public/log.py#L166 > > > [5]: https://github.com/apache/airflow/pull/49470 > > > [6]: > > > > > > > > > https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/utils/log/log_stream_accumulator.py > > > [7]: > > https://github.com/apache/airflow/pull/49470#issuecomment-2908306229 > > > > > >