potiuk commented on issue #31105: URL: https://github.com/apache/airflow/issues/31105#issuecomment-1605911398
> We can assign hard-limits and that would solve this current issue. We can just sum filesizes before loading any of them (or most of them, I do not know we can get all of their sizes) into memory. Or count rows if can't get the sizes - that should be easier.

> But, I believe even for file sizes smaller than that we need not load the whole file into memory and sort for every auto-tailing call.

Correct - not having to join the files in-memory will definitely decrease the memory requirements for the webserver, even for smaller files.

> We can do this. Now, this will be in sorted order however if the task is still running new logs could have come in the time we sent our response. Even for this, we need to maintain a log position for different log streams and call the reading methods with appropriate metadata to update this temp file.

Agree. I think the log pos in metadata in this case will be a bit tricky and should be "per stream"/"user". There will be cases where someone just auto-tails the logs (in which case it is fine to keep the returned data in memory), but when someone reads the log from the beginning, the size might still be substantial (so keeping it in a file makes sense).

And we should likely have a separate path for cases like S3 if we want to optimize further - for cases where the remote log is never streaming, because there we either have a full log file or nothing (this is object storage, not file storage, so for S3 we will not even see a log until it is complete). In this case we should not worry about "tailing" the log. And we could use some caching (at the expense of isolation) so that the log from S3 is downloaded only once per "webserver" and kept for some time (and reused between users/sessions), so that if a few people look at the same task log or hit the refresh button, the "remote" reading of that file will not happen over and over again.

> And for some of the methods we still need to load the whole file into memory like the HDFS-one.
> For them we could filter out after loading into memory.

Hmm. Not sure how HDFS works in this case, or how much it is "streaming" vs. "static" (i.e. cannot change once it is published). If it is static, caching it should help as well. Even if it is "appendable" but we cannot stream it, maybe there is a way to store a hash or mtime of the file and only pull it if it changed? But even in this case I guess we do not need to pull the whole file into memory - maybe we would need to update the HDFS hook for that, but I can't imagine a case where we have to load the whole file into memory in order to pull it from remote.

> I believe we can do this as it could reduce memory-usage and network congestion and we are sending metadata back-and-forth anyways might as well send the log positions of a few more files. What do you think?

Yeah. I think we could actually do both - save logs on disk and also have some limits, because it is not only memory but also networking + I/O + time that is saved this way. Plus, if we do some caching, we can also optimise cases where various people look at the same logs.

Overall I think it is all "doable", and even if we don't implement and handle all cases, it can be gradually optimized.

I'd love to hear also what @dstandish has to say about it :) if he has time to read all the discussion, that is :)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
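
To make the "sum file sizes first, then avoid joining in memory" idea from the comment concrete, here is a rough sketch. It is not Airflow code: `MAX_TOTAL_BYTES`, `total_size` and `merged_log_lines` are hypothetical names, and it assumes that every log file is already sorted and that every line starts with an ISO-8601 timestamp followed by a space.

```python
import heapq
import os
from typing import Iterator, Sequence

# Hypothetical hard limit for illustration; not an existing Airflow setting.
MAX_TOTAL_BYTES = 50 * 1024 * 1024


def total_size(paths: Sequence[str]) -> int:
    """Sum file sizes from filesystem metadata, without reading any content."""
    return sum(os.path.getsize(p) for p in paths)


def _lines(path: str) -> Iterator[str]:
    """Yield the lines of one log file lazily, one at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def _timestamp(line: str) -> str:
    """Sort key: the leading timestamp (assumed ISO-8601, so it sorts as text)."""
    return line.split(" ", 1)[0]


def merged_log_lines(
    paths: Sequence[str], limit: int = MAX_TOTAL_BYTES
) -> Iterator[str]:
    """Check the hard limit up front, then merge the files lazily.

    heapq.merge keeps only one pending line per input in memory, so the
    files are never concatenated and re-sorted as a whole.
    """
    size = total_size(paths)
    if size > limit:
        raise ValueError(f"logs total {size} bytes, limit is {limit}")
    return heapq.merge(*(_lines(p) for p in paths), key=_timestamp)
```

The size check runs before any file is opened, and the returned iterator can be written straight to a temp file or streamed to the client, which is the part that keeps webserver memory flat.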
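
The "store a hash or mtime and only pull it if it changed" idea could look roughly like the sketch below. Everything here is hypothetical: `VersionedLogCache`, `get_version` and `fetch` are illustrative names, not Airflow or hook APIs. `get_version` stands in for whatever cheap metadata call the remote store offers (an mtime, an ETag, a hash), and `fetch` for the expensive download. Time-based expiry and spilling to disk are left out to keep it short.

```python
from typing import Callable, Dict, Hashable, Tuple


class VersionedLogCache:
    """Re-fetch a remote log only when its version marker has changed."""

    def __init__(
        self,
        get_version: Callable[[str], Hashable],
        fetch: Callable[[str], str],
    ) -> None:
        self._get_version = get_version  # cheap: mtime, ETag, hash, ...
        self._fetch = fetch              # expensive: the actual download
        self._entries: Dict[str, Tuple[Hashable, str]] = {}

    def get(self, key: str) -> str:
        version = self._get_version(key)
        cached = self._entries.get(key)
        if cached is not None and cached[0] == version:
            # Unchanged since the last pull: serve the cached copy.
            return cached[1]
        content = self._fetch(key)
        self._entries[key] = (version, content)
        return content
```

One such cache per webserver process would be shared between users and sessions, which is where the isolation trade-off mentioned above comes from.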
