potiuk commented on issue #31105:
URL: https://github.com/apache/airflow/issues/31105#issuecomment-1605911398

   > We can assign hard limits and that would solve this current issue. We can just sum the file sizes before loading any of them (or most of them - I do not know if we can get all of their sizes) into memory.
   
   Or count rows if we can't get the sizes - that should be easier.
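   A minimal sketch of the hard-limit idea (the limit value and helper name are illustrative, not an existing Airflow config):

```python
import os

# Hypothetical hard limit; a real value would come from an Airflow config option.
MAX_TOTAL_LOG_BYTES = 50 * 1024 * 1024  # 50 MiB

def within_log_limit(paths, max_bytes=MAX_TOTAL_LOG_BYTES):
    """Sum the sizes of log files *before* reading any of them.

    Returns (ok, total_bytes) so the caller can refuse to load the logs
    and show a message instead when the combined size exceeds the limit.
    """
    total = 0
    for path in paths:
        try:
            total += os.path.getsize(path)
        except OSError:
            # Size unknown (e.g. a remote stream); a fallback could count rows.
            continue
        if total > max_bytes:
            return False, total
    return True, total
```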
   
   
   > But, I believe even for file sizes smaller than that we need not load the 
whole file into memory and sort for every auto-tailing call.
   
   Correct - not having to join the files in memory will definitely decrease the memory requirements of the webserver, even for smaller files.
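   One way to avoid the in-memory join is to merge the already-sorted per-stream log lines lazily and spool the result to a temp file. A sketch assuming each line starts with a sortable timestamp prefix (the function names are illustrative, not Airflow's actual API):

```python
import heapq
import tempfile

def merge_sorted_streams(streams):
    """Lazily merge line iterators that are each sorted by timestamp prefix.

    heapq.merge never materializes its inputs, so peak memory stays at
    roughly one line per stream instead of the concatenation of all files.
    """
    # Lines like "2023-06-25T10:00:01 ..." sort correctly lexicographically.
    return heapq.merge(*streams)

def spool_merged(streams, max_in_memory=1024 * 1024):
    """Write the merged output to a SpooledTemporaryFile: it stays in
    memory while small and transparently spills to disk when it grows."""
    out = tempfile.SpooledTemporaryFile(mode="w+", max_size=max_in_memory)
    for line in merge_sorted_streams(streams):
        out.write(line)
    out.seek(0)
    return out
```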
   
   > We can do this. Now, this will be in sorted order however if the task is 
still running new logs could have come in the time we sent our response. Even 
for this, we need to maintain a log position for different log streams and call 
the reading methods with appropriate metadata to update this temp file. 
   
   Agree. I think the log position in the metadata will be a bit tricky in this case and should be "per stream"/"per user". There will be cases where someone just auto-tails the logs (in which case it is fine to keep the returned data in memory), but when someone reads a log from the beginning, the size might still be substantial (so keeping it in a file makes sense).
   
   And we should likely have a separate path for cases like S3 if we want to optimize further - for remotes where the log never streams, because we can either get a full log file or nothing (S3 is object storage, not file storage, so we will not even see a log until it is complete). In this case we should not worry about "tailing" the log. We could also use some caching (at the expense of isolation) so that the log from S3 is downloaded only once per "webserver" and kept for some time (and reused between users/sessions), so that if a few people look at the same task log or hit the refresh button, the "remote" reading of that file does not happen over and over again.
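   A sketch of such a per-webserver cache, assuming a simple TTL policy (class name and interface are illustrative only):

```python
import time

class RemoteLogCache:
    """Per-webserver TTL cache so the same remote (e.g. S3) log is
    downloaded once and reused across users/sessions for a short time.

    Shared across users by design, which trades some isolation for far
    fewer remote reads when several people view the same task log.
    """

    def __init__(self, fetch, ttl_seconds=60):
        self._fetch = fetch          # callable: key -> log text
        self._ttl = ttl_seconds
        self._entries = {}           # key -> (expires_at, text)

    def get(self, key):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry and entry[0] > now:
            return entry[1]          # cache hit: no remote read
        text = self._fetch(key)
        self._entries[key] = (now + self._ttl, text)
        return text
```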
   
   > And for some of the methods we still need to load the whole file into memory, like the HDFS one. For them we could filter out after loading into memory.
   
   Hmm. Not sure how HDFS works in this case - i.e. how much it is "streaming" vs. "static" (cannot change once it is published) - if it is static, caching should help as well. Even if it is "appendable" but we cannot stream it, maybe there is a way to store a hash or mtime of the file and only pull it when it has changed? But even then, I guess we do not need to pull the whole file into memory - maybe we would need to update the HDFS hook for that, but I can't imagine a case where we have to load the whole file into memory in order to pull it from the remote.
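   The hash/mtime idea could be sketched like this - `stat` and `download` are hypothetical callables standing in for a real HDFS client's status and read operations:

```python
class ChangeAwareFetcher:
    """Only re-download a remote file when its mtime (or content hash)
    has changed since the last poll; otherwise return the cached copy.

    `stat` and `download` are placeholders for e.g. an HDFS client's
    file-status and open/read calls.
    """

    def __init__(self, stat, download):
        self._stat = stat            # key -> mtime (or content hash)
        self._download = download    # key -> bytes
        self._seen = {}              # key -> (mtime, data)

    def fetch(self, key):
        mtime = self._stat(key)
        cached = self._seen.get(key)
        if cached and cached[0] == mtime:
            return cached[1], False  # (data, freshly_downloaded)
        data = self._download(key)
        self._seen[key] = (mtime, data)
        return data, True
```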
   
   > I believe we can do this, as it could reduce memory usage and network congestion, and since we are sending metadata back and forth anyway, we might as well send the log positions of a few more files. What do you think?
   
   Yeah. I think we could actually do both - save logs to disk and also have some limits - because that saves not only memory but also networking, I/O, and time. Plus, if we do some caching, we can also optimise cases where various people look at the same logs.
   
   Overall I think it is all "doable" and even if we don't implement and handle 
all cases, it can be gradually optimized.
   
   I'd also love to hear what @dstandish has to say about it :) - if he has time to read the whole discussion, that is :)
   