HeartSaVioR commented on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-646362234


   To make clear that I did consider the idea @uncleGen proposed, I'll 
elaborate on why I didn't go ahead with an actual implementation.
   
   While file stream source/sink effectively never delete entries today, both 
definitely need "retention", so deleting entries had to be considered. Once we 
have to read the previous compact file and modify it, we lose the benefit of 
the idea (no read and no rewrite of the previous compact batch).
   
   There's a way to get both benefits: split the compact batch file per time 
unit (day, hour, etc.; the unit defines the granularity of retention) and 
delete whole compact batch files based on the retention. It's a bit complicated 
to implement, and the size of each compact batch file would be out of our 
control.
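
   A rough sketch of that time-bucketed layout, written in Python for brevity 
rather than Spark's Scala. The names `bucket_key`, `add_entry` and 
`expire_buckets` are hypothetical, and an in-memory dict stands in for the 
per-unit compact files; none of this is existing Spark code.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000


def bucket_key(timestamp_ms, unit_ms=HOUR_MS):
    """Start (in ms) of the time bucket that a timestamp falls into."""
    return timestamp_ms - (timestamp_ms % unit_ms)


def add_entry(buckets, path, timestamp_ms, unit_ms=HOUR_MS):
    """Append an entry to its own bucket; other buckets are never touched."""
    buckets[bucket_key(timestamp_ms, unit_ms)].append((path, timestamp_ms))


def expire_buckets(buckets, now_ms, retention_ms, unit_ms=HOUR_MS):
    """Drop whole buckets whose newest possible entry is past retention."""
    cutoff_ms = now_ms - retention_ms
    expired = [k for k in buckets if k + unit_ms <= cutoff_ms]
    for k in expired:
        # Stands in for deleting one compact file without reading it.
        del buckets[k]
    return sorted(expired)
```

   Here `buckets` is expected to be a `defaultdict(list)`. A bucket is dropped 
only once its whole time range is older than the retention, so nothing is read 
or rewritten, but the number of entries per bucket depends entirely on how much 
output landed in that time unit.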
   
   The idea may still be valid, but to make it work nicely it would need to be 
designed carefully.
   
   Compared to that idea, this patch is a no-brainer: it inherits the 
shortcomings of the current compact behavior but does the job much faster 
(~10x). Based on their sense of the daily output volume and a reasonable 
retention value, end users can roughly control the size of the compact batch 
file.
   (It still won't work if the volume is gigantic or they need a very high 
retention, but that's the point where we'd propose trying out alternatives. 
Currently it simply doesn't work, or leaves end users struggling even with a 
normal workload.)
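
   As a back-of-the-envelope illustration of that sizing, here is a small 
Python sketch. The function names are hypothetical, and it assumes a steady 
daily volume and a roughly constant serialized size per entry.

```python
def estimate_compact_entries(files_per_day, retention_days):
    """Upper bound on entries kept in a compact batch at steady state."""
    return files_per_day * retention_days


def estimate_compact_bytes(files_per_day, retention_days, bytes_per_entry):
    """Rough compact batch file size, given an average entry size."""
    return estimate_compact_entries(files_per_day, retention_days) * bytes_per_entry
```

   With made-up figures of 100,000 output files per day, a 7-day retention and 
200 bytes per entry, this gives 700,000 entries and about 140 MB, so halving 
the retention roughly halves the compact batch file.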
   
   Even with this patch & retention, memory issues may still exist, as this 
patch doesn't help reduce memory usage during compact. The current file stream 
source/sink has to materialize all entries in memory during compact, and maybe 
also during get. The problem is more specific to file stream sink, as its 
entries are much bigger, and even with retention a compact batch would have a 
bunch of entries. Addressing the issue on get is hard, but addressing it on 
compact would be relatively easier, and would help file stream sink avoid OOME 
during the compact phase. That's the next item to do.




