HeartSaVioR commented on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-646362234
Just to make clear that I did think through the idea @uncleGen proposed, I'll elaborate on why I didn't go ahead with an actual implementation.

While there's effectively NO deletion of entries in the file stream source/sink today, both definitely need "retention", so deleting entries had to be considered. Once we read the previous compact file and modify it, we defeat the good side of the idea (no read and no rewrite of the previous compact batch). There's a way to get both benefits: split the compact batch file per timestamp (the unit would be day, hour, etc., and the unit defines the granularity of retention) and delete compact batch files based on the retention. It's a bit complicated to implement, and the size of each compact batch file would be out of control. The idea may still be valid, but to make it work nicely it would need to be designed thoughtfully.

Compared to that idea, this patch is a no-brainer: it inherits the shortcomings of the current compact behavior but does the job much faster (~10x). Based on end users' intuition about the volume of outputs per day and a good retention value, they can roughly control the size of the compact batch file. (It still won't work if the volume is gigantic or they need to set a pretty high retention, but that's the point at which we would propose trying out alternatives. Currently it simply doesn't work, or makes end users struggle, even with a normal workload.)

Even with this patch & retention, the memory issue may still exist, as this patch doesn't help reduce memory usage on compact. The current file stream source/sink requires materializing all entries into memory during compact, and maybe also during get. The problem is more specific to the file stream sink, as its entries are much bigger, and even with retention a compact batch would have a bunch of entries. Addressing the issue on get is hard, but addressing it on compact would be relatively easier, and would help the file stream sink avoid OOME during the compact phase.
Next item to do.
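
For illustration only, here is a minimal sketch of how retention could be applied during compaction while streaming entries rather than materializing them all in memory. It is not the code in this PR and does not use any Spark API; `LogEntry`, `commit_time_ms`, and `compact_with_retention` are hypothetical names, and Python is used purely for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical metadata log entry; the real entries carry more fields."""
    path: str
    commit_time_ms: int  # assumed timestamp used to decide retention


def compact_with_retention(
    previous_compact: Iterable[LogEntry],
    new_batches: Iterable[Iterable[LogEntry]],
    now_ms: int,
    retention_ms: int,
) -> Iterator[LogEntry]:
    """Lazily yield the entries that belong in the next compact batch.

    Entries older than the retention cutoff are dropped. Because this is a
    generator, a caller can write each entry out as it arrives instead of
    holding the whole compact batch in memory.
    """
    cutoff = now_ms - retention_ms
    # Entries carried over from the previous compact batch.
    for entry in previous_compact:
        if entry.commit_time_ms >= cutoff:
            yield entry
    # Entries from the batches written since the previous compaction.
    for batch in new_batches:
        for entry in batch:
            if entry.commit_time_ms >= cutoff:
                yield entry


# Usage: consume the generator one entry at a time, e.g. while writing a file.
old_entries = [LogEntry("part-0000", 1_000), LogEntry("part-0001", 5_000)]
recent_batches = [[LogEntry("part-0002", 9_000)]]
for kept in compact_with_retention(old_entries, recent_batches, 10_000, 6_000):
    print(kept.path)
```

The sketch still reads every previous entry, so it shares the shortcoming of the current compact behavior noted above; it only shows that filtering by retention and avoiding full materialization can be done in the same pass.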
