uncleGen edited a comment on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-558548047

IMHO, the core problem is that the compact metadata log grows bigger and bigger, and compacting it becomes increasingly time-consuming. So why not limit the size of the compact metadata log? I mean we can split the compact log into multiple log files, writing only a limited number of sink file paths into each single compact log. Then, every 10 batches, we just compact the delta logs with the latest compact log:

```
batch N:    10 11 12.compact
batch N+10: 10 11 12.compact 13 14 ... 22.compact.1 22.compact
batch N+20: 10 11 12.compact 13 14 ... 22.compact.1 22.compact 23 24 ... 32.compact.1 32.compact
batch N+30: 10 11 12.compact 13 14 ... 22.compact.1 22.compact 23 24 ... 32.compact.1 32.compact 33 34 ... 42.compact.2 42.compact.1 42.compact
```

At `batch N+10`, we rename `12.compact` to `22.compact` and compact the delta logs (13, 14, ... 21) into `22.compact`. If the size of `22.compact` exceeds the limit, we rename `22.compact` to `22.compact.1` and then create a new `22.compact`. The cost of `rename` is low, or at least no higher than creating a new file, and we only ever need to compact the delta logs with the latest compact log file. The biggest gain is that the time cost of compacting becomes predictable and no longer scales linearly with the number of batches. But this may break compatibility when an old version of Spark reads the stream sink files.

These are just my rough ideas; please advise if there are any mistakes.
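To make the rename-then-overflow step above concrete, here is a minimal sketch (plain Python, not Spark code) of one compaction cycle of the proposed rolling scheme. All names are hypothetical: the log is modeled as a dict of filename to list of sink file paths, and the "size limit" is expressed as a maximum entry count per compact file rather than bytes, purely for illustration.

```python
def compact_step(log, latest_compact, deltas, new_batch, max_entries):
    """One compaction cycle of the proposed rolling-compact scheme (sketch).

    log            : dict mapping filename -> list of sink file paths
    latest_compact : name of the current newest compact file, e.g. "12.compact"
    deltas         : entries from the delta logs written since the last compaction
    new_batch      : id of the compacting batch, e.g. 22
    max_entries    : hypothetical size limit per compact file (entry count here)
    """
    # Rename the latest compact file (e.g. 12.compact -> 22.compact) and merge
    # the pending delta-log entries into it.
    entries = log.pop(latest_compact) + list(deltas)

    # If the merged file exceeds the limit, push full chunks out into numbered
    # overflow files (22.compact.1, 22.compact.2, ...) and keep only the tail
    # in the new 22.compact, mirroring the "rename 22.compact to 22.compact.1
    # and create a new 22.compact" step described above.
    level = 0
    while len(entries) > max_entries:
        level += 1
        log[f"{new_batch}.compact.{level}"] = entries[:max_entries]
        entries = entries[max_entries:]

    log[f"{new_batch}.compact"] = entries
    return f"{new_batch}.compact"


# Example: 8 paths already compacted in 12.compact, 7 new delta entries,
# and a limit of 10 entries per compact file.
log = {"12.compact": [f"f{i}" for i in range(8)]}
deltas = [f"g{i}" for i in range(7)]
latest = compact_step(log, "12.compact", deltas, 22, max_entries=10)
```

After this step only the latest compact file ever needs to be read for the next compaction, which is where the predictable-cost claim comes from: each cycle touches at most `max_entries` carried-over entries plus the new deltas, regardless of how many batches have run.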
