uncleGen edited a comment on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-558548047
 
 
   IMHO, the core problem is that the compact metadata log keeps growing, so 
compacting it takes longer and longer. So why not limit the size of the compact 
metadata log? I mean we can split the compact log into multiple log files and 
write only a limited number of sink file paths into any single compact log 
file. Then, every 10 batches, we only need to compact the delta logs with the 
latest compact log file. 
   
   ```
   batch N:
   
   10 11 12.compact
   
   batch N+10:
   
   10 11 12.compact 13 14 ... 22.compact.1 22.compact
   
   batch N+20:
   
   10 11 12.compact 13 14 ... 22.compact.1 22.compact 23 24 ... 32.compact.1 32.compact
   
   batch N+30:
   
   10 11 12.compact 13 14 ... 22.compact.1 22.compact 23 24 ... 32.compact.1 32.compact 33 34 ... 42.compact.2 42.compact.1 42.compact
   ```
    
   In `batch N+10`, we rename `12.compact` to `22.compact` and compact the delta 
logs (13, 14 ... 21) into `22.compact`. If the size of `22.compact` is beyond the 
limit, we rename `22.compact` to `22.compact.1` and then create a new 
`22.compact`. The cost of `rename` is low, or at least not higher than writing a 
new file, and we only need to compact the delta logs with the latest compact log 
file. The biggest gain is that the time cost of compaction is predictable and 
does not scale linearly as the metadata log grows.
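
   To make the bounded-cost argument concrete, here is a rough in-memory sketch 
in Python. It is not Spark code: the class `SegmentedCompactLog`, the helpers 
`simulate` and `single_file_costs`, the entry names, and the numbering of 
rolled-over files (`.1` for the oldest) are all assumptions made for 
illustration. Cost is approximated by the number of entries rewritten per 
compaction, and renames are treated as free.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SegmentedCompactLog:
    """In-memory model of a size-limited compact log (hypothetical, not Spark code)."""
    limit: int                                              # max entries in the active compact file
    sealed: List[List[str]] = field(default_factory=list)   # rolled-over segments, oldest first
    active: List[str] = field(default_factory=list)         # contents of the latest `<id>.compact`

    def compact(self, delta_entries: List[str]) -> int:
        """Merge delta entries into the active segment; return the entries rewritten."""
        self.active = self.active + delta_entries
        rewritten = len(self.active)
        if rewritten > self.limit:
            # Roll over: the full file is renamed to `<id>.compact.<k>` and a
            # new, empty `<id>.compact` is started.
            self.sealed.append(self.active)
            self.active = []
        return rewritten

    def file_names(self, batch_id: int) -> List[str]:
        """File names at a compaction batch (assumed numbering: `.1` is the oldest)."""
        names = [f"{batch_id}.compact.{i + 1}" for i in range(len(self.sealed))]
        return names + [f"{batch_id}.compact"]

    def all_entries(self) -> List[str]:
        """Every sink file path recorded so far, across all segments."""
        return [e for seg in self.sealed for e in seg] + self.active


def simulate(num_compactions: int, deltas_per_compaction: int,
             limit: int) -> Tuple[SegmentedCompactLog, List[int]]:
    """Run repeated compactions and record the rewrite cost of each one."""
    log = SegmentedCompactLog(limit=limit)
    costs = []
    for c in range(num_compactions):
        deltas = [f"part-{c}-{i}" for i in range(deltas_per_compaction)]
        costs.append(log.compact(deltas))
    return log, costs


def single_file_costs(num_compactions: int, deltas_per_compaction: int) -> List[int]:
    """Baseline: one ever-growing compact file that is rewritten in full each time."""
    return [deltas_per_compaction * (c + 1) for c in range(num_compactions)]


if __name__ == "__main__":
    log, costs = simulate(num_compactions=50, deltas_per_compaction=9, limit=20)
    print("segmented max cost:", max(costs))
    print("single-file last cost:", single_file_costs(50, 9)[-1])
    print("files at batch 42:", log.file_names(42))
```

   Under these assumptions the per-compaction cost never exceeds the limit plus 
the number of new delta entries, while the single-file baseline grows with the 
total number of entries. A reader, however, has to visit every segment to list 
all output files.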
   
   But this may break compatibility when an old version of Spark is used to read 
the stream sink files. These are just my rough ideas; please advise if there is 
any mistake.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
