HeartSaVioR edited a comment on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-558976957
 
 
   > IMHO, the core problem is that the compact metadata log grows bigger and 
bigger, and compacting it is time-consuming, because Spark has to 
read the old compact log file and then write a new compact log file.
   
   I agree with you that the problem is that the compact metadata log just keeps 
growing in most cases, though the time spent building the metadata log is only 
one of several major issues. The other major issue is that the cost of reading 
the metadata log won't decrease unless we optimize the file format or get rid of 
entities, as this patch proposes.
   
   One thing we have to consider is that when the `compact` phase happens, Spark 
is able to get rid of some existing entities; that's the feature this patch 
leverages. That requires a full read and rewrite of the entities in each 
compact phase, which is why we can't simply combine two compact files.
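
   To illustrate why each compact phase needs a full read and rewrite, here is a 
minimal sketch in Python (not Spark's actual code). `SinkEntry`, `compact`, and 
the retention predicate are hypothetical names for illustration; the only point 
is that a snapshot-style compaction has to touch every existing entity before it 
can drop any of them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SinkEntry:
    """Hypothetical stand-in for one entity in the sink metadata log."""
    path: str
    modification_time_ms: int


def compact(
    previous_snapshot: Iterable[SinkEntry],
    delta_batches: Iterable[Iterable[SinkEntry]],
    keep: Callable[[SinkEntry], bool],
) -> List[SinkEntry]:
    """Build a new snapshot: read everything, then rewrite what survives."""
    # Full read: the previous compact file plus every batch written since.
    merged = list(previous_snapshot)
    for batch in delta_batches:
        merged.extend(batch)
    # Full rewrite: entities can only be dropped while producing the new snapshot.
    return [entry for entry in merged if keep(entry)]


old = [SinkEntry("part-0", 1_000), SinkEntry("part-1", 5_000)]
deltas = [[SinkEntry("part-2", 9_000)], [SinkEntry("part-3", 10_000)]]
now_ms, retention_ms = 10_000, 6_000  # assumed values for the example
snapshot = compact(
    old, deltas, lambda e: now_ms - e.modification_time_ms <= retention_ms
)
print([e.path for e in snapshot])
```

   The cost of `compact` is proportional to the size of the previous snapshot, 
so it grows with the log unless the predicate actually removes entities.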
   
   It looks like `CompactibleFileStreamLog` was introduced to avoid the "small 
files problem". It seems possible to tweak the approach a bit to maintain a 
"ranged delta" (say, a compacted delta covering a range of batches), which might 
be closer to what you proposed. That would no longer be a "snapshot", and it 
might lose the ability to get rid of entities (or do so inefficiently), but in 
most cases entities are not removed, so it also makes sense to me. I expect the 
logic to be more complicated than the current one, but that might be acceptable 
given how badly the issue has been affecting end users.
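
   A rough sketch of the "ranged delta" idea in Python, under my own assumptions 
(`compact_range` and `read_all` are hypothetical helpers, and entities are 
plain strings here). Each compact file only merges the batches in its own range, 
so writing it never re-reads older compact files, but a reader has to walk all 
ranges, and removing an entity would mean rewriting the range that holds it.

```python
from typing import Dict, List, Tuple

BatchRange = Tuple[int, int]  # inclusive (start_batch_id, end_batch_id)


def compact_range(batches: Dict[int, List[str]], start: int, end: int) -> List[str]:
    """Merge only the batches in [start, end]; older ranges are left untouched."""
    merged: List[str] = []
    for batch_id in range(start, end + 1):
        merged.extend(batches.get(batch_id, []))
    return merged


def read_all(ranged_files: Dict[BatchRange, List[str]]) -> List[str]:
    """A reader must visit every ranged delta, in batch order."""
    entries: List[str] = []
    for batch_range in sorted(ranged_files):
        entries.extend(ranged_files[batch_range])
    return entries


batches = {0: ["part-0"], 1: ["part-1"], 2: ["part-2"], 3: ["part-3"]}
ranged_files = {
    (0, 1): compact_range(batches, 0, 1),
    (2, 3): compact_range(batches, 2, 3),
}
all_entries = read_all(ranged_files)
print(all_entries)
```

   Write cost per compaction is bounded by the range size, while read cost still 
covers the whole log.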

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
