HeartSaVioR edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666770668


   > for exactly-once semantics, we can make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.
   
   I already provided that change (in Spark 3.0.0, if I remember correctly), but it is only "best-effort": it cannot deal with crash scenarios.
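   
   For context, that best-effort cleanup boils down to remembering which files a task wrote directly to the final location and deleting them when the task is aborted. A minimal sketch of the idea (this is not the actual ManifestFileCommitProtocol code; the class and method names are invented for illustration):
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   import scala.collection.mutable.ArrayBuffer
   
   // Illustrative only: tracks files written directly to the final output path
   // and deletes them on abort. If the process crashes before abort() runs,
   // the partially written files stay behind - hence "best-effort".
   class BestEffortDirectWriteCommitter(outputDir: String, hadoopConf: Configuration) {
     private val fs: FileSystem = new Path(outputDir).getFileSystem(hadoopConf)
     private val addedFiles = ArrayBuffer.empty[Path]
   
     def newOutputFile(name: String): Path = {
       val path = new Path(outputDir, name)
       addedFiles += path // remember it so abort() can clean it up later
       path
     }
   
     // On success, these entries would be recorded in the sink's metadata log.
     def commit(): Seq[Path] = addedFiles.toSeq
   
     def abort(): Unit = {
       // Best-effort: never executed if the task/executor crashes first.
       addedFiles.foreach(p => fs.delete(p, /* recursive = */ false))
       addedFiles.clear()
     }
   }
   ```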
   
   Given that we write files directly to the final path, it's worth noting that reading the output files straight from the directory doesn't only mean you may be reading duplicated outputs (at-least-once). It also means there's a chance you may be reading incomplete/corrupted files, e.g. when a crash happens in the middle of a write.
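   
   Concretely, the difference is whether the reader honors the sink's metadata log (the _spark_metadata directory) or lists the output directory by itself. A short sketch, assuming a Parquet file sink and a made-up output path:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().appName("read-file-sink-output").getOrCreate()
   
   // Reading the sink's output directory through Spark: when a _spark_metadata
   // log is present, only files recorded as committed are picked up.
   val committedOnly = spark.read.parquet("/path/to/file/sink/output")
   
   // By contrast, listing the directory yourself (Hadoop FileSystem, distcp, or
   // another engine that ignores _spark_metadata) can surface both duplicated
   // outputs and partially written files left behind by crashed tasks.
   ```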
   
   Stepping back to explain the rationale and the goal: providing a holistic solution is not the goal.
   
   There are already lots of efforts to provide a holistic solution, though these efforts are happening "outside" of the Spark codebase: Delta Lake, Apache Iceberg, Apache Hudi, and probably more. I'd just be reinventing the wheel if I tried to address the entire problem, and I can't persuade anyone to give me enough time to work on that.
   (Please refer to the comment with the feedback I got: https://github.com/apache/spark/pull/27694#issuecomment-651454246)
   
   My goal for the overall improvements in the file stream source/sink is to let end users run their queries a bit longer without weird issues (like OOM). I just had to take an easier (and limited) approach to solve each issue. For example, regarding the growing-entries issue: while the alternatives support data compaction to reduce the overall number of files without losing anything, I proposed retention on the output files, so that older files can be excluded from the metadata log, at the cost of no longer being accessible (see the sketch below). End users need to pick up one of the alternatives once they cannot live with these limitations.
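   
   To make that trade-off concrete, here is a conceptual sketch of the retention idea applied at metadata log compaction time. The class, function, and retention value below are invented for illustration; they are not the actual Spark internals or configuration:
   
   ```scala
   import java.util.concurrent.TimeUnit
   
   // Conceptual sketch: when compacting the sink metadata log, drop entries
   // older than the retention window so the compacted batch stops growing,
   // at the cost of those files no longer being visible to readers that
   // honor the metadata log.
   case class SinkFileEntry(path: String, size: Long, modificationTime: Long)
   
   def compactWithRetention(
       entries: Seq[SinkFileEntry],
       retentionMs: Long,
       nowMs: Long): Seq[SinkFileEntry] = {
     entries.filter(e => nowMs - e.modificationTime <= retentionMs)
   }
   
   val now = System.currentTimeMillis()
   val entries = Seq(
     SinkFileEntry("part-00000", 1024L, now - TimeUnit.DAYS.toMillis(30)), // dropped
     SinkFileEntry("part-00001", 2048L, now - TimeUnit.HOURS.toMillis(1))  // kept
   )
   // Keep only entries written within the last 7 days.
   val retained = compactWithRetention(entries, TimeUnit.DAYS.toMillis(7), now)
   ```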

