HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r452217417
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was
$defaultCompactInterval) " +
"to a positive value.")
- override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
- val deletedFiles = logs.filter(_.action ==
FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
- if (deletedFiles.isEmpty) {
- logs
- } else {
- logs.filter(f => !deletedFiles.contains(f.path))
- }
- }
+ override def shouldRetain(log: SinkFileStatus): Boolean = true
Review comment:
Please note that after the patch we are no longer able to match
ADD_ACTION and DELETE_ACTION in compaction, so that's the responsibility of
"reader" to exclude entities which have both ADD_ACTION and DELETE_ACTION, even
we take DELETE_ACTION into account.
Returning `false` for DELETE_ACTION here simply drops the entities for
DELETE_ACTION without dropping ADD_ACTION, hence some files would be
resurrected on the reader side after compaction, making problems. Even
considering the behavior, compact batch should remove both entities for
ADD_ACTION and DELETE_ACTION (probably with a new ADD_ACTION on new compacted
file under data compaction), not only DELETE_ACTION.
DELETE_ACTION is introduced years ago and we still haven't touched it. I
think we should just defer and shouldn't deal with the thing we don't have and
also don't have a clear plan to address. Once we design the new behavior, we
would know what we should do. Dealing with something according to the
imagination blocks us to move forward.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]