HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r452217417



##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was 
$defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == 
FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
-  }
+  override def shouldRetain(log: SinkFileStatus): Boolean = true

Review comment:
       Please note that after the patch we are no longer able to match 
ADD_ACTION and DELETE_ACTION in compaction, so that's the responsibility of 
"reader" to exclude entities which have both ADD_ACTION and DELETE_ACTION, even 
we take DELETE_ACTION into account.
   
   Returning `false` for DELETE_ACTION here simply drops the entities for 
DELETE_ACTION without dropping ADD_ACTION, hence some files would be 
resurrected on the reader side after compaction, making problems. Even 
considering the behavior, compact batch should remove both entities for 
ADD_ACTION and DELETE_ACTION (probably with a new ADD_ACTION on new compacted 
file under data compaction), not only DELETE_ACTION.
   
   DELETE_ACTION is introduced years ago and we still haven't touched it. I 
think we should just defer and shouldn't deal with the thing we don't have and 
also don't have a clear plan to address. Once we design the new behavior, we 
would know what we should do. Dealing with something according to the 
imagination blocks us to move forward.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to