attilapiros commented on a change in pull request #33628:
URL: https://github.com/apache/spark/pull/33628#discussion_r771466221
##########
File path:
core/src/test/scala/org/apache/spark/storage/DiskBlockObjectWriterSuite.scala
##########
@@ -184,4 +184,20 @@ class DiskBlockObjectWriterSuite extends SparkFunSuite with BeforeAndAfterEach {
     writer.close()
     assert(segment.length === 0)
   }
+
+  test("calling closeAndDelete() on a partial write file") {
+    val (writer, file, writeMetrics) = createWriter()
+
+    writer.write(Long.box(20), Long.box(30))
+    val firstSegment = writer.commitAndGet()
+    assert(firstSegment.length === file.length())
+    assert(writeMetrics.bytesWritten === file.length())
+
+    writer.write(Long.box(40), Long.box(50))
+
+    writer.closeAndDelete()
+    assert(!file.exists())
+    assert(writeMetrics.bytesWritten === firstSegment.length)
Review comment:
> Like [9db7115](https://github.com/apache/spark/commit/9db7115fc980c80ee517f46e6844b39d76c93559) changed?

Not exactly. I would keep track of the committed records, not the total
written records.
When the committed records are counted, you do not need to increase the
new var every time a record is written, only once per commit, when a whole
batch of records is committed.
So this line is not needed:
https://github.com/apache/spark/blob/9db7115fc980c80ee517f46e6844b39d76c93559/core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala#L329
But you need to increase the new var after line 232, before the reset of
`numRecordsWritten`:
https://github.com/apache/spark/blob/9db7115fc980c80ee517f46e6844b39d76c93559/core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala#L231-L233
This way, when the file is removed, you can decrease the metric by the sum of
the new var and `numRecordsWritten` at:
https://github.com/apache/spark/blob/9db7115fc980c80ee517f46e6844b39d76c93559/core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala#L291
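
To make the suggestion concrete, here is a minimal sketch of the accounting I have in mind. It is not the real `DiskBlockObjectWriter`: the names `RecordMetrics`, `SketchWriter` and `numRecordsCommitted` are hypothetical, and file I/O is left out entirely, so treat it only as an illustration of where the new var would be increased and where the metric would be decreased.

```scala
// Hypothetical stand-in for the write metrics; only the record count is modeled.
final class RecordMetrics {
  private var records = 0L
  def incRecordsWritten(n: Long): Unit = records += n
  def decRecordsWritten(n: Long): Unit = records -= n
  def recordsWritten: Long = records
}

// Simplified model of the writer (assumption: no streams, no file, no bytes).
final class SketchWriter(metrics: RecordMetrics) {
  // Records written since the last commit, as in the existing `numRecordsWritten`.
  private var numRecordsWritten = 0L
  // Hypothetical new var: records that belong to already committed segments.
  private var numRecordsCommitted = 0L

  def write(): Unit = {
    numRecordsWritten += 1
    metrics.incRecordsWritten(1)
  }

  def commit(): Unit = {
    // Increase the new var once per commit, before resetting numRecordsWritten.
    numRecordsCommitted += numRecordsWritten
    numRecordsWritten = 0
  }

  def closeAndDelete(): Unit = {
    // The whole file goes away, so both committed and uncommitted records
    // are removed from the metric.
    metrics.decRecordsWritten(numRecordsCommitted + numRecordsWritten)
    numRecordsCommitted = 0
    numRecordsWritten = 0
  }
}
```

With this model, writing two records, committing, writing one more and then calling `closeAndDelete()` brings the reported record count back to zero, without touching the new var on the per-record write path.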