Nicolapps opened a new pull request, #1746:
URL: https://github.com/apache/jackrabbit-oak/pull/1746
This pull request modifies the `SegmentWriter` interface in oak-segment-tar
to add the possibility of reading the state of a segment currently being
written to, as described in OAK-11154.
Closes OAK-11154
## Why?
oak-segment-tar writes new segments using an implementation of
`SegmentWriter`.
Since segments are immutable, the state of a segment that hasn’t been
flushed yet isn’t visible outside of the `SegmentWriter` instance. However, in
some cases, code using `SegmentWriter` might want to access the partial segment
data.
Currently, the only way to do this is to call `flush`, which forces the
segment to be flushed right away, and then to read the full segment from the
underlying segment store. This is bad for performance, because we need to do
more flushes than necessary, and because there’s a risk of creating a lot of
segments that are much smaller than `MAX_SEGMENT_SIZE`.
To avoid this, I suggest that we add a `readPartialSegmentState` method to
`SegmentWriter`, which takes the segment ID of an unflushed segment and returns
that segment’s current partial state if possible.
## Backwards-compatibility
This change is backwards-compatible with existing users of `SegmentWriter`
(because they’re not using the new method). Existing implementations of the
interface also keep compiling, because the new method comes with a default
implementation which throws an `UnsupportedOperationException`.
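As a rough illustration of the compatibility argument, here is a minimal sketch using simplified stand-in types. `SketchSegmentWriter`, `LegacyWriter`, the `String` segment ID and the `Optional<byte[]>` return type are all hypothetical and are not the actual Oak signatures; only the method name and the `UnsupportedOperationException` default come from this PR.

```java
import java.util.Optional;

// Hypothetical, simplified stand-in for the SegmentWriter interface (not the real Oak API).
interface SketchSegmentWriter {
    void flush();

    // New method with a default body: implementations written before it existed
    // still compile, and callers that invoke it on them get a clear failure.
    default Optional<byte[]> readPartialSegmentState(String segmentId) {
        throw new UnsupportedOperationException(
                "This writer does not expose partial segment state");
    }
}

// An implementation that predates the new method and only implements flush().
class LegacyWriter implements SketchSegmentWriter {
    int flushCount = 0;

    @Override
    public void flush() {
        flushCount++;
    }
}
```

With this shape, code that only ever calls `flush` on a `LegacyWriter` behaves exactly as before; only a caller that opts into the new method has to handle the exception.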
## Concurrency
Previously, the class was marked as *not thread-safe*, which made sense
because only a single writer thread was expected to use it at any given time
(concurrent calls wouldn’t have made sense, since the order in which `prepare`
and the `writeXYZ` methods are called matters).
One major change with `SegmentBufferWriter` is that its
`readPartialSegmentState` method can now be called concurrently with the other
methods in the same class. To support this, the publicly accessible methods
are now `synchronized`. This shouldn’t cause a drop in performance, because
most calls to the class come from the writer thread (so they don’t contend
with each other), and `readPartialSegmentState` is expected to be called
rarely compared to the other methods.
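The locking scheme described above can be sketched as follows. This is a toy model under stated assumptions, not the real `SegmentBufferWriter`: the class `SketchBufferWriter`, its `writeBytes` and `currentSegmentId` methods, the `String` segment IDs and the byte-array snapshot are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Optional;

// Toy model of a buffer writer whose public methods share one intrinsic lock.
class SketchBufferWriter {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int generation = 0;
    private String currentSegmentId = "segment-0";

    // Called by the single writer thread; normally uncontended.
    public synchronized void writeBytes(byte[] data) {
        buffer.write(data, 0, data.length);
    }

    // Discards the buffered data (a real writer would persist it) and starts a new segment.
    public synchronized void flush() {
        buffer.reset();
        generation++;
        currentSegmentId = "segment-" + generation;
    }

    public synchronized String currentSegmentId() {
        return currentSegmentId;
    }

    // May be called from another thread. Returns a copy of the unflushed bytes,
    // or empty if the requested segment is not the one currently being written.
    public synchronized Optional<byte[]> readPartialSegmentState(String segmentId) {
        if (!currentSegmentId.equals(segmentId)) {
            return Optional.empty();
        }
        return Optional.of(buffer.toByteArray());
    }
}
```

Because the reader gets a copy taken while holding the lock, it never observes a half-applied write, and the writer thread only waits when a read happens to be in progress.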
I confirmed this by running the write benchmarks with and without the change,
and observed no noticeable difference in performance:
**Without `synchronized`**
```
# ConcurrentWriteReadTest     C     min     10%     50%     90%     max       N    mean
Oak-Segment-Tar               1       1       5      11      61     258    2535      24
# ConcurrentWriteTest         C     min     10%     50%     90%     max       N    mean
Oak-Segment-Tar               1      29      31      36      58     622    1373      44
# BasicWriteTest              C     min     10%     50%     90%     max       N    mean
Oak-Segment-Tar               1      14      15      16      19     320    3448      17
```
**With `synchronized`**
```
# ConcurrentWriteReadTest     C     min     10%     50%     90%     max       N    mean
Oak-Segment-Tar               1       1       4      12      58     553    2531      24
# ConcurrentWriteTest         C     min     10%     50%     90%     max       N    mean
Oak-Segment-Tar               1      29      31      35      65     461    1319      46
# BasicWriteTest              C     min     10%     50%     90%     max       N    mean
Oak-Segment-Tar               1      14      15      16      19      96    3444      17
```
## Testing
The PR adds a new test, `readPartialSegmentState`, which covers the
implementation of the method in `SegmentBufferWriter`.