[PR] OAK-11154: Read partial segments from SegmentWriter [jackrabbit-oak]

via GitHub Fri, 27 Sep 2024 06:25:02 -0700


Nicolapps opened a new pull request, #1746:
URL: https://github.com/apache/jackrabbit-oak/pull/1746


   This pull request modifies the `SegmentWriter` interface in oak-segment-tar 
to add the possibility of reading the state of a segment currently being 
written to, as described in OAK-11154. 
   
   Closes OAK-11154
   
   ## Why?
   oak-segment-tar writes new segments using an implementation of 
`SegmentWriter`.
   
   Since segments are immutable, the state of a segment that hasn’t been 
flushed yet isn’t visible outside of the `SegmentWriter` instance. However, in 
some cases, code using `SegmentWriter` might want to access the partial segment 
data.
   
   Currently, the only way to do this is to call `flush`, which forces the 
segment to be flushed right away, and then to read the full segment from the 
underlying segment store. This hurts performance, because it requires more 
flushes than necessary and risks creating many segments that are much smaller 
than `MAX_SEGMENT_SIZE`.
   
   To avoid this, I suggest that we add a `readPartialSegmentState` method to 
`SegmentWriter`, which takes the segment ID of an unflushed segment and returns 
the current state of that segment if possible.
   
   ## Backwards-compatibility
   This change is backwards-compatible with existing users of `SegmentWriter` 
(because they’re not using the new method). The new method comes with a default 
implementation which throws an `UnsupportedOperationException`.
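
As a rough illustration of why a default method keeps existing implementations working, here is a minimal sketch. `SketchWriter`, `LegacyWriter`, the `String` segment ID and the `byte[]` return type are hypothetical simplifications made for this example; they are not the actual oak-segment-tar signatures.

```java
// Hypothetical, simplified stand-in for the writer interface (not Oak's real API).
interface SketchWriter {
    void flush();

    // New method with a default body: implementations written before it
    // existed still compile, and callers get a clear error if they use it
    // on a writer that does not support partial reads.
    default byte[] readPartialSegmentState(String segmentId) {
        throw new UnsupportedOperationException(
                "This writer does not support reading partial segment state");
    }
}

// An implementation that predates the new method and does not override it.
class LegacyWriter implements SketchWriter {
    @Override
    public void flush() {
        // Nothing to flush in this sketch.
    }
}
```

Under these assumptions, `new LegacyWriter().readPartialSegmentState("some-id")` compiles without changes to `LegacyWriter` and throws `UnsupportedOperationException` at call time.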
   
   ## Concurrency
   Previously, `SegmentBufferWriter` was marked as *not thread-safe*, which 
made sense since only a single writer thread was expected to use it at a time 
(concurrent calls wouldn’t have made sense, since the order in which the 
`prepare` and `writeXYZ` methods are called matters).
   
   One major change with `SegmentBufferWriter` is that its 
`readPartialSegmentState` method can now be called concurrently with the other 
methods in the same class. To support this, the publicly accessible methods are 
now `synchronized`. This shouldn’t cause a drop in performance, because most 
calls to the class come from the writer thread (so they never contend with each 
other), and `readPartialSegmentState` is expected to be called rarely compared 
to the other methods.
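
The locking scheme described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the real `SegmentBufferWriter`: the class name `SketchBufferWriter`, the `writeBytes` and `currentSegmentId` methods, the string segment IDs and the `null` return for an unknown segment are all invented for illustration.

```java
import java.io.ByteArrayOutputStream;

// Toy model of a buffer writer whose public methods share one monitor.
class SketchBufferWriter {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int generation = 0;
    private String currentSegmentId = "segment-0";

    // Called by the single writer thread.
    public synchronized void writeBytes(byte[] data) {
        buffer.write(data, 0, data.length);
    }

    // Called by the single writer thread: finishes the current segment
    // and starts a new, empty one with a fresh ID.
    public synchronized void flush() {
        buffer.reset();
        generation++;
        currentSegmentId = "segment-" + generation;
    }

    public synchronized String currentSegmentId() {
        return currentSegmentId;
    }

    // May be called from any thread. Holding the same monitor as the write
    // methods means the copy never observes a half-applied write.
    public synchronized byte[] readPartialSegmentState(String segmentId) {
        if (!currentSegmentId.equals(segmentId)) {
            return null; // not the segment currently being written
        }
        return buffer.toByteArray(); // defensive copy of the unflushed bytes
    }
}
```

In this sketch the writer thread is the only regular caller of `writeBytes` and `flush`, so the lock is normally uncontended; it only matters in the rare case where another thread calls `readPartialSegmentState` at the same moment.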
   
   I confirmed this by running the write benchmarks with and without the 
change; there is no noticeable difference in performance:
   
   **Without `synchronized`**
   ```
   # ConcurrentWriteReadTest          C     min     10%     50%     90%     max       N    mean
   Oak-Segment-Tar                    1       1       5      11      61     258    2535      24
   # ConcurrentWriteTest              C     min     10%     50%     90%     max       N    mean
   Oak-Segment-Tar                    1      29      31      36      58     622    1373      44
   # BasicWriteTest                   C     min     10%     50%     90%     max       N    mean
   Oak-Segment-Tar                    1      14      15      16      19     320    3448      17
   ```
   
   **With `synchronized`**
   ```
   # ConcurrentWriteReadTest          C     min     10%     50%     90%     max       N    mean
   Oak-Segment-Tar                    1       1       4      12      58     553    2531      24
   # ConcurrentWriteTest              C     min     10%     50%     90%     max       N    mean
   Oak-Segment-Tar                    1      29      31      35      65     461    1319      46
   # BasicWriteTest                   C     min     10%     50%     90%     max       N    mean
   Oak-Segment-Tar                    1      14      15      16      19      96    3444      17
   ```
   
   ## Testing
   The PR adds a new test, `readPartialSegmentState`, which covers the 
implementation of the method in `SegmentBufferWriter`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
