hlteoh37 commented on code in PR #26274:
URL: https://github.com/apache/flink/pull/26274#discussion_r2025060911
##########
flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/sink/writer/AsyncSinkWriter.java:
##########
@@ -101,19 +101,25 @@ public abstract class AsyncSinkWriter<InputT, RequestEntryT extends Serializable
     /**
      * Buffer to hold request entries that should be persisted into the destination, along with its
-     * size in bytes.
+     * total size in bytes.
      *
-     * <p>A request entry contain all relevant details to make a call to the destination. Eg, for
-     * Kinesis Data Streams a request entry contains the payload and partition key.
+     * <p>This buffer is managed using {@link BufferWrapper}, allowing sink implementations to
+     * define their own optimized buffering strategies. By default, {@link DequeBufferWrapper} is
+     * used.
      *
-     * <p>It seems more natural to buffer InputT, ie, the events that should be persisted, rather
-     * than RequestEntryT. However, in practice, the response of a failed request call can make it
-     * very hard, if not impossible, to reconstruct the original event. It is much easier, to just
-     * construct a new (retry) request entry from the response and add that back to the queue for
-     * later retry.
+     * <p>The buffer stores {@link RequestEntryWrapper} objects rather than raw {@link
Review Comment:
Let's keep the original explanation of why we chose `RequestEntryT` instead of `InputT` as the generic buffer type.
nit: The explanation of how `RequestEntryWrapper` is used in the `BufferWrapper` interface seems a bit out of place here, since the generic type we pass in here is already `RequestEntryWrapper`! Should we move it to the `BufferWrapper` class instead?
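For illustration, the relocated paragraph could sit on the interface itself. This is just a sketch; the actual `BufferWrapper` signature and wording in the PR may differ:
```
/**
 * Pluggable buffer for request entries in the async sink writer.
 *
 * <p>The buffer stores {@link RequestEntryWrapper} objects rather than raw
 * request entries, so that each buffered entry carries its size in bytes
 * alongside the payload.
 */
public interface BufferWrapper<RequestEntryT extends Serializable> {
    // ...
}
```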
##########
flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/sink/writer/AsyncSinkWriter.java:
##########
@@ -245,10 +260,17 @@ public AsyncSinkWriter(
         this.rateLimitingStrategy = configuration.getRateLimitingStrategy();
         this.requestTimeoutMS = configuration.getRequestTimeoutMS();
         this.failOnTimeout = configuration.isFailOnTimeout();
-
+        this.bufferedRequestEntries =
+                bufferedRequestEntries == null
+                        ? new DequeBufferWrapper.Builder<RequestEntryT>().build()
+                        : bufferedRequestEntries;
Review Comment:
nit: Should we just drop the `Wrapper` suffix? We could simply name it `Buffer` instead, or `RequestBuffer`?
##########
flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/sink/writer/AsyncSinkWriter.java:
##########
@@ -303,7 +325,7 @@ public void write(InputT element, Context context) throws IOException, Interrupt
     private void nonBlockingFlush() throws InterruptedException {
         while (!rateLimitingStrategy.shouldBlock(createRequestInfo())
                 && (bufferedRequestEntries.size() >= getNextBatchSizeLimit()
-                        || bufferedRequestEntriesTotalSizeInBytes >= maxBatchSizeInBytes)) {
+                        || bufferedRequestEntries.totalSizeInBytes() >= maxBatchSizeInBytes)) {
Review Comment:
It would be better to rename `maxBatchSizeInBytes` here to something more meaningful, like `flushThresholdBytes`.
Previously, `maxBatchSizeInBytes` was used in two places:
1/ Constructing the batch (preventing the constructed batch from exceeding X bytes). This logic is in `createNextAvailableBatch()`.
2/ Determining whether to flush. If the buffer holds records totalling more than `maxBatchSizeInBytes`, flush them. This logic is in `nonBlockingFlush()`.
Since we are delegating responsibility for (1) to the `batchCreator`, renaming this to `flushThresholdBytes` would make the remaining use (2) clearer; see the sketch below. Happy to rename internally first and follow up with a separate PR to clean up the configuration interfaces in a backwards-compatible manner.
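A minimal sketch of the loop after the suggested rename (illustrative only; apart from the renamed field, this mirrors the diff above):
```
private void nonBlockingFlush() throws InterruptedException {
    while (!rateLimitingStrategy.shouldBlock(createRequestInfo())
            && (bufferedRequestEntries.size() >= getNextBatchSizeLimit()
                    // renamed from maxBatchSizeInBytes: this value now only
                    // decides *when* to flush; batch sizing is owned by the
                    // BatchCreator
                    || bufferedRequestEntries.totalSizeInBytes() >= flushThresholdBytes)) {
        flush();
    }
}
```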
##########
flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/sink/writer/AsyncSinkWriter.java:
##########
@@ -213,11 +213,26 @@ protected void submitRequestEntries(
      */
     protected abstract long getSizeInBytes(RequestEntryT requestEntry);
+    /**
+     * This constructor is deprecated. Users should use {@link #AsyncSinkWriter(ElementConverter,
+     * WriterInitContext, AsyncSinkWriterConfiguration, Collection, BatchCreator, BufferWrapper)}.
+     */
+    @Deprecated
     public AsyncSinkWriter(
             ElementConverter<InputT, RequestEntryT> elementConverter,
             WriterInitContext context,
             AsyncSinkWriterConfiguration configuration,
             Collection<BufferedRequestState<RequestEntryT>> states) {
+        this(elementConverter, context, configuration, states, null, null);
+    }
+
+    public AsyncSinkWriter(
+            ElementConverter<InputT, RequestEntryT> elementConverter,
+            WriterInitContext context,
+            AsyncSinkWriterConfiguration configuration,
+            Collection<BufferedRequestState<RequestEntryT>> states,
+            @Nullable BatchCreator<RequestEntryT> batchCreator,
+            @Nullable BufferWrapper<RequestEntryT> bufferedRequestEntries) {
Review Comment:
Hmm, it would be cleaner (and easier to maintain backwards compatibility) if we set up the configuration for `batchCreator` / `bufferedRequestEntries` in the `AsyncSinkWriterConfiguration` object.
This also removes the need for sink implementors to make any changes to their constructor signatures at all! :)
### Reasoning
With the current proposal, sink implementers using the default implementations would need to pass `null` to the new constructor. If implementers want to override just one of the two, they would also need to ensure the right argument is `null`.
```
// Use defaults for both
super(elementConverter, context, configuration, states, null, null);
// Only override batchCreator
super(elementConverter, context, configuration, states, batchCreator, null);
// Only override bufferWrapper
super(elementConverter, context, configuration, states, null, bufferWrapper);
```
Compare this to setting these options in the `AsyncSinkWriterConfiguration` object instead.
```
// specify both options
AsyncSinkWriterConfiguration.builder()
.setBatchCreator(batchCreator)
.setBufferWrapper(bufferWrapper)
.build()
// specify single option
AsyncSinkWriterConfiguration.builder()
.setBatchCreator(batchCreator)
.build()
// use defaults
AsyncSinkWriterConfiguration.builder()
.build()
```
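The writer could then resolve the defaults internally along these lines (a sketch only; the getter names on `AsyncSinkWriterConfiguration` are assumptions, not actual PR code):
```
// Hypothetical getters; falls back to the same default the current
// constructor uses when nothing was configured.
this.bufferedRequestEntries =
        configuration.getBufferWrapper() != null
                ? configuration.getBufferWrapper()
                : new DequeBufferWrapper.Builder<RequestEntryT>().build();
```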
##########
flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/sink/writer/AsyncSinkWriter.java:
##########
@@ -327,7 +349,12 @@ private void flush() throws InterruptedException {
             requestInfo = createRequestInfo();
         }
-        List<RequestEntryT> batch = createNextAvailableBatch(requestInfo);
+        Batch<RequestEntryT> batchCreationResult =
Review Comment:
nit: rename to `batch` instead of `batchCreationResult`
##########
flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/sink/writer/AsyncSinkWriter.java:
##########
@@ -327,7 +349,12 @@ private void flush() throws InterruptedException {
             requestInfo = createRequestInfo();
         }
-        List<RequestEntryT> batch = createNextAvailableBatch(requestInfo);
+        Batch<RequestEntryT> batchCreationResult =
+                batchCreator.createNextBatch(requestInfo, bufferedRequestEntries);
Review Comment:
We should make clear that this method does two things:
1/ Mutates the `bufferedRequestEntries` data structure.
2/ Returns the created batch.
This means we need to set out clear expectations and guarantees around the following (see the interface sketch below):
1/ **Concurrent calls / thread safety.** We need to make sure that a new instance of the `bufferWrapper` is created per parallel subtask of the async sink, AND that within each subtask `batchCreator.createNextBatch()` is never called concurrently (otherwise implementors would need to ensure thread-safe calling). We currently enforce the latter because we call this as part of `flush()`, which runs on the main Flink thread. We should make this clear on the interface.
2/ **Error handling.** Users need to ensure that the interaction between the two implementations handles errors well (an entry must not be pulled from the buffer without being put into the batch).
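Something along these lines on the interface would capture both points (a sketch only; the actual `BatchCreator` signature in the PR may differ):
```
public interface BatchCreator<RequestEntryT extends Serializable> {
    /**
     * Creates the next batch by removing entries from {@code bufferedRequestEntries}
     * and returning them as a {@link Batch}.
     *
     * <p>Thread safety: this method is only ever invoked from the sink's main
     * (mailbox) thread, and each parallel subtask owns its own buffer instance,
     * so implementations do not need to be thread-safe.
     *
     * <p>Error handling: if this method throws, the buffer must be left in a
     * consistent state; an entry must either remain in the buffer or be part of
     * the returned batch, never dropped in between.
     */
    Batch<RequestEntryT> createNextBatch(
            RequestInfo requestInfo, BufferWrapper<RequestEntryT> bufferedRequestEntries);
}
```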
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]