brodin opened a new issue, #36067:
URL: https://github.com/apache/beam/issues/36067

   ### What happened?
   
   ### **Description**
   
   The default `maxBatchBytesSize` for `PubsubIO.Write` in the Java SDK is 
smaller than the maximum message size that Google Cloud Pub/Sub allows. As a 
result, a message near the 10MB limit cannot be written with the default 
settings, and the pipeline fails.
   
   The Google Cloud Pub/Sub documentation specifies a **maximum message size of 
10MB**. The default batch size in the Apache Beam Java SDK is lower (7,500,000 
bytes, according to the error message below), so sending a message larger than 
the default batch size raises an exception even when the message is within the 
10MB Pub/Sub limit.
   
   This can be confusing for developers who expect to be able to send messages 
up to the documented Pub/Sub limit. It also requires a manual workaround to set 
a larger batch size, which may not be obvious to all users.
   
   -----
   
   ### **Steps to Reproduce**
   
   1.  Create a pipeline that uses `PubsubIO.Write` to send a message to a 
Pub/Sub topic.
   2.  Create a message that is larger than the default batch size, but smaller 
than 10MB (e.g., 8MB).
   3.  Run the pipeline without explicitly setting the `maxBatchBytesSize`.
   
   -----
   
   ### **Expected Behavior**
   
   The pipeline should be able to send a message that is smaller than the 10MB 
Pub/Sub limit, even if it is larger than the default batch size. The default 
batch size should be at least as large as the maximum allowed message size.
   
   -----
   
   ### **Actual Behavior**
   
   The pipeline fails with a `javax.naming.SizeLimitExceededException`, 
indicating that the message size exceeds the batch size limit. The error 
message is similar to the following:
   
   ```text
   javax.naming.SizeLimitExceededException: Pubsub message of length 8000000 exceeds maximum of 7500000 bytes, when considering the payload and attributes. See https://cloud.google.com/pubsub/quotas#resource_limits
   ```
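
The failure can be modelled without a Pub/Sub topic. The following is a minimal sketch, not Beam's actual validation code: the class `PubsubSizeCheckSketch`, its methods, and the way attribute sizes are counted (plain UTF-8 lengths of keys and values) are assumptions made for illustration. Only the 7,500,000-byte limit and the exception type come from the error above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.naming.SizeLimitExceededException;

// Hypothetical stand-in for the size validation; not Beam's actual code.
class PubsubSizeCheckSketch {
  // Default limit as reported by the error message in this issue.
  static final int DEFAULT_MAX_BATCH_BYTES = 7_500_000;
  // Documented Pub/Sub per-message limit (10MB), taken as 10,000,000 bytes here.
  static final int PUBSUB_MAX_MESSAGE_BYTES = 10_000_000;

  // Assumed size model: payload bytes plus UTF-8 bytes of every attribute key and value.
  static int messageSize(byte[] payload, Map<String, String> attributes) {
    int total = payload.length;
    for (Map.Entry<String, String> e : attributes.entrySet()) {
      total += e.getKey().getBytes(StandardCharsets.UTF_8).length;
      total += e.getValue().getBytes(StandardCharsets.UTF_8).length;
    }
    return total;
  }

  // Rejects any single message whose size exceeds the configured batch byte limit.
  static void validate(byte[] payload, Map<String, String> attributes, int maxBatchBytes)
      throws SizeLimitExceededException {
    int size = messageSize(payload, attributes);
    if (size > maxBatchBytes) {
      throw new SizeLimitExceededException(
          "Pubsub message of length " + size + " exceeds maximum of " + maxBatchBytes + " bytes");
    }
  }

  public static void main(String[] args) throws Exception {
    byte[] payload = new byte[8_000_000]; // 8MB message, below the 10MB Pub/Sub limit

    try {
      validate(payload, Map.of(), DEFAULT_MAX_BATCH_BYTES);
    } catch (SizeLimitExceededException e) {
      System.out.println("default limit: " + e.getMessage());
    }

    // With the limit raised to the documented Pub/Sub maximum, the same message passes.
    validate(payload, Map.of(), PUBSUB_MAX_MESSAGE_BYTES);
    System.out.println("raised limit: accepted");
  }
}
```

In this model the same 8MB message is rejected under the default limit and accepted once the limit matches the documented Pub/Sub maximum, which is the gap this issue describes.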
   
   -----
   
   ### **Proposed Solution**
   
   There are two possible solutions to this issue:
   
     * **Increase the default `maxBatchBytesSize` to 10MB.** This would align 
the default behavior with the documented Pub/Sub limit and allow larger 
messages to be sent without any additional configuration.
     * **Improve the documentation to make it clear that the default batch size 
is smaller than the maximum message size.** This would help developers 
understand the limitation and know that they need to manually configure the 
batch size for larger messages.
   
   Given the principle of least surprise, increasing the default batch size 
seems like the most appropriate solution. It would make the library more 
intuitive to use and prevent unexpected failures. If there is a reason for the 
default batch size to be smaller than 10MB, this should be clearly documented.
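
One possible reason for the 7,500,000-byte default, offered here as an assumption that this issue does not confirm, is headroom for base64 encoding: base64 expands data by a factor of 4/3, and 7,500,000 is exactly three quarters of 10,000,000. The sketch below only works through that arithmetic; the class and method names are hypothetical.

```java
import java.util.Base64;

// Illustrative arithmetic only; the class and method names are hypothetical.
class Base64OverheadSketch {
  // Length of the padded base64 encoding of rawBytes bytes:
  // every group of up to 3 input bytes becomes 4 output characters.
  static long encodedLength(long rawBytes) {
    return 4 * ((rawBytes + 2) / 3);
  }

  public static void main(String[] args) {
    // Sanity-check the formula against the JDK encoder on a small input.
    int sample = 1000;
    int actual = Base64.getEncoder().encode(new byte[sample]).length;
    System.out.println("formula matches encoder: " + (encodedLength(sample) == actual));

    // Encoded size of a message at the default limit from the error message.
    System.out.println("7500000 raw -> " + encodedLength(7_500_000L) + " encoded");
    // Encoded size of the 8MB message from the reproduction steps.
    System.out.println("8000000 raw -> " + encodedLength(8_000_000L) + " encoded");
  }
}
```

Under this assumption, a 7,500,000-byte payload encodes to exactly 10,000,000 characters, while an 8,000,000-byte payload encodes to 10,666,668. If the default does exist for this reason, raising it to 10MB would only be safe on a transport that does not base64-encode the payload, which is worth confirming before changing the default.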
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [x] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
