ehoner opened a new pull request, #1662:
URL: https://github.com/apache/samza/pull/1662

   The existing implementation can cause GC performance issues and even OOM 
errors when the underlying buffer, `ByteArrayOutputStream`, is initialized to 
the maximum blob size (default is 10MB). This adds a new configuration 
parameter to control the initial buffer size **and** sets its default to 32 
bytes. 32 is the default initial capacity of `ByteArrayOutputStream`, which 
grows as needed. The 
[AzureBlobAvroWriter#L176](https://github.com/apache/samza/blob/03b187a6de0e123568f3ce3af94c946e6380fc8d/samza-azure/src/main/java/org/apache/samza/system/azureblob/avro/AzureBlobAvroWriter.java#L176)
 instance prevents "maximum blob size" from being exceeded, so the size does 
not need to be "guarded" by the `AzureBlobOutputStream`, although these 
responsibilities are not clearly separated in 
[SEP-26](https://cwiki.apache.org/confluence/display/SAMZA/SEP-26:+Azure+Blob+Storage+Producer).
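
   As a rough illustration of the difference between the two initial sizes, the sketch below subclasses `ByteArrayOutputStream` only to read the length of its protected backing array. The class and method names (`BufferInitSizeSketch`, `InspectableBuffer`, `capacityAfterWrite`) and the `MAX_BLOB_SIZE_BYTES` constant are hypothetical and are not part of the Samza code in this PR.

```java
import java.io.ByteArrayOutputStream;

public class BufferInitSizeSketch {

    // Assumption: mirrors the 10MB "maximum blob size" default mentioned above.
    static final int MAX_BLOB_SIZE_BYTES = 10 * 1024 * 1024;

    /** Subclass that exposes the length of the protected backing array `buf`. */
    static class InspectableBuffer extends ByteArrayOutputStream {
        InspectableBuffer() {
            super(); // documented initial capacity: 32 bytes
        }

        InspectableBuffer(int initialSize) {
            super(initialSize); // backing array is allocated eagerly at this size
        }

        int capacity() {
            return buf.length;
        }
    }

    /** Capacity of a default-constructed buffer before any write. */
    static int defaultCapacity() {
        return new InspectableBuffer().capacity();
    }

    /** Capacity of a default-constructed buffer after writing n bytes. */
    static int capacityAfterWrite(int n) {
        InspectableBuffer buffer = new InspectableBuffer();
        buffer.write(new byte[n], 0, n); // the stream grows its array as needed
        return buffer.capacity();
    }

    public static void main(String[] args) {
        int eager = new InspectableBuffer(MAX_BLOB_SIZE_BYTES).capacity();
        System.out.println("eager capacity (bytes): " + eager);
        System.out.println("default capacity (bytes): " + defaultCapacity());
        System.out.println("capacity after 1000 bytes: " + capacityAfterWrite(1000));
    }
}
```

   The eager variant holds a 10MB array from construction even when it never receives data, while the default variant starts small and pays for reallocation only as data arrives.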
 
   
   #### GC Discussion
   The focus here is on the G1 GC, the default GC in Java 11+, and humongous 
objects (G1 specific).[^1] The G1 GC introduced a new memory management 
strategy that divides the heap into regions, `-XX:G1HeapRegionSize=n`. The 
default behavior creates ~2048 regions whose size is a power of 2 between 1MB 
and 32MB. Any object larger than half of a region is considered a humongous 
object. Humongous objects are allocated an entire region (or consecutive 
regions) and any remaining space is non-addressable for the life of the 
humongous object.[^2] A JVM heap size of 31GB, `-Xmx31G`, will default to 16MB 
regions, which means each 10MB buffer requires an entire region **and** 
prevents the use of the remaining 6MB, regardless of how much data is in the 
buffer. This buffer size can also complicate memory allocation on `new`: when 
the JVM immediately promotes an object to Old Gen because there is 
insufficient space in Eden, and G1 enforces a strict minimum size for Young 
Gen, the JVM can exit with an OOM if there are no empty regions.[^3] 
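
   The region arithmetic above can be checked with a small sketch. It assumes the rule as stated here (an object larger than half a region is humongous and occupies whole regions) and ignores the array header, which is a few bytes. The class and method names are hypothetical.

```java
public class HumongousWasteSketch {

    static final long MB = 1024L * 1024L;

    /** Per the description above: larger than half a region means humongous. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    /** Bytes left unusable in the whole regions reserved for a humongous object. */
    static long wastedBytes(long objectBytes, long regionBytes) {
        if (!isHumongous(objectBytes, regionBytes)) {
            return 0L; // regular objects share a region with other objects
        }
        // Round up to the number of whole regions the object occupies.
        long regions = (objectBytes + regionBytes - 1) / regionBytes;
        return regions * regionBytes - objectBytes;
    }

    public static void main(String[] args) {
        long region = 16 * MB; // region size quoted above for -Xmx31G
        long buffer = 10 * MB; // buffer initialized to the default maximum blob size
        System.out.println("humongous: " + isHumongous(buffer, region));
        System.out.println("wasted MB per buffer: " + wastedBytes(buffer, region) / MB);
        System.out.println("32-byte buffer humongous: " + isHumongous(32, region));
    }
}
```

   Under these assumptions, every eagerly sized buffer costs a full 16MB region, so the overhead scales linearly with the number of buffers allocated.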
   
   The significance of this issue is directly related to the number of buffers 
allocated. Systems allocating a large number of buffers are susceptible to this 
issue. Using the default size allows the JVM to allocate memory as needed and 
avoid designs that interfere with GC architecture. For any users that encounter 
issues caused by buffer growth, the configuration parameter allows them to tune 
their system accordingly.
   
   
   [^1]: "[Garbage-First Garbage Collector: Humongous 
Objects](https://docs.oracle.com/en/java/javase/11/gctuning/garbage-first-g1-garbage-collector1.html#GUID-D74F3CC7-CC9F-45B5-B03D-510AEEAC2DAC)"
   [^2]: "[What’s the deal with humongous objects in 
Java?](https://devblogs.microsoft.com/java/whats-the-deal-with-humongous-objects-in-java/)"
   [^3]: "[Part 1: Introduction to the G1 Garbage 
Collector](https://www.redhat.com/en/blog/part-1-introduction-g1-garbage-collector)"

