ehoner opened a new pull request, #1662: URL: https://github.com/apache/samza/pull/1662
The existing implementation can cause GC performance issues and even OOM errors when the underlying buffer, `ByteArrayOutputStream`, is initialized to the maximum blob size (default is 10MB). This adds a new configuration parameter to control the initial size **and** sets the default to 32 (bytes). 32 is the default size for `ByteArrayOutputStream`, which will grow as needed. The [AzureBlobAvroWriter#L176](https://github.com/apache/samza/blob/03b187a6de0e123568f3ce3af94c946e6380fc8d/samza-azure/src/main/java/org/apache/samza/system/azureblob/avro/AzureBlobAvroWriter.java#L176) instance prevents "maximum blob size" from being exceeded, so the size does not need to be "guarded" by the `AzureBlobOutputStream`, although these responsibilities are not clearly separated in [SEP-26](https://cwiki.apache.org/confluence/display/SAMZA/SEP-26:+Azure+Blob+Storage+Producer).

#### GC Discussion

The focus here is on the G1 GC, the default GC in Java 11+, and humongous objects (G1 specific).[^1] The G1 GC introduced a new memory management strategy that divides the heap into regions, sized by `-XX:G1HeapRegionSize=n`. By default, G1 targets ~2048 regions whose size is a power of 2 between 1MB and 32MB. Any object larger than half of a region is considered a humongous object. Humongous objects are allocated an entire region (or consecutive regions), and any remaining space is non-addressable for the life of the humongous object.[^2]

A JVM heap size of 31GB, `-Xmx31G`, will default to 16MB regions, which means each 10MB buffer requires an entire region **and** prevents the use of the remaining 6MB, regardless of how much data is in the buffer. This buffer size can also complicate memory allocation on `new`. When the JVM immediately promotes an object to Old Gen because there is insufficient space in Eden, and G1 enforces a strict minimum space for Young Gen, the JVM can exit with an OOM if there are no empty regions.[^3]

The significance of this issue is directly related to the number of buffers allocated.
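
To make the region arithmetic above concrete, here is a minimal sketch in plain Java. The class `HumongousMath` and its method names are hypothetical and not part of Samza or the JDK. The 16MB region size and 10MB buffer size are the figures quoted in this description, and the calculation ignores the small array header, so real numbers will differ slightly.

```java
/** Back-of-the-envelope G1 humongous-object arithmetic (illustrative only, not Samza code). */
final class HumongousMath {
    static final long MB = 1024L * 1024L;

    /** Per the description above: an object larger than half a region is humongous. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    /** Number of consecutive regions a humongous object occupies (ceiling division). */
    static long regionsNeeded(long objectBytes, long regionBytes) {
        return (objectBytes + regionBytes - 1) / regionBytes;
    }

    /** Bytes left over in the occupied regions, unusable while the object is alive. */
    static long wastedBytes(long objectBytes, long regionBytes) {
        return regionsNeeded(objectBytes, regionBytes) * regionBytes - objectBytes;
    }

    public static void main(String[] args) {
        long region = 16 * MB; // region size quoted above for -Xmx31G (assumed, not measured)
        long buffer = 10 * MB; // default maximum blob size quoted above

        System.out.println("10MB buffer humongous: " + isHumongous(buffer, region));
        System.out.println("regions needed: " + regionsNeeded(buffer, region));
        System.out.println("wasted MB per buffer: " + wastedBytes(buffer, region) / MB);

        // A 32-byte initial buffer is far below the humongous threshold.
        System.out.println("32-byte buffer humongous: " + isHumongous(32, region));
    }
}
```

Multiplying the wasted bytes by the number of live buffers gives a rough estimate of how much heap is tied up, which is why the buffer count matters.
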
Systems allocating a large number of buffers are susceptible to this issue. Using the default size allows the JVM to allocate memory as needed and avoid designs that interfere with GC architecture. For any users that encounter issues caused by buffer growth, the configuration parameter allows them to tune their system accordingly.

[^1]: "[Garbage-First Garbage Collector: Humongous Objects](https://docs.oracle.com/en/java/javase/11/gctuning/garbage-first-g1-garbage-collector1.html#GUID-D74F3CC7-CC9F-45B5-B03D-510AEEAC2DAC)"
[^2]: "[What’s the deal with humongous objects in Java?](https://devblogs.microsoft.com/java/whats-the-deal-with-humongous-objects-in-java/)"
[^3]: "[Part 1: Introduction to the G1 Garbage Collector](https://www.redhat.com/en/blog/part-1-introduction-g1-garbage-collector)"

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
