Re: [PR] [LANG-1805] Optimize StringBuilder usage in join() for primitives [commons-lang]

via GitHub Thu, 18 Dec 2025 00:03:16 -0800


jher235 commented on PR #1532:
URL: https://github.com/apache/commons-lang/pull/1532#issuecomment-3668917873


   Hi @garydgregory ,
   
   Thank you for your patience and for encouraging the benchmark.
   
   ### Transparency Note
   As mentioned in the checklist: I used AI assistance to help structure this 
PR and understand benchmarking best practices. However, the problem 
identification, benchmark execution, data analysis, and code implementation are 
my own work.
   
   ---
   
   ## Benchmark Results
   
   I've completed the JMH benchmarks, and the results reveal this PR should be 
viewed in two distinct parts:
   
   ### Part 1: Critical Bug Fix (`boolean` and `char`)
   
   The current implementation has a **structural flaw** where it calculates 
`StringBuilder` capacity based on the full `array.length` rather than the 
actual range being joined (`endIndex - startIndex`).
   
   **Impact on subset operations:**
   
   | Metric | Current (bug) | Fixed | Improvement |
   |--------|---------------|-------|-------------|
   | Time   | 6,512 ns      | 110 ns | **59x faster** |
   | Memory | 60,136 bytes  | 200 bytes | **300x less allocation** |
   
   *Test: Joining 10 elements from a 10,000-element boolean array*
   
   This isn't a theoretical edge case - it happens whenever the 
`startIndex`/`endIndex` parameters are used for their intended purpose. **This 
is a correctness issue that needs fixing regardless of other optimizations.**
   
   ---
   
   ### Part 2: General Optimization (`int` and other numeric types)
   
   For primitives that currently use the default StringBuilder constructor, I 
tested three array sizes with realistic integer values:
   
   | Array Size | Time (ns/op) | Memory (B/op) | Observation |
   |------------|--------------|---------------|-------------|
   | **Small (10)** | | | |
   | Current | 233 | 280 | Baseline |
   | Optimized | **170** | **160** | **27% faster, 43% less memory** |
   | **Medium (100)** | | | |
   | Current | 1,821 | 1,800 | Baseline |
   | Optimized | 1,714 | 1,808 | **~Same performance** |
   | **Large (1000)** | | | |
   | Current | 19,988 | 24,640 | Baseline |
   | Optimized | **19,600** | **18,104** | **26% less memory** |
   
   **Key findings:**
   
   - **Small arrays**: Clear wins (27% speed, 43% memory). The default 16-char 
buffer is immediately too small.
   - **Medium arrays**: Neutral impact. Both approaches trigger ~1-2 resizes, 
resulting in comparable performance.
   - **Large arrays**: Modest speed gain (2%) but significant memory reduction 
(26%). Multiple resizes in the old approach cause memory churn.
   
   **Bottom line**: The optimization is "regression-free" - it provides 
meaningful gains in common cases (small and large arrays) and stays neutral in 
between.
   
   ---
   
   ## Assessment
   
   ### What we're fixing:
   1. **Boolean/char bug**: Dramatic over-allocation in subset scenarios (59x 
improvement).
   2. **Numeric types**: Situational improvements (Small/Large) with no 
downsides.
   
   ### Trade-off:
   - **Gained**: Bug fix + performance improvements where it matters.
   - **Cost**: One extra variable (`noOfItems`) per method + capacity 
calculations.
   
   ---
   
   ## How Should We Proceed?
   
   Respecting your "less code is easier to maintain" philosophy, here are two 
paths forward:
   
   **Option A (Recommended)**: Accept the full PR
   - Fixes the boolean/char bug (critical).
   - Applies consistent optimization pattern across all primitives.
   - No performance regressions anywhere.
   - Slightly more code, but uniform approach.
   
   **Option B (Minimal)**: Only fix boolean/char
   - Just replace `array.length` with `noOfItems` in boolean/char methods.
   - Leave numeric types unchanged.
   - Minimal code change.
   - Accepts minor inefficiencies in numeric types.
   
   I personally lean toward **Option A** for consistency and completeness, but 
I'm genuinely comfortable with **Option B** if you prefer to minimize changes.
   
   What's your preference?
   
   ---
   
   *Environment: JMH 1.37, 3 warmup + 5 measurement iterations*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [LANG-1805] Optimize StringBuilder usage in join() for primitives [commons-lang]

Reply via email to