[ https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756840#comment-17756840 ]

Alex Herbert commented on TEXT-228:
-----------------------------------

The performance regression can be traced to commit 808ecca following a 
discussion in [PR 132|https://github.com/apache/commons-text/pull/132].

This changed the TextStringBuilder.ensureCapacity method from:
{code:java}
public TextStringBuilder ensureCapacity(final int capacity) {
    if (capacity > buffer.length) {
        reallocate(capacity * 2);
    }
    return this;
}
{code}
 
To:
{code:java}
public TextStringBuilder ensureCapacity(final int capacity) {
    // checks for overflow
    if (capacity > 0 && capacity - buffer.length > 0) {
        reallocate(capacity);      //  <--- FYI: No capacity doubling here!
    }
    return this;
}
{code}
This change *tries* to ensure that the capacity check is overflow safe. 
However, there are issues:
h2. 1. Performance

Previously the method would allocate double the required capacity. This is 
somewhat excessive; however, it significantly reduces the number of memory 
reallocations.

The new method will only allocate the memory required. There is {*}no extra 
buffer allocated{*}.

The result is that when appending single characters (as done by the 
StringTokenizer), every new character requires a copy of the entire character 
buffer.
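The quadratic cost can be seen with a toy model of the two growth policies (a sketch using the default initial capacity of 32; this is not the actual TextStringBuilder code):
{code:java}
// Counts full-buffer copies needed to append n single chars under the
// two growth policies: exact-fit (reallocate(capacity)) versus doubling
// (reallocate(capacity * 2)). Toy model only.
public class GrowthCost {

    static long copiesExact(final int n) {
        long copies = 0;
        int cap = 32; // assumed default initial capacity
        for (int size = 0; size < n; size++) {
            if (size == cap) { // exact-fit: grow to size + 1
                cap = size + 1;
                copies++;
            }
        }
        return copies;
    }

    static long copiesDoubling(final int n) {
        long copies = 0;
        int cap = 32;
        for (int size = 0; size < n; size++) {
            if (size == cap) { // doubling: grow to (size + 1) * 2
                cap = (size + 1) * 2;
                copies++;
            }
        }
        return copies;
    }

    public static void main(final String[] args) {
        final int n = 1_000_000;
        System.out.println("exact-fit copies: " + copiesExact(n));
        System.out.println("doubling copies:  " + copiesDoubling(n));
    }
}
{code}
For one million appended characters the exact-fit policy copies the buffer hundreds of thousands of times while the doubling policy copies it only a handful of times; each exact-fit copy also grows linearly with the buffer, giving the observed quadratic slowdown.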

Note: The overflow check is not the reason for the slowdown. On my machine, 
the original method and the updated method run in approximately the same time 
if the updated method is changed to use reallocate(capacity * 2).

Also note that this is a specific use case: appending single characters to a 
buffer. In this case, the use of ensureCapacity(size + 1) seems like overkill. 
I changed the append(char) method from:
{code:java}
public TextStringBuilder append(final char ch) {
    final int len = length();
    ensureCapacity(len + 1);
    buffer[size++] = ch;
    return this;
}
{code}
To:
{code:java}
public TextStringBuilder append(final char ch) {
    if (size == buffer.length) {
        reallocate(size * 2);
    }
    buffer[size++] = ch;
    return this;
}
{code}
This made no notable speed difference to the provided StringTokenizer code. 
However, that program performs a lot of additional logic per character 
processed from the input by the tokenizer. A JMH test appending single 
characters to the TextStringBuilder should be used to examine this potential 
optimisation more closely.
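In the meantime, a rough standalone sketch (hypothetical minimal append loops, not the real class, and without JMH's safeguards against JIT effects) already illustrates the gap between the two strategies:
{code:java}
import java.util.Arrays;

// Rough timing sketch: append(char) with exact-fit growth versus
// append(char) with capacity doubling. Absolute times are indicative only.
public class AppendTiming {

    static char[] appendExact(final int n) {
        char[] buf = new char[32];
        int size = 0;
        for (int i = 0; i < n; i++) {
            if (size + 1 > buf.length) {
                buf = Arrays.copyOf(buf, size + 1); // exact-fit: copy on every append
            }
            buf[size++] = 'x';
        }
        return Arrays.copyOf(buf, size);
    }

    static char[] appendDoubling(final int n) {
        char[] buf = new char[32];
        int size = 0;
        for (int i = 0; i < n; i++) {
            if (size == buf.length) {
                buf = Arrays.copyOf(buf, buf.length * 2); // amortised O(1)
            }
            buf[size++] = 'x';
        }
        return Arrays.copyOf(buf, size);
    }

    public static void main(final String[] args) {
        final int n = 50_000;
        final long t0 = System.nanoTime();
        final char[] a = appendExact(n);
        final long t1 = System.nanoTime();
        final char[] b = appendDoubling(n);
        final long t2 = System.nanoTime();
        System.out.printf("exact-fit: %d us, doubling: %d us%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000);
        System.out.println(a.length == n && b.length == n);
    }
}
{code}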
h2. 2. No feedback on error

The method (still) gives the user no feedback on whether the requested change 
to the capacity was successful.

Current unit tests actually enforce this behaviour. E.g. calling 
ensureCapacity(-1) is allowed.

This behaviour may appear strange but is the same as the JDK's 
ArrayList.ensureCapacity. We should add some javadoc to indicate that only 
positive arguments are recognised by the ensureCapacity method. Negative 
arguments are currently ignored.
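The JDK behaviour is easy to confirm; ArrayList.ensureCapacity silently ignores a negative argument:
{code:java}
import java.util.ArrayList;

// ArrayList.ensureCapacity is a no-op for a negative argument:
// it neither throws nor alters the list.
public class NegativeCapacity {
    public static void main(final String[] args) {
        final ArrayList<String> list = new ArrayList<>();
        list.add("a");
        list.ensureCapacity(-1); // silently ignored
        list.ensureCapacity(0);  // also ignored
        System.out.println(list.size()); // contents unaffected
    }
}
{code}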

This does create one issue in that the method is also used extensively 
internally when adding new chars. In that case it should raise an error if the 
addition failed. If you try to append to a TextStringBuilder something that is 
too long for an array buffer, it will not fail with an OutOfMemoryError. It 
performs a no-op in ensureCapacity and then fails with an 
IndexOutOfBoundsException when actually copying the chars. E.g. this test does 
not pass:
{code:java}
@Test
public void testOutOfMemoryError() {
    char[] chars = new char[3 * (Integer.MAX_VALUE / 8)];
    final TextStringBuilder sb = new TextStringBuilder();
    sb.append(chars);
    sb.append(chars);
    assertThrows(OutOfMemoryError.class, () -> sb.append(chars));
}
{code}
h2. 3. Overflow safe

The overflow safe code is redundant. In this check:
{code:java}
if (capacity > 0 && capacity - buffer.length > 0) {
{code}
the capacity is first compared as a signed integer, and then the "overflow 
safe" subtraction treats it as though it could wrap. Since buffer.length is 
never negative, the subtraction cannot overflow once capacity > 0. So the 
check mixes the two interpretations of capacity and is effectively only 
performing:
{code:java}
if (capacity > buffer.length) {
{code}
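That the two conditions agree can be sanity-checked over representative values, including the int extremes (a quick check, not a proof over all pairs):
{code:java}
// Checks that the 'overflow safe' guard and the plain comparison agree
// for representative capacity/length pairs, including the int extremes.
public class GuardEquivalence {

    static boolean guarded(final int capacity, final int length) {
        return capacity > 0 && capacity - length > 0;
    }

    static boolean plain(final int capacity, final int length) {
        return capacity > length;
    }

    public static void main(final String[] args) {
        final int[] samples = {Integer.MIN_VALUE, -1, 0, 1, 16, 32,
                Integer.MAX_VALUE - 8, Integer.MAX_VALUE};
        boolean equivalent = true;
        for (final int capacity : samples) {
            for (final int length : samples) {
                if (length < 0) {
                    continue; // array lengths are always >= 0
                }
                equivalent &= guarded(capacity, length) == plain(capacity, length);
            }
        }
        System.out.println("equivalent: " + equivalent);
    }
}
{code}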
So the change in commit 808ecca only solved the issue raised by PR 132 due to 
the change from:
{code:java}
reallocate(capacity * 2)
{code}
to:
{code:java}
reallocate(capacity)
{code}
But this introduced a performance regression.
h2. Solution

What is required:
 # The existing ensureCapacity behaviour is unchanged. It should no-op when the 
capacity is negative.
 # The checks to ensure new chars can be added should be overflow safe. If 
overflow occurs in the capacity, then an OutOfMemoryError should be raised.
 # Increases in the capacity should ensure that the new capacity can store at a 
minimum the requested capacity. This should allow extra buffer to reduce 
reallocations.

The solution is to use an internal method to ensure the capacity when a source 
of known length is to be added. This can throw an OutOfMemoryError if the 
capacity is too large. Otherwise the buffer can be expanded to allow the 
addition. This re-establishes allocating extra buffer beyond the requested 
capacity, preventing excessive array copies when items are repeatedly added 
and the capacity is reached.
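A sketch of such an internal method, following the same growth pattern as the JDK's ArrayList (the class and method names here are hypothetical, not the committed fix):
{code:java}
import java.util.Arrays;

// Hypothetical sketch of an internal, overflow-safe growth method:
// grows to at least minCapacity, preferring a doubled buffer, and
// throws OutOfMemoryError when the requested capacity overflows int.
public class GrowthSketch {

    private char[] buffer = new char[32];
    private int size;

    private void ensureCapacityInternal(final int minCapacity) {
        // A negative value here means size + n overflowed int.
        if (minCapacity < 0) {
            throw new OutOfMemoryError("Required capacity exceeds the maximum array size");
        }
        if (minCapacity > buffer.length) {
            int newCapacity = buffer.length * 2; // prefer doubling ...
            if (newCapacity < minCapacity) {     // ... unless doubling is too small
                newCapacity = minCapacity;       // (also covers doubling overflow)
            }
            buffer = Arrays.copyOf(buffer, newCapacity);
        }
    }

    public GrowthSketch append(final char ch) {
        ensureCapacityInternal(size + 1);
        buffer[size++] = ch;
        return this;
    }

    public int length() {
        return size;
    }
}
{code}
Production code would additionally clamp the doubled size to the JVM's maximum array length (as ArrayList does with its MAX_ARRAY_SIZE constant) before falling back to the exact request.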

> StringTokenizer performance degradation when parsing large lines
> ----------------------------------------------------------------
>
>                 Key: TEXT-228
>                 URL: https://issues.apache.org/jira/browse/TEXT-228
>             Project: Commons Text
>          Issue Type: Bug
>    Affects Versions: 1.9, 1.10.0
>         Environment: Linux
>            Reporter: Zack Hable
>            Priority: Minor
>
> After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed 
> our system "hangs" (or likely will take an excessively long time to process) 
> large lines (100MB+ in size) when splitting strings with StringTokenizer.
>  
> Mitigation: Revert to Apache Commons Text 1.9
>  
> Scala version:
>  
> {code:java}
> > scala -version
> Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and 
> Lightbend, Inc.
> {code}
>  
> Java version:
>  
> {code:java}
> > java -version 
> openjdk version "1.8.0_382"
> OpenJDK Runtime Environment (build 1.8.0_382-b05)
> OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
> {code}
>  
>  
> Reproduction Steps:
>  # Generate a sample large file
> {code:java}
> echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
> dd if=/dev/zero bs=100MB count=1 >> largefile 
> sed -ie "s/\x0/0/g" largefile
> echo -n "\0" >> largefile
> {code}
>  # Setup reproduce.scala
> {code:java}
> import org.apache.commons.text.StringTokenizer
> val lines = scala.io.Source.fromFile("./largefile").getLines.toList
> val st: StringTokenizer = new StringTokenizer(lines(0))
> val res = st.getTokenArray()
> {code}
>  # Download Apache Commons Jars
> {code:java}
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
> {code}
>  # Run program with a 10 second timeout (1.10 seems to hang for >1 minute)
> {code:java}
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s 
> user 0.83s system 121% cpu 2.818 total
>  
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s 
> user 0.00s system 0% cpu 10.002 total
> {code}
> As you can notice above, 1.9 takes ~3 seconds whereas 1.10 times out after 10
> seconds. I haven't measured exactly how long 1.10 takes, but it seems to run
> for >1 minute.


