[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

Alex Herbert (Jira) Mon, 21 Aug 2023 06:35:10 -0700


    [ 
https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756923#comment-17756923
 ]


Alex Herbert commented on TEXT-228:
-----------------------------------

Throwing an OutOfMemoryError is reasonable practice when writing objects that 
allocate memory. In this case the error will arise when new chars are to be 
added to the char buffer managed by the TextStringBuilder. Whether the error 
arises inside Arrays.copyOf(buffer, newLength), or we throw one because we know 
the newLength is not possible to allocate, the behaviour for the downstream 
user is identical: they tried to increase the size of the TextStringBuilder and 
ran out of memory.

If we change to throwing a different runtime exception, I do not see the extra 
information being useful to the user. They did something that was not possible 
within the memory limits of the JVM.
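
For illustration only, the two failure paths described above can be sketched 
as follows. This is not the code from the PR or from TextStringBuilder: the 
class BufferGrowthSketch, its method names, the doubling policy and the 
MAX_BUFFER_SIZE constant are all assumptions made for the example. An 
impossible request is rejected with an explicit OutOfMemoryError, and any 
other request is handed to Arrays.copyOf, where the JVM may throw the same 
error itself.

```java
import java.util.Arrays;

// Hypothetical sketch; names and growth policy are assumptions, not library code.
final class BufferGrowthSketch {
    // Assumed soft limit: some VMs cannot allocate arrays of exactly Integer.MAX_VALUE.
    static final int MAX_BUFFER_SIZE = Integer.MAX_VALUE - 8;

    private BufferGrowthSketch() {}

    // Computes the next capacity for a buffer that must hold minCapacity chars.
    static int newCapacity(int currentCapacity, int minCapacity) {
        if (minCapacity < 0) {
            // The caller's int arithmetic (length + added) overflowed:
            // no array of that size can exist, so fail explicitly.
            throw new OutOfMemoryError(
                "Unable to allocate array size: " + (minCapacity & 0xffffffffL));
        }
        // Use long arithmetic so that doubling cannot overflow.
        long target = Math.max(2L * currentCapacity, minCapacity);
        if (target > MAX_BUFFER_SIZE) {
            // Clip the growth to the soft limit, but still honour a larger explicit
            // request; Arrays.copyOf may then throw OutOfMemoryError itself.
            target = Math.max(minCapacity, MAX_BUFFER_SIZE);
        }
        return (int) target;
    }

    // Returns a buffer with room for at least minCapacity chars.
    static char[] ensureCapacity(char[] buffer, int minCapacity) {
        if (minCapacity >= 0 && minCapacity <= buffer.length) {
            return buffer; // already large enough
        }
        return Arrays.copyOf(buffer, newCapacity(buffer.length, minCapacity));
    }
}
```

Either way the caller sees an OutOfMemoryError raised while growing the 
buffer, which is the point made above: the two paths are indistinguishable 
to the downstream user.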

PR for review here: [PR 452|https://github.com/apache/commons-text/pull/452]

I adapted code from Commons Codec, where we do similar buffer management in 
the BaseNCodec for a byte[] buffer. I expect the tests that push the memory to 
the limit to be skipped in the CI. They do run locally if you have enough 
memory. The test with the smallest memory footprint still requires at least 
2 GiB; that one may be executed in the CI.
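
As a rough illustration of how a memory-hungry test could decide whether to 
run, the sketch below estimates the heap still available to the JVM using 
only java.lang.Runtime. It is an assumption about one possible approach, not 
a description of how the tests in the PR are actually skipped; the class 
HeapHeadroomSketch and its methods are hypothetical.

```java
// Hypothetical helper; not taken from the PR or from Commons Text.
final class HeapHeadroomSketch {
    private HeapHeadroomSketch() {}

    // Estimates how many bytes the JVM could still allocate before reaching -Xmx.
    static long availableBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory(); // heap currently in use
        return rt.maxMemory() - used;                   // headroom up to the maximum heap
    }

    // True if the estimated headroom covers the requested number of bytes.
    static boolean hasHeadroom(long requiredBytes) {
        return availableBytes() >= requiredBytes;
    }
}
```

A test needing roughly 2 GiB could call 
HeapHeadroomSketch.hasHeadroom(2L * 1024 * 1024 * 1024) first and skip 
itself when the result is false. The figure is only an estimate, since the 
garbage collector may reclaim more memory than freeMemory() reports.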

 

> StringTokenizer performance degradation when parsing large lines
> ----------------------------------------------------------------
>
>                 Key: TEXT-228
>                 URL: https://issues.apache.org/jira/browse/TEXT-228
>             Project: Commons Text
>          Issue Type: Bug
>    Affects Versions: 1.9, 1.10.0
>         Environment: Linux
>            Reporter: Zack Hable
>            Priority: Minor
>
> After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed 
> our system "hangs" (or likely will take an excessively long time to process) 
> large lines (100MB+ in size) when splitting strings with StringTokenizer.
>  
> Mitigation: Revert to Apache Commons Text 1.9
>  
> Scala version:
>  
> {code:java}
> > scala -version
> Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and 
> Lightbend, Inc.
> {code}
>  
> Java version:
>  
> {code:java}
> > java -version 
> openjdk version "1.8.0_382"
> OpenJDK Runtime Environment (build 1.8.0_382-b05)
> OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
> {code}
>  
>  
> Reproduction Steps:
>  # Generate a sample large file
> {code:java}
> echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
> dd if=/dev/zero bs=100MB count=1 >> largefile 
> sed -ie "s/\x0/0/g" largefile
> echo -n "\0" >> largefile
> {code}
>  # Setup reproduce.scala
> {code:java}
> import org.apache.commons.text.StringTokenizer
> val lines = scala.io.Source.fromFile("./largefile").getLines.toList
> val st: StringTokenizer = new StringTokenizer(lines(0))
> val res = st.getTokenArray()
> {code}
>  # Download Apache Commons Jars
> {code:java}
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
> {code}
>  # Run program with a 10 second timeout (1.10 seems to hang for >1 minute)
> {code:java}
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s 
> user 0.83s system 121% cpu 2.818 total
>  
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s 
> user 0.00s system 0% cpu 10.002 total
> {code}
> As you notice above 1.9 takes ~3 seconds whereas 1.10 times out after 10 
> seconds.  I haven't come across a definite amount of time 1.10 takes, but it 
> seems to run for >1 minute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
