Zack Hable created TEXT-228:
-------------------------------

             Summary: StringTokenizer performance degradation when parsing log lines
                 Key: TEXT-228
                 URL: https://issues.apache.org/jira/browse/TEXT-228
             Project: Commons Text
          Issue Type: Bug
    Affects Versions: 1.10.0
         Environment: Linux
            Reporter: Zack Hable


After upgrading from Apache Commons Text 1.9 to 1.10.0, we noticed that our 
system "hangs" (or, more likely, takes an excessively long time) when splitting 
very large lines (100MB+ in size) with StringTokenizer.

 

Mitigation: Revert to Apache Commons Text 1.9

 

Scala version:

 
{code:java}
> scala -version
Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and 
Lightbend, Inc.
{code}
 

Java version:

 
{code:java}
> java -version 
openjdk version "1.8.0_382"
OpenJDK Runtime Environment (build 1.8.0_382-b05)
OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
{code}
 

 

Reproduction Steps:
 # Generate a sample large file

{code:java}
echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
dd if=/dev/zero bs=100MB count=1 >> largefile 
sed -i -e "s/\x0/0/g" largefile
echo -n "\0" >> largefile
{code}

 # Setup reproduce.scala

{code:java}
import org.apache.commons.text.StringTokenizer
val lines = scala.io.Source.fromFile("./largefile").getLines.toList
val st: StringTokenizer = new StringTokenizer(lines(0))
val res = st.getTokenArray()
{code}

 # Download Apache Commons Jars

{code:java}
wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar

wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
{code}

 # Run the program with a 10-second timeout (1.10.0 seems to hang for >1 minute)

{code:java}
> time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s user 0.83s system 121% cpu 2.818 total
 
> time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s user 0.00s system 0% cpu 10.002 total
{code}
As the output above shows, 1.9 finishes in ~3 seconds, whereas 1.10.0 is killed 
by the 10-second timeout. I have not measured how long 1.10.0 takes to 
complete, but it runs for more than 1 minute.
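
To check how the run time grows with input size, it may help to generate 
smaller inputs of the same shape as the file from step 1 (two quoted tokens 
followed by a long run of '0' characters on a single line). The following is 
only a sketch: the variable names SIZE_BYTES and OUT and the default file name 
largefile-small are illustrative assumptions, not part of the original 
reproduction, and the trailing "\0" that step 1 appends is left out.

```shell
#!/bin/sh
# Hypothetical helper: SIZE_BYTES and OUT are illustrative names,
# not taken from the original report.
SIZE_BYTES="${SIZE_BYTES:-1000000}"   # number of '0' characters to append
OUT="${OUT:-largefile-small}"         # output file name (assumed)

# Same 46-byte prefix as in step 1: two quoted tokens, each followed by a space.
printf '%s' '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > "$OUT"

# Append SIZE_BYTES '0' characters; no newline is written, so the file
# stays a single line.
head -c "$SIZE_BYTES" /dev/zero | tr '\0' '0' >> "$OUT"

# Print the resulting size in bytes.
wc -c < "$OUT"
```

Generating files at a few sizes (for example 1MB, 2MB, 4MB) and timing 
reproduce.scala against each, after pointing it at the generated file, should 
indicate whether the run time under 1.10.0 grows faster than linearly.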



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
