Zack Hable created TEXT-228:
-------------------------------
Summary: StringTokenizer performance degradation when parsing log lines
Key: TEXT-228
URL: https://issues.apache.org/jira/browse/TEXT-228
Project: Commons Text
Issue Type: Bug
Affects Versions: 1.10.0
Environment: Linux
Reporter: Zack Hable
After upgrading from Apache Commons Text 1.9 to 1.10.0, we noticed that our
system "hangs" (or, more likely, takes an excessively long time) when
splitting large lines (100MB+ in size) with StringTokenizer.
Mitigation: Revert to Apache Commons Text 1.9
Scala version:
{code:java}
> scala -version
Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.
{code}
Java version:
{code:java}
> java -version
openjdk version "1.8.0_382"
OpenJDK Runtime Environment (build 1.8.0_382-b05)
OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
{code}
Reproduction Steps:
# Generate a sample large file
{code:java}
echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
dd if=/dev/zero bs=100MB count=1 >> largefile
sed -i -e "s/\x0/0/g" largefile
echo -n "\0" >> largefile
{code}
# Setup reproduce.scala
{code:java}
import org.apache.commons.text.StringTokenizer
val lines = scala.io.Source.fromFile("./largefile").getLines.toList
val st: StringTokenizer = new StringTokenizer(lines(0))
val res = st.getTokenArray()
{code}
# Download Apache Commons Jars
{code:java}
wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
{code}
# Run program with a 10 second timeout (1.10 seems to hang for >1 minute)
{code:java}
> time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala 2.60s user 0.83s system 121% cpu 2.818 total
> time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala 0.02s user 0.00s system 0% cpu 10.002 total
{code}
As shown above, 1.9 takes ~3 seconds, whereas 1.10.0 times out after 10
seconds. I haven't measured exactly how long 1.10.0 takes, but it seems to
run for >1 minute.
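Smaller inputs of the same shape may help when checking how the run time
grows with line length. The following is only a sketch of the file generation
in the first reproduction step: the helper name make_line and the
largefile.N file names are made up for illustration, it assumes a head that
supports -c (GNU and BSD both do), and it omits the trailing "\0" that the
original step appends.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): make_line FILE ZERO_COUNT
# Writes two quoted tokens followed by ZERO_COUNT '0' characters, all on one line.
make_line() {
    file=$1
    zeros=$2
    # Same prefix as the first reproduction step: two quoted tokens, each followed by a space.
    printf '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > "$file"
    # Append ZERO_COUNT ASCII '0' characters with no trailing newline.
    head -c "$zeros" /dev/zero | tr '\0' '0' >> "$file"
}

# Small sizes for a quick scaling check; the original step uses 100000000.
for n in 1000 100000; do
    make_line "largefile.$n" "$n"
done
```

Each generated file can be substituted for ./largefile in reproduce.scala.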
--
This message was sent by Atlassian Jira
(v8.20.10#820010)