[ https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zack Hable updated TEXT-228:
----------------------------
Summary: StringTokenizer performance degradation when parsing large lines
(was: StringTokenizer performance degradation when parsing log lines)
> StringTokenizer performance degradation when parsing large lines
> ----------------------------------------------------------------
>
> Key: TEXT-228
> URL: https://issues.apache.org/jira/browse/TEXT-228
> Project: Commons Text
> Issue Type: Bug
> Affects Versions: 1.10.0
> Environment: Linux
> Reporter: Zack Hable
> Priority: Minor
>
> After upgrading from Apache Commons Text 1.9 to 1.10.0, we've noticed that our
> system "hangs" (or, more likely, takes an excessively long time) when
> StringTokenizer splits very large lines (100MB+ in size).
>
> Mitigation: Revert to Apache Commons Text 1.9
>
> Scala version:
>
> {code:java}
> > scala -version
> Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.
> {code}
>
> Java version:
>
> {code:java}
> > java -version
> openjdk version "1.8.0_382"
> OpenJDK Runtime Environment (build 1.8.0_382-b05)
> OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
> {code}
>
>
> Reproduction Steps:
> # Generate a sample large file
> {code:java}
> echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
> dd if=/dev/zero bs=100MB count=1 >> largefile
> sed -ie "s/\x0/0/g" largefile
> echo -n "\0" >> largefile
> {code}
> # Set up reproduce.scala
> {code:java}
> import org.apache.commons.text.StringTokenizer
> val lines = scala.io.Source.fromFile("./largefile").getLines.toList
> val st: StringTokenizer = new StringTokenizer(lines(0))
> val res = st.getTokenArray()
> {code}
> # Download Apache Commons Jars
> {code:java}
> wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
> wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
> {code}
> # Run program with a 10 second timeout (1.10 seems to hang for >1 minute)
> {code:java}
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala 2.60s user 0.83s system 121% cpu 2.818 total
>
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala 0.02s user 0.00s system 0% cpu 10.002 total
> {code}
> As shown above, 1.9 finishes in ~3 seconds, whereas 1.10.0 is killed by the
> 10-second timeout. I haven't measured how long 1.10.0 actually takes, but it
> appears to run for more than 1 minute.
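Since the report has no firm figure for how long 1.10.0 takes, one way to narrow it down might be to time tokenization at several smaller line lengths and see how the time grows. The following is only a minimal sketch: the object and helper names (TokenizerScaling, buildLine, timedMillis, measure) are hypothetical and not part of the report, and the line is built in memory to approximate the shape of the generated file (two quoted phrases followed by one long run of '0' characters) rather than read from disk.

```scala
// Hypothetical helpers for illustration; none of these names come from the report.
object TokenizerScaling {
  // Approximates the generated file: two quoted phrases, then `padChars` '0' characters.
  def buildLine(padChars: Int): String = {
    val prefix = "\"SOME TEXT WITH SPACE\" \"SOME TEXT WITH SPACE\" "
    val sb = new java.lang.StringBuilder(prefix.length + padChars)
    sb.append(prefix)
    var i = 0
    while (i < padChars) { sb.append('0'); i += 1 }
    sb.toString
  }

  // Evaluates `body` once and returns its result with the elapsed wall-clock milliseconds.
  def timedMillis[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // For each size, builds a line, tokenizes it, and records (size, token count, millis).
  def measure(sizes: Seq[Int], tokenize: String => Array[String]): Seq[(Int, Int, Long)] =
    sizes.map { n =>
      val line = buildLine(n)
      val (tokens, ms) = timedMillis(tokenize(line))
      (n, tokens.length, ms)
    }
}

// Usage, assuming a commons-text jar is on the classpath (as in the steps above):
//   import org.apache.commons.text.StringTokenizer
//   TokenizerScaling
//     .measure(Seq(1000000, 2000000, 4000000), s => new StringTokenizer(s).getTokenArray())
//     .foreach { case (n, count, ms) => println(s"$n chars: $count tokens in $ms ms") }
```

Running the same sizes against both jars, and doubling the size each step, should indicate whether the time under 1.10.0 grows roughly in proportion to the input or faster. The tokenizer is passed in as a function so the harness itself does not depend on either version.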
--
This message was sent by Atlassian Jira
(v8.20.10#820010)