[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

Alex Herbert (Jira) Wed, 09 Aug 2023 09:44:06 -0700


    [ https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752477#comment-17752477 ]


Alex Herbert commented on TEXT-228:
-----------------------------------

Thanks for the report. I have verified that the slowdown you describe is still 
present on the current master branch (1.11-SNAPSHOT). The changes made to the 
StringTokenizer class since the 1.9 release show no obvious source for the 
regression. I reset the file to the version released in 1.9 (circa 
2020-07-21) and the issue is still present. Thus the issue must lie in the 
internal classes used by the tokenizer rather than in StringTokenizer itself.
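
To compare versions without generating a 100MB file, the check can be 
sketched as a small timing harness. This is only a rough sketch, not the 
code used for the verification above: the class name TokenizerScaling and 
its helper methods are hypothetical, and the default tokenizer is a JDK 
stand-in (String.split) so that the harness runs without any jar. The 
comment in main shows where the Commons Text call from the reproduction 
script would go.

```java
import java.util.Arrays;
import java.util.function.Function;

public class TokenizerScaling {

    // Builds one line shaped like the report's largefile: two quoted
    // phrases followed by a run of '0' characters of the given length.
    static String buildLine(int zeroCount) {
        char[] zeros = new char[zeroCount];
        Arrays.fill(zeros, '0');
        return "\"SOME TEXT WITH SPACE\" \"SOME TEXT WITH SPACE\" " + new String(zeros);
    }

    // Times a single tokenizer call and returns the elapsed milliseconds.
    static long timeMillis(Function<String, String[]> tokenizer, String line) {
        long start = System.nanoTime();
        String[] tokens = tokenizer.apply(line);
        long elapsedNanos = System.nanoTime() - start;
        if (tokens == null) {
            throw new IllegalStateException("tokenizer returned null");
        }
        return elapsedNanos / 1_000_000L;
    }

    public static void main(String[] args) {
        // Stand-in tokenizer (an assumption) so the sketch runs with the JDK alone.
        // To measure Commons Text, put the jar on the classpath and use:
        //   s -> new org.apache.commons.text.StringTokenizer(s).getTokenArray()
        Function<String, String[]> tokenizer = s -> s.split(" ");

        // Double the padding each round. Times that roughly double suggest
        // linear cost; times that roughly quadruple suggest quadratic cost.
        for (int size = 1 << 16; size <= 1 << 20; size <<= 1) {
            String line = buildLine(size);
            System.out.println(size + " chars of padding: "
                    + timeMillis(tokenizer, line) + " ms");
        }
    }
}
```

Running the same harness against the 1.9 and 1.10.0 jars in turn should 
show whether the gap between them grows with input size. A single timed 
call per size is noisy, so the numbers are only a first indication.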

> StringTokenizer performance degradation when parsing large lines
> ----------------------------------------------------------------
>
>                 Key: TEXT-228
>                 URL: https://issues.apache.org/jira/browse/TEXT-228
>             Project: Commons Text
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>         Environment: Linux
>            Reporter: Zack Hable
>            Priority: Minor
>
> After recently upgrading from Apache Commons Text 1.9 to 1.10.0, we have 
> noticed that our system "hangs" (or, more likely, takes an excessively long 
> time) when splitting large lines (100MB+ in size) with StringTokenizer.
>  
> Mitigation: Revert to Apache Commons Text 1.9
>  
> Scala version:
>  
> {code:java}
> > scala -version
> Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and 
> Lightbend, Inc.
> {code}
>  
> Java version:
>  
> {code:java}
> > java -version 
> openjdk version "1.8.0_382"
> OpenJDK Runtime Environment (build 1.8.0_382-b05)
> OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
> {code}
>  
>  
> Reproduction Steps:
>  # Generate a sample large file
> {code:java}
> echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
> dd if=/dev/zero bs=100MB count=1 >> largefile 
> sed -ie "s/\x0/0/g" largefile
> echo -n "\0" >> largefile
> {code}
>  # Setup reproduce.scala
> {code:java}
> import org.apache.commons.text.StringTokenizer
> val lines = scala.io.Source.fromFile("./largefile").getLines.toList
> val st: StringTokenizer = new StringTokenizer(lines(0))
> val res = st.getTokenArray()
> {code}
>  # Download Apache Commons Jars
> {code:java}
> wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
> wget https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
> {code}
>  # Run program with a 10 second timeout (1.10 seems to hang for >1 minute)
> {code:java}
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s user 0.83s system 121% cpu 2.818 total
>  
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s user 0.00s system 0% cpu 10.002 total
> {code}
> As shown above, 1.9 takes ~3 seconds whereas 1.10.0 times out after 10 
> seconds. I have not determined exactly how long 1.10.0 takes to finish, but 
> it seems to run for more than 1 minute.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
