[
https://issues.apache.org/jira/browse/LANG-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293874#comment-15293874
]
ASF GitHub Bot commented on LANG-1206:
--------------------------------------
GitHub user PascalSchumacher opened a pull request:
https://github.com/apache/commons-lang/pull/147
LANG-1206: improve CharSetUtils.squeeze() performace
patch by Mohammed Alfallaj
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/PascalSchumacher/commons-lang
CharSetUtils#squeeze
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/commons-lang/pull/147.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #147
----
commit 8c87ace5492a7093e93723571d95aeab47539166
Author: pascalschumacher <[email protected]>
Date: 2016-05-20T18:04:30Z
LANG-1206: improve CharSetUtils.squeeze() performace
patch by Mohammed Alfallaj
----
> CharSetUtils.squeeze() Performance Issue
> ----------------------------------------
>
> Key: LANG-1206
> URL: https://issues.apache.org/jira/browse/LANG-1206
> Project: Commons Lang
> Issue Type: Improvement
> Components: lang.*
> Affects Versions: 3.4
> Reporter: Mohammed Alfallaj
> Labels: patch, performance
> Fix For: 3.4
>
> Attachments: CharSetUtils_squeeze_performance_issue.patch,
> Description_Squeeze.docx
>
>
> Before reading the description below, please note that there was an issue
> created in 10/Jul/2011 titled "CharSetUtils.squeeze() speedup" LANG-715. This
> solution does not prevent the repetitive calls to the expensive contains()
> method explained below. In addition, I updated this to (major) because of the
> high CPU cost of the current solution.
> A readable form of the description is attached in Description_Squeeze.docx.
> Below is a text format of the description (in case the .docx does not work):
> ***********************************************************************
> Performance problems of current CharSetUtils.squeeze() algorithm:
> ***********************************************************************
> ************************************
> FIRST PROBLEM:
> ************************************
> -Method org.apache.commons.lang3.CharSet.contains() is called in each
> character iteration for a consecutive set of duplicated characters in str
> (starting from the second character in the duplicated set). This large number
> of calls to contains() are determined to increase the performance overhead
> for squeeze() calls given the high computing cost of executing contains()
> method. Note that only one call to contains() is needed for the second
> character in this set of duplicated characters. There is no need to repeat
> the calls to contains() method for the same duplicated character that we have
> already checked in the second occurrence. For example, if str=”abbbbbc” and
> set=”b”, then the algorithm should call contains() only once for the second b
> character occurrence and prevent calling contains() again given that it has
> already been called before for this block. The current algorithm adds high
> performance overhead by calling contains() repetitively for each duplicated b
> character starting from the second occurrence. For this example where
> str=”abbbbbc” and set=”b”, the suggested solution will call contains() method
> only once instead of 4 times which will largely reduce the performance
> overhead of squeeze() method.
> -Consider the following example, System.currentTimeMillis() testing shows a
> performance improvement by around 83% if we run the suggested enhanced
> solution with the following sample input data
> str =
> “abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc”
> set = “b”
> ************************************
> PROPOSED SOLUTION FOR FIRST PROBLEM:
> ************************************
> -Using two character flags inChars and notInChars:
> inChars: flag to record the recent character that has been determined
> to be in chars CharSet
> notInChars: flag to record the recent character that has been
> determined to NOT be in chars CharSet
> -Adding the following conditions to prevent unnecessary/expensive calls to
> contains() method. Only one call to contains() method is needed starting from
> the second character in the consecutive set of duplicated characters:
> if ((inChars != null) && (ch == inChars))
> this if-statement checks if this ch char has recently been
> determined to be in chars CharSet, this prevents calling contains() more than
> once for a consecutive set of duplicated characters in str
> if one of these conditions fails, it means that the consecutive
> set of duplicated characters for this ch character has not yet been checked
> for this block of duplicated characters (i.e. has not been checked = we have
> not yet done the one-time call to contains method for this particular set of
> duplicated characters)
> if ((notInChars == null) || (ch != notInChars))
> if both conditions are false, it means that ch has recently
> been determined to NOT be in chars CharSet
> if (chars.contains(ch))
> here we call the expensive chars.contains() only once for this
> consecutive set of duplicated characters.
> -Note: this suggested solution does not break the functionality of
> org.apache.commons.lang3.CharSetUtils.squeeze(String str, String… set). This
> suggested solution is executed against the latest UnitTest
> CharSetUtilsTest.java written by Apache.
> ************************************
> SECOND PROBLEM:
> ************************************
> the current algorithm starts the for-loop with i=0 which is not a necessary
> step. This adds the overhead of executing the if condition (i != 0). In this
> logic, the algorithm will always append the first character from str to the
> buffer. So there is no need to execute (i != 0) for every character in the
> consecutive set of duplicated characters.
> ************************************
> PROPOSED SOLUTION FOR SECOND PROBLEM:
> ************************************
> lastChar is assigned the first character in str and appended to buffer before
> executing the for loop. Then the for loop begins with i=1 removing the
> unnecessary (i != 0) executions.
> END OF DESCRIPTION
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)