Mohammed Alfallaj created LANG-1206:
---------------------------------------

             Summary: CharSetUtils.squeeze() Performance Issue
                 Key: LANG-1206
                 URL: https://issues.apache.org/jira/browse/LANG-1206
             Project: Commons Lang
          Issue Type: Improvement
          Components: lang.*
    Affects Versions: 3.4
            Reporter: Mohammed Alfallaj
            Priority: Minor
             Fix For: 3.4


A readable form of the description is attached in Description_Squeeze.docx.

Below is a text format of the description (in case the .docx does not work):


***********************************************************************
Performance problems of current CharSetUtils.squeeze() algorithm:
***********************************************************************


************************************
FIRST PROBLEM: 
************************************
-Method org.apache.commons.lang3.CharSet.contains() is called in each character 
iteration for a consecutive set of duplicated characters in str (starting from 
the second character in the duplicated set). This large number of calls to 
contains() are determined to increase the performance overhead for squeeze() 
calls given the high computing cost of executing contains() method. Note that 
only one call to contains() is needed for the second character in this set of 
duplicated characters. There is no need to repeat the calls to contains() 
method for the same duplicated character that we have already checked in the 
second occurrence. For example, if str=”abbbbbc” and set=”b”, then the 
algorithm should call contains() only once for the second b character 
occurrence and prevent calling contains() again given that it has already been 
called before for this block. The current algorithm adds high performance 
overhead by calling contains() repetitively for each duplicated b character 
starting from the second occurrence. For this example where str=”abbbbbc” and 
set=”b”, the suggested solution will call contains() method only once instead 
of 4 times which will largely reduce the performance overhead of squeeze() 
method. 

-Consider the following example, System.currentTimeMillis() testing shows a 
performance improvement by around 83% if we run the suggested enhanced solution 
with the following sample input data 
        str = 
“abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc”
        set = “b”

************************************
PROPOSED SOLUTION FOR FIRST PROBLEM:
************************************
-Using two character flags inChars and notInChars:
        inChars: flag to record the recent character that has been determined 
to be in chars CharSet
        notInChars: flag to record the recent character that has been 
determined to NOT be in chars CharSet
-Adding the following conditions to prevent unnecessary/expensive calls to 
contains() method. Only one call to contains() method is needed starting from 
the second character in the consecutive set of duplicated characters:
        if ((inChars != null) && (ch == inChars))
                this if-statement checks if this ch char has recently been 
determined to be in chars CharSet, this prevents calling contains() more than 
once for a consecutive set of duplicated characters in str
                if one of these conditions fails, it means that the consecutive 
set of duplicated characters for this ch character has not yet been checked for 
this block of duplicated characters (i.e. has not been checked = we have not 
yet done the one-time call to contains method for this particular set of 
duplicated characters)
        if ((notInChars == null) || (ch != notInChars))
                if both conditions are false, it means that ch has recently 
been determined to NOT be in chars CharSet
        if (chars.contains(ch))
                here we call the expensive chars.contains() only once for this 
consecutive set of duplicated characters.
-Note: this suggested solution does not break the functionality of 
org.apache.commons.lang3.CharSetUtils.squeeze(String str, String… set). This 
suggested solution is executed against the latest UnitTest 
CharSetUtilsTest.java written by Apache.




************************************
SECOND PROBLEM: 
************************************
the current algorithm starts the for-loop with i=0 which is not a necessary 
step. This adds the overhead of executing the if condition (i != 0). In this 
logic, the algorithm will always append the first character from str to the 
buffer. So there is no need to execute (i != 0) for every character in the 
consecutive set of duplicated characters.

************************************
PROPOSED SOLUTION FOR SECOND PROBLEM: 
************************************
lastChar is assigned the first character in str and appended to buffer before 
executing the for loop. Then the for loop begins with i=1 removing the 
unnecessary (i != 0) executions.




END OF DESCRIPTION



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to