[
https://issues.apache.org/jira/browse/LANG-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mohammed Alfallaj updated LANG-1206:
------------------------------------
Description:
A readable form of the description is attached in Description_Squeeze.docx.
Below is a text format of the description (in case the .docx does not work):
***********************************************************************
Performance problems of current CharSetUtils.squeeze() algorithm:
***********************************************************************
************************************
FIRST PROBLEM:
************************************
-Method org.apache.commons.lang3.CharSet.contains() is called in each character
iteration for a consecutive set of duplicated characters in str (starting from
the second character in the duplicated set). This large number of calls to
contains() are determined to increase the performance overhead for squeeze()
calls given the high computing cost of executing contains() method. Note that
only one call to contains() is needed for the second character in this set of
duplicated characters. There is no need to repeat the calls to contains()
method for the same duplicated character that we have already checked in the
second occurrence. For example, if str=”abbbbbc” and set=”b”, then the
algorithm should call contains() only once for the second b character
occurrence and prevent calling contains() again given that it has already been
called before for this block. The current algorithm adds high performance
overhead by calling contains() repetitively for each duplicated b character
starting from the second occurrence. For this example where str=”abbbbbc” and
set=”b”, the suggested solution will call contains() method only once instead
of 4 times which will largely reduce the performance overhead of squeeze()
method.
-Consider the following example, System.currentTimeMillis() testing shows a
performance improvement by around 83% if we run the suggested enhanced solution
with the following sample input data
str =
“abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc”
set = “b”
************************************
PROPOSED SOLUTION FOR FIRST PROBLEM:
************************************
-Using two character flags inChars and notInChars:
inChars: flag to record the recent character that has been determined
to be in chars CharSet
notInChars: flag to record the recent character that has been
determined to NOT be in chars CharSet
-Adding the following conditions to prevent unnecessary/expensive calls to
contains() method. Only one call to contains() method is needed starting from
the second character in the consecutive set of duplicated characters:
if ((inChars != null) && (ch == inChars))
this if-statement checks if this ch char has recently been
determined to be in chars CharSet, this prevents calling contains() more than
once for a consecutive set of duplicated characters in str
if one of these conditions fails, it means that the consecutive
set of duplicated characters for this ch character has not yet been checked for
this block of duplicated characters (i.e. has not been checked = we have not
yet done the one-time call to contains method for this particular set of
duplicated characters)
if ((notInChars == null) || (ch != notInChars))
if both conditions are false, it means that ch has recently
been determined to NOT be in chars CharSet
if (chars.contains(ch))
here we call the expensive chars.contains() only once for this
consecutive set of duplicated characters.
-Note: this suggested solution does not break the functionality of
org.apache.commons.lang3.CharSetUtils.squeeze(String str, String… set). This
suggested solution is executed against the latest UnitTest
CharSetUtilsTest.java written by Apache.
-Note: There is an issue created in 10/Jul/2011 titled "CharSetUtils.squeeze()
speedup" LANG-715. This solution does not prevent the repetitive calls to the
expensive contains() method.
************************************
SECOND PROBLEM:
************************************
the current algorithm starts the for-loop with i=0 which is not a necessary
step. This adds the overhead of executing the if condition (i != 0). In this
logic, the algorithm will always append the first character from str to the
buffer. So there is no need to execute (i != 0) for every character in the
consecutive set of duplicated characters.
************************************
PROPOSED SOLUTION FOR SECOND PROBLEM:
************************************
lastChar is assigned the first character in str and appended to buffer before
executing the for loop. Then the for loop begins with i=1 removing the
unnecessary (i != 0) executions.
END OF DESCRIPTION
was:
A readable form of the description is attached in Description_Squeeze.docx.
Below is a text format of the description (in case the .docx does not work):
***********************************************************************
Performance problems of current CharSetUtils.squeeze() algorithm:
***********************************************************************
************************************
FIRST PROBLEM:
************************************
-Method org.apache.commons.lang3.CharSet.contains() is called in each character
iteration for a consecutive set of duplicated characters in str (starting from
the second character in the duplicated set). This large number of calls to
contains() are determined to increase the performance overhead for squeeze()
calls given the high computing cost of executing contains() method. Note that
only one call to contains() is needed for the second character in this set of
duplicated characters. There is no need to repeat the calls to contains()
method for the same duplicated character that we have already checked in the
second occurrence. For example, if str=”abbbbbc” and set=”b”, then the
algorithm should call contains() only once for the second b character
occurrence and prevent calling contains() again given that it has already been
called before for this block. The current algorithm adds high performance
overhead by calling contains() repetitively for each duplicated b character
starting from the second occurrence. For this example where str=”abbbbbc” and
set=”b”, the suggested solution will call contains() method only once instead
of 4 times which will largely reduce the performance overhead of squeeze()
method.
-Consider the following example, System.currentTimeMillis() testing shows a
performance improvement by around 83% if we run the suggested enhanced solution
with the following sample input data
str =
“abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc”
set = “b”
************************************
PROPOSED SOLUTION FOR FIRST PROBLEM:
************************************
-Using two character flags inChars and notInChars:
inChars: flag to record the recent character that has been determined
to be in chars CharSet
notInChars: flag to record the recent character that has been
determined to NOT be in chars CharSet
-Adding the following conditions to prevent unnecessary/expensive calls to
contains() method. Only one call to contains() method is needed starting from
the second character in the consecutive set of duplicated characters:
if ((inChars != null) && (ch == inChars))
this if-statement checks if this ch char has recently been
determined to be in chars CharSet, this prevents calling contains() more than
once for a consecutive set of duplicated characters in str
if one of these conditions fails, it means that the consecutive
set of duplicated characters for this ch character has not yet been checked for
this block of duplicated characters (i.e. has not been checked = we have not
yet done the one-time call to contains method for this particular set of
duplicated characters)
if ((notInChars == null) || (ch != notInChars))
if both conditions are false, it means that ch has recently
been determined to NOT be in chars CharSet
if (chars.contains(ch))
here we call the expensive chars.contains() only once for this
consecutive set of duplicated characters.
-Note: this suggested solution does not break the functionality of
org.apache.commons.lang3.CharSetUtils.squeeze(String str, String… set). This
suggested solution is executed against the latest UnitTest
CharSetUtilsTest.java written by Apache.
************************************
SECOND PROBLEM:
************************************
the current algorithm starts the for-loop with i=0 which is not a necessary
step. This adds the overhead of executing the if condition (i != 0). In this
logic, the algorithm will always append the first character from str to the
buffer. So there is no need to execute (i != 0) for every character in the
consecutive set of duplicated characters.
************************************
PROPOSED SOLUTION FOR SECOND PROBLEM:
************************************
lastChar is assigned the first character in str and appended to buffer before
executing the for loop. Then the for loop begins with i=1 removing the
unnecessary (i != 0) executions.
END OF DESCRIPTION
> CharSetUtils.squeeze() Performance Issue
> ----------------------------------------
>
> Key: LANG-1206
> URL: https://issues.apache.org/jira/browse/LANG-1206
> Project: Commons Lang
> Issue Type: Improvement
> Components: lang.*
> Affects Versions: 3.4
> Reporter: Mohammed Alfallaj
> Priority: Minor
> Labels: patch, performance
> Fix For: 3.4
>
> Attachments: CharSetUtils_squeeze_performance_issue.patch,
> Description_Squeeze.docx
>
>
> A readable form of the description is attached in Description_Squeeze.docx.
> Below is a text format of the description (in case the .docx does not work):
> ***********************************************************************
> Performance problems of current CharSetUtils.squeeze() algorithm:
> ***********************************************************************
> ************************************
> FIRST PROBLEM:
> ************************************
> -Method org.apache.commons.lang3.CharSet.contains() is called in each
> character iteration for a consecutive set of duplicated characters in str
> (starting from the second character in the duplicated set). This large number
> of calls to contains() are determined to increase the performance overhead
> for squeeze() calls given the high computing cost of executing contains()
> method. Note that only one call to contains() is needed for the second
> character in this set of duplicated characters. There is no need to repeat
> the calls to contains() method for the same duplicated character that we have
> already checked in the second occurrence. For example, if str=”abbbbbc” and
> set=”b”, then the algorithm should call contains() only once for the second b
> character occurrence and prevent calling contains() again given that it has
> already been called before for this block. The current algorithm adds high
> performance overhead by calling contains() repetitively for each duplicated b
> character starting from the second occurrence. For this example where
> str=”abbbbbc” and set=”b”, the suggested solution will call contains() method
> only once instead of 4 times which will largely reduce the performance
> overhead of squeeze() method.
> -Consider the following example, System.currentTimeMillis() testing shows a
> performance improvement by around 83% if we run the suggested enhanced
> solution with the following sample input data
> str =
> “abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc”
> set = “b”
> ************************************
> PROPOSED SOLUTION FOR FIRST PROBLEM:
> ************************************
> -Using two character flags inChars and notInChars:
> inChars: flag to record the recent character that has been determined
> to be in chars CharSet
> notInChars: flag to record the recent character that has been
> determined to NOT be in chars CharSet
> -Adding the following conditions to prevent unnecessary/expensive calls to
> contains() method. Only one call to contains() method is needed starting from
> the second character in the consecutive set of duplicated characters:
> if ((inChars != null) && (ch == inChars))
> this if-statement checks if this ch char has recently been
> determined to be in chars CharSet, this prevents calling contains() more than
> once for a consecutive set of duplicated characters in str
> if one of these conditions fails, it means that the consecutive
> set of duplicated characters for this ch character has not yet been checked
> for this block of duplicated characters (i.e. has not been checked = we have
> not yet done the one-time call to contains method for this particular set of
> duplicated characters)
> if ((notInChars == null) || (ch != notInChars))
> if both conditions are false, it means that ch has recently
> been determined to NOT be in chars CharSet
> if (chars.contains(ch))
> here we call the expensive chars.contains() only once for this
> consecutive set of duplicated characters.
> -Note: this suggested solution does not break the functionality of
> org.apache.commons.lang3.CharSetUtils.squeeze(String str, String… set). This
> suggested solution is executed against the latest UnitTest
> CharSetUtilsTest.java written by Apache.
> -Note: There is an issue created in 10/Jul/2011 titled
> "CharSetUtils.squeeze() speedup" LANG-715. This solution does not prevent the
> repetitive calls to the expensive contains() method.
> ************************************
> SECOND PROBLEM:
> ************************************
> the current algorithm starts the for-loop with i=0 which is not a necessary
> step. This adds the overhead of executing the if condition (i != 0). In this
> logic, the algorithm will always append the first character from str to the
> buffer. So there is no need to execute (i != 0) for every character in the
> consecutive set of duplicated characters.
> ************************************
> PROPOSED SOLUTION FOR SECOND PROBLEM:
> ************************************
> lastChar is assigned the first character in str and appended to buffer before
> executing the for loop. Then the for loop begins with i=1 removing the
> unnecessary (i != 0) executions.
> END OF DESCRIPTION
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)