Re: CharArraySet makes excessive use of Character.toLowerCase

Shai Erera Sun, 28 Dec 2008 22:58:56 -0800

Ok it might not be that easy to make the changes.

While add() clearly states that the passed char[] may be modified if
ignoreCase is true, it is not the case for contains(char[]). If I first
lower case the passed in char[], it modifies the instance the application
holds, and then the "Now" will be converted to "now"
(TestStopFilter.testIgnoreCase checks that).


What happens now is that Character.toLowerCase lowercases the character, but
does not modify the array.

I still think we should solve it, because there's no point lowercasing the
the characters in getHashCode and equals for the add methods. However, I
assume that the contains() method is called more often, and therefore it
should be even more efficient than the add() ones. So allocating an internal
char[] for contains is out of the question, but I wonder if we can document
that contains(char[]) may change the array (I doubt it though, since it
means an API change).

I'll think about it more, but I'm open for suggestions.

Shai

On Mon, Dec 29, 2008 at 8:29 AM, Shai Erera <ser...@gmail.com> wrote:

> Thanks,
>
> I'll open an issue and create a patch
>
>
> On Sun, Dec 28, 2008 at 5:55 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>>
>> I think all of these changes make sense!
>> I would just remove that if (false &&...) dead code.
>>
>> Mike
>>
>>
>> Shai Erera <ser...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I noticed that CharArraySet uses Character.toLowerCase in many places in
>>> the code. I think it uses it too much unnecessarily.
>>> For example, the equals(char[], int, int, char[]) method converts the
>>> characters to lower case if ignoreCase is true, although it could have been
>>> converted in one of the appropriate public APIs (like add or contains).
>>> Moreover, the equals() method may be called several times during the same
>>> add/contains call, and the value will still be lowercased every time.
>>>
>>> The same logic cannot be applied on the CharSequence contains API though,
>>> for efficiency reasons. The rest (including all the add() methods) can be
>>> changed to first convert the value to lowercase form (if ignoreCase is
>>> false) and then simplify getSlot() and equals() by removing the ignoreCase
>>> check from them.
>>>
>>> What do you think? I can prepare a patch if you agree to make this
>>> change.
>>>
>>> Also, few minor changes we can push through in this patch are:
>>> One
>>> ------
>>> add(Object) looks like this:
>>>     if (o instanceof char[]) {
>>>       return add((char[])o);
>>>     } else if (o instanceof String) {
>>>       return add((String)o);
>>>     } else if (o instanceof CharSequence) {
>>>       return add((CharSequence)o);
>>>     } else {
>>>       return add(o.toString());
>>>     }
>>> I don't think there's a need to check for instanceof String. Instead it
>>> could be written simply like this:
>>>     if (o instanceof char[]) {
>>>       return add((char[])o);
>>>     }
>>>     return (add(o.toString());
>>> The reason is that add(String) and add(CharSequence) eventually do the
>>> same. If the identification of CharSequence is still important, it can be
>>> written like this:
>>>     if (o instanceof char[]) {
>>>       return add((char[])o);
>>>     } else if (o instanceof CharSequence) {
>>>       return add((CharSequence)o);
>>>     } else {
>>>       return add(o.toString());
>>>     }
>>>
>>> Two
>>> ------
>>> getHashCode(CharSequence) has this check:
>>>       if (false && text instanceof String) {
>>>         code = text.hashCode();
>>> Why is it there? It will always fail.
>>>
>>> Shai
>>>
>>
>>
>

Re: CharArraySet makes excessive use of Character.toLowerCase

Reply via email to