What is going to happen when one compares two general Unicode series of characters that represent the same string but differ in normalization? Wouldn't the size test would result in false negatives?

http://unicode.org/reports/tr15/

I'm asking because I haven't seen any discussion on the subject, and the decision to change the code as proposed could have side effects.

On 5/27/14 11:59 , Eliot Miranda wrote:



On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists)
<[email protected] <mailto:[email protected]>> wrote:

    __

    Quoting Eliot Miranda <[email protected]
    <mailto:[email protected]>>:

    Hi Phillipe,


    On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall
    <[email protected]
    <mailto:[email protected]>> wrote:

        Hi

        I have been investigating why Dictionary look up performance
        with String keys is not as good as I would expected. Something
        I noted is that String >> #= is implemented in terms of
        #compare:with:collated:. There is no short circuit if Strings
        are not the same size. In my case some Strings have the same
        prefix but a different length eg 'Content-Type' and
        'Content-Length'. In that case a #compare:with:collated: is
        performed even though we know in advance the answer will be
        false because they have different sizes.

    Why not rewrite
    String>>= aString
    "Answer whether the receiver sorts equally as aString.
    The collation order is simple ascii (with case differences)."
    aString isString ifFalse: [ ^ false ].
    ^ (self compare: self with: aString collated: AsciiOrder) = 2
    as
    String>>= aString
    "Answer whether the receiver sorts equally as aString.
    The collation order is simple ascii (with case differences)."
    (aString isString
    and: [self size = aString size]) ifFalse: [^false].
    ^ (self compare: self withSize: with: aString collated:
    AsciiOrder) = 2
    ?


This makes a huge difference, over 3 times faster:

| bs t1 t2 |
bs := ByteString allInstances first: 10000.
t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
(FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
{ t1. t2 } #(13726 4467)
4467 - 13726 / 137.26 -67.46%

    One /could/ add a replacement compare:with:collated:
    primitive primitiveCompareString which took the sizes as arguments
    to avoid asking twice.  But it wouldn't be safe.  One could abuse
    the primitive and lie about the size.  So I suspect it is best to
    add the size check to String>>#= and accept the duplication of
    the primitive finding the sizes of the two strings.  The cost in
    the primitive is minimal.  A WideString version of the primitive
    might pay its way, but if Spur and Sista arrive soon the primitive
    shouldn't be faster than the optimised Smalltalk code.
    --
    best,
    Eliot

    BTW, any good reason for not prefixing all the implementors of #=
    with this?

    "Any object is equal to itself"
    self == argument ifTrue: [ ^ true ].


It doesn't make much difference:

| bs t1 t2 |
bs := ByteString allInstances first: 10000.
t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
(FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
{ t1. t2 } #(4628 4560)

4560 - 4628 / 46.28 -1.47%

So is it worth it?  If you feel it is I've no objection other than it
feels a little kludgey for such little benefit.  And there are the
Symbols if one needs quick comparison and can bear the cost of slow
interning.
--
best,
Eliot


Reply via email to