[perl.git] branch smoke-me/khw-new_locale, created. v5.23.9-106-gda33027

Karl Williamson Tue, 12 Apr 2016 14:02:55 -0700

In perl.git, the branch smoke-me/khw-new_locale has been created

<http://perl5.git.perl.org/perl.git/commitdiff/da33027a6036e566ad2cee80f0c46069765b45f8?hp=0000000000000000000000000000000000000000>


        at  da33027a6036e566ad2cee80f0c46069765b45f8 (commit)

- Log -----------------------------------------------------------------
commit da33027a6036e566ad2cee80f0c46069765b45f8
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 14:28:57 2016 -0600

    locale.c: XXX
    
    When strxfrm() fails due to needing a larger buffer, prior to this
    commit, the size was doubled before trying again.  This could require a
    lot of memory on large inputs.  This commit changes it so it is not so
    aggressive.  There are two changes that I think warrant this.  One is
    that the formula used to guess how large a buffer is needed is more
    accurate than before, so our guess is not likely to be that far off.
    The second is that on many platforms, we can calculate precisely how
    much is needed, which we do after two attempts have come up short.

M       locale.c

commit cbe1b32c6233393dbd15fc8d710e5d1cb72a91e9
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 14:26:53 2016 -0600

    locale.c: Add some debugging statements

M       locale.c

commit b58984751b073b0f75e414c766c2f430f46e4f92
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 14:19:21 2016 -0600

    locale.c: Fix some debugging so will output during initialization
    
    Because the command line options are currently parsed after the locale
    initialization is done, an environment variable is read to allow
    debugging of the function that is called to do the initialization.
    However, any functions that it calls, prior to this commit, were unaware
    of this and so did not output debugging.  This commit fixes most of
    them.

M       locale.c

commit 1750a56a7ac354962900b6b24411ef67d12fe4a9
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 13:54:32 2016 -0600

    perllocale: Document collation changes

M       pod/perllocale.pod

commit 4daf2d7642811f18077b390bbbbd50f445af31bd
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 13:51:48 2016 -0600

    perllocale: Change headings so two aren't identical
    
    Two HTML anchors in this pod were identical, which isn't a problem
    unless you try to link to one of them.

M       pod/perllocale.pod

commit 4694bff19d6283907d39b5b6d4371fefa89d8509
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 13:50:12 2016 -0600

    perllocale: Unicode has changed their data; fix references
    
    We say something here that is no longer true; update it.

M       pod/perllocale.pod

commit 7ab4dc000974ad39ce5527a1ba246332decef1ba
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 12:49:36 2016 -0600

    mv function from locale.c to mathoms.c
    
    The previous commit reduced the function being moved here to just a
    wrapper that is not called in core.  Just in case someone is calling
    it, it is retained, but moved to mathoms.c

M       embed.fnc
M       embed.h
M       locale.c
M       mathoms.c
M       proto.h

commit 2ab93c0f9b8b33f7464fbb533086c16bc394a11e
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 12:17:48 2016 -0600

    Do better locale collation in UTF-8 locales
    
    strxfrm() works reasonably well on some platforms under UTF-8 locales.
    It will assume that every string passed to it is in UTF-8.  This commit
    changes perl to make sure that strxfrm's expectations are met.
    
    Likewise under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
    string.   And this commit makes sure of that.  If the passed string
    contains code points representable only in UTF-8, they are changed into
    the highest collating code point that doesn't require UTF-8.  This
    provides seamless operation, as they end up collating after every
    non-UTF-8 code point.  If two transformed strings compare equal, perl
    already uses the un-transformed versions to break ties, and there, these
    faked-up strings will collate after everything else, and in code point
    order amongst themselves.
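
The substitution described for non-UTF-8 locales can be sketched as below. This is only an illustration: the function name, the array-of-code-points input, and the `highest` parameter (standing in for whichever byte the locale collates last) are assumptions, not code from locale.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n code points into a single-byte buffer for
 * a non-UTF-8 locale.  Any code point above 0xFF cannot be represented
 * there, so it is replaced by `highest`, assumed to be the byte that
 * collates last in the current locale. */
size_t clamp_for_single_byte_locale(const uint32_t *cps, size_t n,
                                    unsigned char highest,
                                    unsigned char *out)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = (cps[i] > 0xFF) ? highest : (unsigned char) cps[i];

    return n;   /* number of bytes written */
}
```

Because every replaced code point maps to the same byte, two such strings can transform identically; the tie-break on the untransformed strings mentioned above is what then orders them.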

M       embed.fnc
M       embed.h
M       embedvar.h
M       intrpvar.h
M       locale.c
M       proto.h
M       sv.c

commit e445d2a3a1fe1fe6519097c257bd3cc149267923
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 11:21:40 2016 -0600

    XXX delta, RT Change calculation of locale collation constants
    
    Every time a new collation locale is set, two constants are calculated
    that are used in an equation determining how much space to pre-allocate
    for the results of strxfrm() on strings that need to be collated.  If
    these are too small, it is not the end of the world, as we will increase
    the space until we have enough.  But each time we do, we have to throw
    away an expensive strxfrm() result.  So it's good to get these constants
    set so that rarely happens.
    
    The length of the transformed string is roughly linear in the length
    of the input string, so we are calculating 'm' and 'b' such that
    
        transformed_length = m * input_length + b
    
    Prior to this commit, the calculation was not rigorous, and failed on
    some platforms that don't have a fully conforming strxfrm().
    
    This commit changes to not panic if a locale has an apparent defective
    collation, but instead silently ignores it.  It could be argued that a
    warning should instead be raised.
    
    This commit fixes [perl #121734].
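
One way to derive 'm' and 'b' is to transform two sample strings of different lengths and fit a line through the results. The sketch below assumes a strxfrm() that accepts a NULL buffer with size 0 (which, as a later commit notes, not every platform supports); the struct, the function name, and the sample strings are hypothetical and not taken from locale.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the two constants. */
typedef struct {
    size_t m;   /* slope: transformed bytes per input byte */
    size_t b;   /* intercept: fixed overhead */
    int    ok;  /* 0 if the locale's collation looks defective */
} xfrm_fit;

/* Fit transformed_length = m * input_length + b from two samples. */
xfrm_fit fit_xfrm(void)
{
    static const char shorter[] = "ab";
    static const char longer[]  = "abababababababababab";
    const size_t x1 = sizeof shorter - 1;
    const size_t x2 = sizeof longer - 1;
    xfrm_fit fit = { 0, 0, 0 };

    /* With a NULL buffer and size 0, a conforming strxfrm() returns the
     * length it would need, without writing anything. */
    const size_t y1 = strxfrm(NULL, shorter, 0);
    const size_t y2 = strxfrm(NULL, longer, 0);

    /* A longer input that does not transform to something longer is
     * treated as defective and silently ignored rather than panicking. */
    if (y2 <= y1)
        return fit;

    /* Round the slope up so the line never undershoots the samples. */
    fit.m = (y2 - y1 + (x2 - x1) - 1) / (x2 - x1);
    fit.b = (y1 > fit.m * x1) ? y1 - fit.m * x1 : 0;
    fit.ok = 1;
    return fit;
}
```

A caller would then allocate `fit.m * strlen(input) + fit.b + 1` bytes for the first strxfrm() attempt, the extra byte being for the terminating NUL.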

M       locale.c

commit 25581c95516786524fb91191f7cf4b2c50504493
Author: Karl Williamson <[email protected]>
Date:   Mon Apr 11 19:11:07 2016 -0600

    locale.c: Change algorithm for strxfrm() trials
    
    It's kind of guess work deciding how big a buffer to give to strxfrm().
    If you give it too small a one, it will fail.  Prior to this commit, the
    buffer size was doubled and then strxfrm() was called again, looping
    until it worked, or we used too much memory.
    
    Each time a new locale is made, we try to minimize the necessity of
    doing this by calculating numbers 'm' and 'b' that can be plugged into
    the equation
    
        mx + b
    
    where 'x' is the size of the string passed to strxfrm().  strxfrm() is
    roughly linear with respect to its input's length, so this generally
    works without us having to do many loops to get a large enough size.
    
    But on many systems, there is a better method.  If you pass NULL as the
    output buffer, strxfrm() will calculate exactly how much space it needs,
    so you don't have to keep trying.  There are two glitches.  If we did
    this on all inputs, all would have to call strxfrm twice, a potentially
    expensive operation.  The other glitch is that this doesn't work on all
    systems.
    
    If we have calculated 'm' and 'b' well enough, the transformation will
    succeed the first time; and if it doesn't it's likely to be close, so
    the 2nd time through the loop will almost certainly be enough.
    
    But if not, this commit changes things so that the 3rd time it tries the
    NULL buffer method to get an exact value without having to keep trying
    over and over.  For strxfrms that don't work well for this, we use the
    old method, increasing the buffer and trying again and again until we
    get it right.
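
The retry strategy can be sketched roughly as follows. The function name, the attempt cap, and the choice to double the buffer on the fallback path are assumptions made for illustration; the real code in locale.c may grow the buffer differently.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ATTEMPTS 12   /* hypothetical cap on retries */

/* Transform `in` using the linear guess m*x + b for the first buffer.
 * Returns a malloc'd transformed string (caller frees) or NULL. */
char *xfrm_with_retries(const char *in, size_t m, size_t b, size_t *out_len)
{
    size_t size = m * strlen(in) + b + 1;   /* +1 for the trailing NUL */
    char *buf = NULL;

    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        char *grown = realloc(buf, size);
        if (grown == NULL) {
            free(buf);
            return NULL;
        }
        buf = grown;

        /* A return value below `size` means the result fit. */
        size_t needed = strxfrm(buf, in, size);
        if (needed < size) {
            *out_len = needed;
            return buf;
        }

        /* Two attempts have failed: ask for the exact size via the
         * NULL-buffer call, so the third attempt should succeed. */
        if (attempt == 2) {
            size_t exact = strxfrm(NULL, in, 0);
            if (exact >= size) {
                size = exact + 1;
                continue;
            }
            /* Otherwise the answer is implausible; fall through. */
        }

        size *= 2;   /* old method: grow and try again */
    }

    free(buf);
    return NULL;
}
```

With a good 'm' and 'b' the loop body runs once; the NULL-buffer call is paid for only by the rare inputs that miss twice.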

M       locale.c

commit d522251feed94f30a9598379df7d42c91cc8f876
Author: Karl Williamson <[email protected]>
Date:   Sat Apr 9 20:40:48 2016 -0600

    locale.c: Free over-allocated space early
    
    We may over malloc some space in buffers to strxfrm().  This frees it
    now instead of waiting for the whole block to be freed sometime later.
    This can be a significant amount of memory if the input string to
    strxfrm() is long.
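
Returning the unused tail of an over-sized buffer can be done with realloc(). A minimal sketch, assuming the buffer came from malloc() and holds a NUL-terminated strxfrm() result (the helper name is hypothetical):

```c
#include <stdlib.h>
#include <string.h>

/* Shrink a malloc'd buffer down to the string it holds plus the
 * terminating NUL.  If realloc() fails, the original, larger buffer is
 * still valid and is returned unchanged. */
char *shrink_to_fit(char *buf)
{
    char *smaller = realloc(buf, strlen(buf) + 1);

    return smaller ? smaller : buf;
}
```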

M       locale.c

commit 54a55c7b8bec038f3b2473f3e622464005fe4941
Author: Karl Williamson <[email protected]>
Date:   Sat Apr 9 20:36:01 2016 -0600

    locale.c:  White-space only
    
    Outdent and reflow because the previous commit removed an enclosing
    block.

M       locale.c

commit 52f0eef2e26e9e581cb95d93f32fe243aba88391
Author: Karl Williamson <[email protected]>
Date:   Sat Apr 9 15:52:05 2016 -0600

    Use different algorithm in mem_collxfrm() to handle embedded NULs
    
    Perl uses strxfrm() to handle collation.  This C library function
    expects a NUL-terminated input string.  But Perl accepts interior NUL
    characters, so something has to happen.
    
    Until this commit, what happened was that each NUL-terminated
    sub-segment would be individually passed to strxfrm(), with the results
    concatenated together to form the transformation of the whole string
    with NULs ignored.  But this isn't guaranteed to give good results, as
    strxfrm() is highly context sensitive, and needs the whole string, not
    segments, to work properly.  The result of strxfrm() is likely to be
    something like several modified copies of the entire input string
    concatenated together, with the first copy being the primary collation
    weights, the second being the secondary weights, etc.  Giving strxfrm()
    only substrings defeats this.
    
    Another possibility is to just remove the NULs before transforming the
    string.  The problem with this method is that in some locales, two
    adjacent characters can behave differently than if they were separated,
    so removing the NUL screws up the context strxfrm() may need.
    
    What this commit does is replace each NUL with a \001.
    This is almost certainly going to behave like we expect a NUL would if
    it were legal.  Just about every locale treats low code points as
    controls, to be ignored in primary weighting, and perhaps secondary as
    well.
    
    If two strings compare identically, and one had initially NULs, and the
    other \001's, then the tie breaker is to compare the original strings,
    so the NUL string would sort earlier than the \001 one.  Hence this
    method gives the desired results.
    
    As stated in the comments, we could go through the first 256 code points
    to determine the lowest collating one, instead of assuming it is \001.
    But this is a lot of work (UTF-8ness must be considered) and it will be
    extremely rare that the answer isn't going to be \001.
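
The replacement step can be sketched as below, assuming the input arrives as a pointer plus an explicit length (since it may contain NULs). The helper name is hypothetical, and \001 is hard-coded as the commit describes.

```c
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd, NUL-terminated copy of the len bytes at s in which
 * every interior NUL has been replaced by \001, making the whole string
 * safe to hand to strxfrm() in one call.  Caller frees the result. */
char *replace_nuls(const char *s, size_t len)
{
    char *copy = malloc(len + 1);

    if (copy == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++)
        copy[i] = (s[i] == '\0') ? '\001' : s[i];

    copy[len] = '\0';
    return copy;
}
```

The copy has the same length as the original, so the context around each former NUL is preserved for strxfrm().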

M       embed.fnc
M       locale.c
M       proto.h

commit 1d34b8f6a6cd45c0c10687bb07295ae47918f866
Author: Karl Williamson <[email protected]>
Date:   Sat Apr 9 15:03:48 2016 -0600

    locale.c, sv.c: Add some comments
    
    And a couple empty lines

M       locale.c
M       sv.c

commit 734501fecea21dc2bf9823e600ba33b9a190bd1a
Author: Karl Williamson <[email protected]>
Date:   Sat Apr 9 15:16:59 2016 -0600

    locale.c: Some nano-optimizations
    
    Reorder two branches so the most likely is tested before the much less
    likely, and add some UNLIKELY()

M       locale.c

commit b65815a41a991d501e5dcc721182744d7b57ee7d
Author: Karl Williamson <[email protected]>
Date:   Sat Apr 9 14:47:21 2016 -0600

    locale.c: Clarify a debugging statement

M       locale.c

commit 2ddc0e81e2e57377903d7e66f6a5528715a9d0e5
Author: Karl Williamson <[email protected]>
Date:   Fri Apr 8 13:46:24 2016 -0600

    XXX 5.25 strxfrm, cautions

M       ext/POSIX/lib/POSIX.pod

commit 79d77d930e9db6edf83188ff39029f1e47843c95
Author: Tony Cook <[email protected]>
Date:   Mon Mar 21 12:12:58 2016 +1100

    add d_duplocale and i_locale Configure probes

M       Configure
M       Cross/config.sh-arm-linux
M       NetWare/config.wc
M       Porting/Glossary
M       Porting/config.sh
M       config_h.SH
M       configure.com
M       plan9/config_sh.sample
M       symbian/config.sh
M       uconfig.h
M       uconfig.sh
M       uconfig64.sh
M       win32/config.ce
M       win32/config.gc
M       win32/config.vc
-----------------------------------------------------------------------

--
Perl5 Master Repository
