[perl.git] branch smoke-me/khw-locale created. v5.27.7-159-g69bea2f4c7

Karl Williamson Mon, 08 Jan 2018 22:56:27 -0800

In perl.git, the branch smoke-me/khw-locale has been created

<https://perl5.git.perl.org/perl.git/commitdiff/69bea2f4c70a5875f99d9e09823a6b58da45f399?hp=0000000000000000000000000000000000000000>


        at  69bea2f4c70a5875f99d9e09823a6b58da45f399 (commit)

- Log -----------------------------------------------------------------
commit 69bea2f4c70a5875f99d9e09823a6b58da45f399
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 8 18:21:12 2018 -0700

    locale.c: Revamp fallback detection of UTF-8 locales
    
    This commit continues the process started in the previous few commits to
    improve the detection of whether a locale is UTF-8 or not when the
    platform doesn't have the more modern tools available.
    
    What was done before was to examine various texts in a locale, like the
    days of the week, and see if they are legal UTF-8 or not.  If any
    contained non-ASCII characters, and all were legal UTF-8, it assumed
    that the locale was UTF-8.  If none did (as in American English), it
    looked at the locale's name.  This yields both false negatives and
    false positives.
    
    This commit adds the constraint that all the texts need to be in the
    same script when interpreted as UTF-8, which essentially rules out any
    false positives when the script isn't Latin.  With Latin, it isn't so
    clear cut, as ASCII Latin letters can be intermixed with variant
    sequences that could come from some non-UTF-8 Latin locale and just
    coincidentally happen to be syntactically valid UTF-8.  Because of
    the structuredness of UTF-8, the odds of a coincidence go down with
    increasing numbers of variants in a row.  This also isn't likely to
    happen with ISO 8859-1, as the bytes that could be legal continuations
    in UTF-8 are almost entirely controls or punctuation.  But in other
    locales in the 8859 series, there are some legal continuations that
    could be part of a month name, say.
    
    As an example of the issues, in 8859-2, one could have \xC6 (C with
    acute) followed by \xB1 (a with ogonek), which in UTF-8 would be
    U+01B1: LATIN CAPITAL LETTER UPSILON.  However, something like \xCD
    (i acute) followed by \xB3 (l with stroke) yields U+0373: GREEK
    SMALL LETTER ARCHAIC SAMPI, and the script check added by this commit
    would catch that.  In non-Latin texts, the only permissible ASCII
    characters would be punctuation, and you aren't going to have many of
    those in the LC_TIME strings, and certainly not in a row.  Instead those
    will consist of at least several variant characters in a row, and the
    odds of those coincidentally being syntactically valid UTF-8 and
    semantically in the same script are exceedingly low.
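
The two byte pairs in the example above can be checked by hand. The following is a small Python sketch for illustration only (locale.c itself is C); the helper name `describe` is hypothetical. It decodes each pair once as ISO 8859-2 and once as UTF-8 and reports the Unicode name of the resulting UTF-8 code point.

```python
import unicodedata

# Byte pairs taken from the commit message above.
PAIRS = [b"\xC6\xB1", b"\xCD\xB3"]

def describe(raw):
    """Return (ISO 8859-2 reading, 'U+XXXX NAME' of the UTF-8 reading)."""
    as_latin2 = raw.decode("iso8859_2")   # two separate letters
    as_utf8 = raw.decode("utf-8")         # one two-byte character
    label = "U+%04X %s" % (ord(as_utf8), unicodedata.name(as_utf8))
    return as_latin2, label

for raw in PAIRS:
    print(raw, describe(raw))
```

The first pair stays within Latin under either reading, so a script check cannot tell the two apart; the second pair becomes a Greek letter under UTF-8, which is the kind of mismatch the script constraint detects.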
    
    To catch Latin UTF-8 locales, this commit adds a list of the distinct
    variants found so far.  If there are even just several of these, the
    odds of the syntax being coincidentally UTF-8 greatly diminish.  The
    number needed for this to conclude that the locale is UTF-8 is easily
    tweakable at compile time.
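
A rough Python sketch of the distinct-variant idea follows. It is not the code in locale.c: the function name, the three-valued return, and the default threshold of 3 are all assumptions made for illustration (the commit only says the number is tweakable at compile time).

```python
def looks_like_utf8(texts, needed_distinct=3):
    """Judge a list of byte strings (e.g. day and month names).

    Returns False if any text is malformed UTF-8, True if enough
    distinct non-ASCII characters were seen, and None if the evidence
    is too thin to decide.  needed_distinct is a hypothetical stand-in
    for the compile-time knob.
    """
    seen = set()
    for raw in texts:
        try:
            decoded = raw.decode("utf-8")
        except UnicodeDecodeError:
            return False          # one malformed text rules UTF-8 out
        seen.update(ch for ch in decoded if ord(ch) > 0x7F)
    if len(seen) >= needed_distinct:
        return True
    return None                   # caller falls back to other checks

czech_days = [d.encode("utf-8") for d in
              ("pondělí", "úterý", "středa", "čtvrtek",
               "pátek", "sobota", "neděle")]
```

With these inputs, the Czech day names supply seven distinct variants and pass, while plain English names give no evidence either way and leave the decision to the locale-name check.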
    
    The problem remains for English and other Latin script languages that
    have rare accented characters.  The name is still then examined for
    containing "UTF-8".  Note that previous commits have guaranteed that if
    the locale has a non-ASCII currency symbol that is recognized by
    Unicode, such as the Euro or Pound Sterling, that will correctly be
    recognized.

commit 4b9e9701899b012afa834007c977f716b3ac832d
Author: Karl Williamson <[email protected]>
Date:   Sun Jan 7 16:22:27 2018 -0700

    locale.c: Improved fallback UTF-8 locale detection
    
    This adds some more checks for when the platform lacks mbtowc().  We can
    check if things like isprint(), toupper() match what a UTF-8 locale
    would do.  If not, we can rule out UTF-8.

commit 627c39afde50740adc91b8fecb475d661d3ec659
Author: Karl Williamson <[email protected]>
Date:   Sat Jan 6 16:00:02 2018 -0700

    Improve fallback UTF-8 locale detection
    
    If the libc doesn't have modern enough routines, we use a fallback
    mechanism to see if a locale is UTF-8 or not.  One component of this is
    to look at the byte sequence for the currency symbol.  Obviously, if the
    sequence isn't valid UTF-8, the locale isn't either.  But if it is valid
    UTF-8, and hence might be a UTF-8 locale, this commit changes the
    detection mechanism to see if the sequence evaluates, when interpreted
    as UTF-8, to be a known Unicode currency symbol.  If so, the locale must
    be UTF-8, as the odds of some other locale having a sequence that does
    this are vanishingly small.
    
    If the sequence doesn't evaluate to a currency symbol, that doesn't tell
    us anything, as plenty of places have a string of letters be their
    currency symbol.  Nor if the symbol is a '$', as that is invariant under
    UTF-8 vs not, so doesn't help us.
    
    This pretty much guarantees that a UTF-8 locale for the European Union
    or the UK that otherwise looks like plain English (Latin script) will be
    properly determined to be UTF-8, as the symbols for their currencies
    will pass this test.
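
The currency-symbol test can be sketched in Python as below. This is an illustration under assumptions, not the locale.c code: it treats "known Unicode currency symbol" as "non-ASCII character with General_Category Sc", and the function name is hypothetical.

```python
import unicodedata

def utf8ness_from_currency(raw):
    """Classify a locale from the bytes of its currency symbol.

    False: the bytes are not valid UTF-8, so the locale is not UTF-8.
    True:  they decode to a non-ASCII currency symbol (category Sc).
    None:  inconclusive (letters such as b"USD", or an ASCII b"$").
    """
    try:
        symbol = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    for ch in symbol:
        if ord(ch) > 0x7F and unicodedata.category(ch) == "Sc":
            return True
    return None
```

For instance, the UTF-8 Euro sign `b"\xe2\x82\xac"` is conclusive, the ISO 8859-15 Euro byte `b"\xa4"` is not valid UTF-8 on its own, and `b"$"` says nothing because it is the same byte in either kind of locale.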

commit b415cafb64652ec024ce4f075cfceb6715cc0042
Author: Karl Williamson <[email protected]>
Date:   Sat Jan 6 14:24:30 2018 -0700

    locale.c: Avoid localeconv()
    
    my_langinfo() is a recently added function which presents a better API
    than localeconv, and returns the needed information here, and is easier
    to make thread-safe.

commit 2987ed9761bea2f5bc0476049b9002a0a48a89be
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 8 17:37:15 2018 -0700

    locale.c: White-space only
    
    This indents all this code, with no other changes, in preparation for a
    future commit which will add a block around it.

commit 0247adfcb6abd4eb969dad2227cdfb5404072f32
Author: Karl Williamson <[email protected]>
Date:   Sun Jan 7 15:58:52 2018 -0700

    locale.c: Remove branch to label
    
    The code at this label was branched to because it contained common
    cleanup code.  But now that code is in a function, so the cleanup call
    is trivial, so just skip this intermediate label.

commit c8860d3543197d208eb8b98008e1343f6d26dabc
Author: Karl Williamson <[email protected]>
Date:   Sat Jan 6 12:42:35 2018 -0700

    locale.c: Extract duplicated code into subroutines
    
    These two paradigms are each repeated in 4 places.  Make them into two
    subroutines.

commit c4f9f16e3728b053c8d48b986a6ed08e13609f4b
Author: Karl Williamson <[email protected]>
Date:   Fri Jan 5 21:41:27 2018 -0700

    locale.c: Prefer mbrtowc(), as it's reentrant
    
    If it's available and this is a threaded build, it's preferred.

commit 3621b5634ad1e4d312100e682fc718ab1cddab72
Author: Karl Williamson <[email protected]>
Date:   Sun Jan 7 15:43:01 2018 -0700

    locale.c: White-space only
    
    Indent to correspond with new block from previous commit

commit d75c40dcebea465b3ff8803170835434087b8630
Author: Karl Williamson <[email protected]>
Date:   Fri Jan 5 14:09:40 2018 -0700

    locale.c: Revamp finding if locale is UTF-8
    
    This changes how this functionality works for the LC_CTYPE locale.  On
    systems that have nl_langinfo() one can get a definitive answer from
    just that.  Otherwise (or if that doesn't return properly) one can use
    mbtowc() to check if the UTF-8 byte sequence for the Unicode REPLACEMENT
    CHARACTER actually is considered to be that code point.  This is also
    definitive.  If the maximum byte string length for a character is too
    short to handle all Unicode UTF-8, we know without further checking that
    this isn't a UTF-8 locale, so can avoid the mbtowc check.
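
The order of checks described above can be illustrated in Python, whose `locale` module exposes `nl_langinfo()` on platforms that have it. This is only a sketch: the helper names are hypothetical, the mbtowc() probe itself is not reproduced (only the bytes it would be fed), and comparing against the 4-byte encoding of U+10FFFF is an assumption about what "too short" means.

```python
import locale

REPLACEMENT_CHARACTER = "\uFFFD"     # its UTF-8 bytes feed the mbtowc() probe
HIGHEST_CODE_POINT = chr(0x10FFFF)   # longest standard UTF-8 sequence

def codeset_is_utf8(codeset):
    """True if a codeset name such as 'UTF-8' or 'utf8' denotes UTF-8."""
    return codeset.replace("-", "").replace("_", "").lower() == "utf8"

def current_ctype_is_utf8():
    """Definitive answer where nl_langinfo() exists, else None."""
    if hasattr(locale, "nl_langinfo"):
        codeset = locale.nl_langinfo(locale.CODESET)
        if codeset:
            return codeset_is_utf8(codeset)
    return None   # caller would fall back to the multi-byte probe

def too_short_for_utf8(max_bytes_per_char):
    """A locale whose longest character is shorter than the longest
    UTF-8 sequence cannot be UTF-8, so the probe can be skipped."""
    return max_bytes_per_char < len(HIGHEST_CODE_POINT.encode("utf-8"))
```

A single-byte locale (maximum length 1) is rejected by `too_short_for_utf8` without any further work, which is the shortcut the last sentence of the commit message describes.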

commit 277bdec22218560eb157720ab0cbaf058e468dc9
Author: Karl Williamson <[email protected]>
Date:   Sun Jan 7 15:30:06 2018 -0700

    locale.c: Windows will never be EBCDIC
    
    This adjusts the conditional compilation so that win32 is a subset of
    non-EBCDIC.  This will be useful in the next commit.

commit b72d01edddc439f710b7821f0de7a5c11c606b18
Author: Karl Williamson <[email protected]>
Date:   Fri Jan 5 12:57:37 2018 -0700

    locale.c: Simplify expression
    
    Since this is operating on C strings, we don't have to check the
    lengths, but can rely on the underlying functions to work.

commit 347b22e2966f2ff92a7a33cd0207cabfe60f42ed
Author: Karl Williamson <[email protected]>
Date:   Fri Jan 5 11:35:00 2018 -0700

    Change some "shouldn't happen" failures into panics
    
    If the system is so broken that these libc calls are failing, soldiering
    on won't lead to sane results.
    
    This rewords some existing panics, and adds the errno to the output for
    all of them.

commit fd48e3bd2a5c3b41caae6822b3f27f4174735f01
Author: Karl Williamson <[email protected]>
Date:   Tue Jan 2 16:54:28 2018 -0700

    Cache locale UTF8-ness lookups
    
    Some locales are UTF-8, some are not.  Knowledge of this is needed in
    various circumstances.  This commit saves the results of the last
    several lookups so they don't have to be recalculated each time.
    
    The full generality of POSIX locales is such that you can have error
    messages be displayed in one locale, say Spanish, while other things are
    in French.  To accommodate this generality, the program can loop through
    all the locale categories finding the UTF8ness of the locale it points
    to.  However, in almost all instances, people are going to be in either
    French or in Spanish, and not in some combination.  Suppose it is a
    French UTF-8 locale for all categories.  This new cache will know that
    the French locale is UTF-8, and the queries for all but the first
    category can return that immediately.
    
    This simple cache avoids the overhead of hashes.
    
    This also fixes a bug I realized exists in threaded perls, but haven't
    reproduced.  We do not support locales in such perls, and the user must
    not change the locale or 'use locale'.  But perl itself could change the
    locale behind the scenes, leading to segfaults or incorrect results.
    One such instance is the determination of UTF8ness.  But this only could
    happen if the full generality of locales is used so that the categories
    are not all in the same locale.  This could only happen (if the user
    doesn't change locales) if the environment is such that the perl program
    is started up so that the categories are in such a state.  This commit
    fixes this potential bug by caching the UTF8ness of each category at
    startup, before any threads are instantiated, and so checking for it
    later just looks it up in the cache, without perl changing the locale.
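
A cache of "the last several lookups" that avoids hashes could look roughly like the Python sketch below. The class name, the capacity of 4, and the move-to-front policy are assumptions for illustration; the commit message does not specify them.

```python
class Utf8nessCache:
    """Tiny most-recently-used list of (locale_name, is_utf8) pairs.

    A linear scan over a handful of entries stands in for the
    hash-free cache described above.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = []            # most recently used first

    def lookup(self, name, compute):
        for i, (known, is_utf8) in enumerate(self.entries):
            if known == name:
                # Hit: move to the front and skip the expensive check.
                self.entries.insert(0, self.entries.pop(i))
                return is_utf8
        is_utf8 = compute(name)      # miss: do the real determination
        self.entries.insert(0, (name, is_utf8))
        del self.entries[self.capacity:]   # drop the oldest entries
        return is_utf8
```

In the common case where every category names the same locale, only the first category pays for `compute`; the rest are answered from the front of the list.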

commit b35dbf30440000db2cafa2aab7f3daa722f1de21
Author: Karl Williamson <[email protected]>
Date:   Tue Jan 2 14:23:24 2018 -0700

    locale.c: Avoid duplicate work
    
    As the comments say, the needed value is already readily available

commit 111f218fdc88bb9a3b5b0e90c6d1306cf42205d8
Author: Karl Williamson <[email protected]>
Date:   Tue Jan 2 13:38:16 2018 -0700

    locale.c: Avoid some work
    
    We've already worked out whether the decimal point is a dot or not.  We
    can pass that information to the called routine so it doesn't have to
    figure it out again.

commit c79e033d9f3aa8c1d35429e237fd36dd727e2d1c
Author: Karl Williamson <[email protected]>
Date:   Tue Jan 2 13:19:03 2018 -0700

    locale.c: Use non-control for a format dummy
    
    We need a plain character here.  I used a '\e' before, but it would be
    better to have something that isn't a control, so just change it to a
    blank.

commit b9e9751b44055e55fdd60c74cab59a474bb1f741
Author: Karl Williamson <[email protected]>
Date:   Tue Jan 2 12:25:35 2018 -0700

    locale.c: Avoid some more locale changes
    
    In a few places here we can test if we are already in the locale we want
    to be in, and not switch unnecessarily if so.

commit 3d6fc36f1f689f2fb6a7885aba2c0aff76652bf2
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 23:03:34 2018 -0700

    Avoid some unnecessary changing of locales
    
    The LC_NUMERIC locale category is kept so that generally the decimal
    point (radix) is a dot.  For some (mostly) output purposes, it needs to
    be swapped into the program's current underlying locale so that a
    non-dot can be printed.
    
    This commit changes things so that if the current underlying locale uses
    a dot as its decimal point, the swap doesn't happen, as it's not needed.

commit 8da47b1a59b417d3ba5f18390f5163d4943cdfdb
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 22:20:25 2018 -0700

    perl.h: White-space only

commit e5c6db574f4fd31a98c12541c9bae3611e0725b9
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 20:41:21 2018 -0700

    locale.c: Add compile check for unimplemented behavior
    
    Instead of silently not working.

commit d8a975d5634fca95738d5621ad5a2cd15cf12379
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 20:30:39 2018 -0700

    locale.c: White-space only
    
    Indent because the previous commit created an enclosing block, and
    add a blank line elsewhere

commit 2c1a8268b7c7f84272553a8bd206bfacf81f2422
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 20:00:03 2018 -0700

    locale.c: Refactor Ultrix code
    
    Examination shows that this code does nothing unless LC_ALL is defined.
    So explicitly test at compile time for that.
    
    Also, two variables don't have to be declared so globally.  Reducing
    their scope by creating a new block means we don't have to have
    PERL_UNUSED_ARG()s for them.

commit 98b577f9c8b0e9a8e184806fbba0075a738e93a8
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 19:07:19 2018 -0700

    locale.c: Avoid rescanning a string
    
    We can use a parameter to find out where in the string the portion of
    interest starts.  Do that to avoid starting again from scratch.

commit 98fa0abc4ed36ef88786edd463a1eb0656336106
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 18:33:59 2018 -0700

    locale.c: Use fcns instead of macros
    
    Here the macros being used expand into the functions being called,
    without adding any value to using the macros, and making things slightly
    less clear.

commit 70b24b3198726e96c812d279945618c2467b8af9
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 18:17:41 2018 -0700

    locale.c: Add const to several variables

commit f371a6e2ac9044b89106f3d661668e6e4ea7023e
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 18:15:27 2018 -0700

    locale.c: Improve, add comments

commit 2650121b66d89148588737eddcbab4218977b78b
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 18:01:45 2018 -0700

    perl.h: Add comment, rephrase another

commit dd02bd1a4f5a343a0b1a628d789738b3d0d03268
Author: Karl Williamson <[email protected]>
Date:   Sat Nov 18 17:34:25 2017 -0700

    Perl_langinfo: Teach about YESSTR and NOSTR
    
    These are items that nl_langinfo() used to be required to return, but
    are considered obsolete.  Nonetheless, this drop-in replacement for that
    function should know about them for backward compatibility.

commit 9f34316dc78d382be3b56aaf9405459b4352bfa5
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 1 15:07:45 2018 -0700

    APItest/t/locale.t: Add some tests
    
    This makes sure that the entries for which the expected return value may
    legitimately vary from platform to platform get tested as returning
    something, skipping the test if the item isn't known on the platform.
    
    A couple of comments are also added.

commit 1a101d79ac6dd65ec025bfaf8f6edafa03b01f9f
Author: Karl Williamson <[email protected]>
Date:   Mon Aug 28 18:01:43 2017 -0600

    XXX may include other things after final edits:
    ExtUtils::ParseXS/lib/perlxs.pod: Nits
    
    This removes extra blanks following colons that don't mean the normal
    thing for colons that traditionally have two spaces after them, and
    capitalizes Perl.

commit 08a890e503909835b78a126387daa3eb31ed52e6
Author: Karl Williamson <[email protected]>
Date:   Wed Jul 26 08:59:33 2017 -0600

    Teach perl about more locale categories
    
    glibc has various other categories than the ones perl handles, for
    example LC_PAPER.  This commit adds knowledge of these to perl, so that
    one can set them, interrogate them, and have libraries work on them,
    even though perl itself does not.
    
    This is in preparation for future commits, where it becomes more
    important than currently for perl to know about all the locale
    categories on the system.
    
    I looked through various other systems to try to find other categories,
    but did not see any.  If a system does have such a category, it is
    pretty easy to tell perl about it, and recompile.  Use the changes in
    this commit as a template, and send an email to [email protected], so
    that the next Perl release will have it.

commit c4004986bd8a62c2ed10d9aedba9c0f87e3eb35c
Author: Karl Williamson <[email protected]>
Date:   Wed Jan 3 20:41:29 2018 -0700

    Add check that "$!" is correctly interpreted as UTF-8
    
    We sometimes need to know if an error message is UTF-8 or not.
    Previously we checked that it is syntactically valid UTF-8, and that the
    LC_MESSAGES locale is UTF-8.  But some systems, notably Windows, do not
    have LC_MESSAGES.  For those, this commit adds a different, semantic,
    check that the text of the message when interpreted as UTF-8 is all in
    the same Unicode script.  This is not foolproof, unlike the LC_MESSAGES
    check, but it's better than what we have now for such systems.  It
    likely is foolproof for non-Latin locales, as any message will have a
    bunch of characters in that locale, and no ASCII Latin ones.  For a
    Latin locale, these ASCII letters could be intermixed with the UTF-8
    ones, causing potential ambiguity.
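
The semantic check described above can be approximated in Python as follows. The standard library has no script property, so this sketch uses the first word of each character's Unicode name as a crude stand-in; that substitution, and both function names, are assumptions rather than what Perl does.

```python
import unicodedata

def rough_script(ch):
    """Crude stand-in for the script: first word of the character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def message_looks_utf8(raw):
    """True if raw is valid UTF-8 and its letters share one 'script'."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    scripts = {rough_script(ch) for ch in text if ch.isalpha()}
    return len(scripts) <= 1
```

A Russian message encoded as UTF-8 is all Cyrillic and passes; the same text in KOI8-R is not even well-formed UTF-8; and a byte string that happens to decode to a Latin letter followed by a Greek one fails on the script test. A pure ASCII message also passes, which is harmless because ASCII reads the same either way.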

commit ee5191f4fbfd3c74558e587ae34a509c4b7e0e91
Author: Karl Williamson <[email protected]>
Date:   Tue Nov 14 22:27:06 2017 -0700

    Remove uncompilable code
    
    This code was never compiled because of a misspelling in the #ifdef.
    No problem surfaced, so just remove it.  The next commit adds a different
    check.

commit 1522b18305309bd0efcdee1ea6e4cd0dfb2cf142
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 8 19:11:52 2018 -0700

    XXX rethink empty script_run

commit 33cd9056579e36f1eb1cf6075caeb961edd3d35d
Author: Karl Williamson <[email protected]>
Date:   Mon Jan 8 19:08:54 2018 -0700

    perl.c: Move initialization of inversion lists
    
    This is now done very early in the file, as it may be needed for
    initializing the locale handling.

commit edc987b32e1defd5a20e5945bf64a56005e29ca3
Author: Karl Williamson <[email protected]>
Date:   Sat Jan 6 21:16:15 2018 -0700

    Give isSCRIPT_RUN() an extra parameter
    
    This allows it to return the script of the run.

commit 9a86f32981d256429099e6ac76f67e792fc9effe
Author: Karl Williamson <[email protected]>
Date:   Sat Jan 6 16:15:12 2018 -0700

    charclasslists.h: script enums visible to CORE,EXT
    
    This exposes the enum definitions for the script extensions property to
    the perl code and extensions, for use in future commits.

commit 57504bf394fc0fe9694b36b9286d29bc77be5ec6
Author: Karl Williamson <[email protected]>
Date:   Sat Jan 6 16:13:06 2018 -0700

    regen/mk_invlists.pl: Allow override of where enums get defined
    
    This adds code so that the enums defined by this, which are ordinarily
    only used by regexec.c, can be specified to be somewhere else instead.

commit a8716b7d2d11b0311a33c8e833035f86249da142
Author: Karl Williamson <[email protected]>
Date:   Sat Jan 6 16:09:57 2018 -0700

    regen/mk_invlists.pl: Allow multiple files to access
    
    This changes the code so that the symbols defined by this program
    can be #define'd in more than one file.

commit 0cd9515ed9729b0004f187a01c2d8f578e42fb76
Author: Karl Williamson <[email protected]>
Date:   Sat Jan 6 16:18:45 2018 -0700

    Fix bug in script runs that start with Common
    
    This is a follow on to 8535a06fea02528fe726855a139fcbd360d1fc6e.  That
    fixed one case in which things did not work properly when the first
    character was in the Common script.  It did not catch the case where a
    later character in the string was non-Common, from a script that has
    its own set of digits, and this commit fixes that.
    
    This just entails moving a block of code to slightly earlier.

-----------------------------------------------------------------------

-- 
Perl5 Master Repository
