In perl.git, the branch smoke-me/khw-new_locale has been created
<http://perl5.git.perl.org/perl.git/commitdiff/afbd0374e5467934a57030136eb9a3132798cb54?hp=0000000000000000000000000000000000000000>
at afbd0374e5467934a57030136eb9a3132798cb54 (commit)
- Log -----------------------------------------------------------------
commit afbd0374e5467934a57030136eb9a3132798cb54
Author: Karl Williamson <[email protected]>
Date: Thu May 19 20:03:06 2016 -0600
temp debug
M locale.c
commit 13cd9f3403c64ca05427bff7e0bf643b0c014009
Author: Karl Williamson <[email protected]>
Date: Fri May 13 11:32:44 2016 -0600
locale.c: Make locale collation predictions adaptive
We try to avoid calling strxfrm() more than needed by predicting its
needed buffer size. This generally works because the size of the
transformed string is roughly linear with the size of the input string.
But the key word here is "roughly". This commit changes things, so that
when we guess low, we change the coefficients in the equation to guess
higher the next time.
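As a rough illustration of the idea only (this is not the code in locale.c), one way to "guess higher the next time" is to raise the slope whenever a prediction falls short. The struct, the function names, and the choice to adjust 'm' rather than 'b' are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder for the coefficients of  m * input_length + b */
typedef struct {
    size_t m;
    size_t b;
} xfrm_coeffs;

static size_t
predict(const xfrm_coeffs *c, size_t input_len)
{
    return c->m * input_len + c->b;
}

/* Call this after strxfrm() reported that it needed `needed` bytes for an
 * input of `input_len` bytes.  If our prediction was too low, raise the
 * coefficients so the same input would have been predicted correctly. */
static void
adapt_coeffs(xfrm_coeffs *c, size_t input_len, size_t needed)
{
    assert(c != NULL);

    if (needed <= predict(c, input_len)) {
        return;                 /* the guess was big enough; leave it alone */
    }

    if (input_len == 0) {
        c->b = needed;          /* only the intercept matters here */
        return;
    }

    /* needed > m * len + b, so (needed - b) is positive.  Take the ceiling
     * of (needed - b) / len as the new slope. */
    c->m = (needed - c->b + input_len - 1) / input_len;
}
```

With coefficients m = 2, b = 5 and a 10-byte input that turned out to need 40 bytes, the slope becomes 4, so the next prediction for a 10-byte input is 45.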
M locale.c
commit 3778a9185f5bde6077feab14f748a8b1323cece0
Author: Karl Williamson <[email protected]>
Date: Tue Apr 12 14:28:57 2016 -0600
locale.c: Not so aggressive collation memory use guess
On platforms where strxfrm() is not well-behaved, and it fails because
it needs a larger buffer, prior to this commit, the size was doubled
before trying again. This could require a lot of memory on large
inputs. I'm uncomfortable with such a big delta on very large strings.
This commit changes it so it is not so aggressive. Note that this now
only gets called on platforms whose strxfrm() is not well behaved, and I
think the size prediction is better due to a recent commit, and there
isn't really much of a downside in not gobbling up memory so fast.
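To make the trade-off concrete, here is a sketch comparing doubling with a gentler growth step. The commit message does not state the new increment, so the "grow by a quarter plus one" policy below is purely an assumption for illustration, as are the function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The old policy: double the buffer on each failure. */
static size_t
grow_doubling(size_t size)
{
    return (size > SIZE_MAX / 2) ? SIZE_MAX : size * 2;
}

/* A hypothetical gentler policy: add a quarter of the current size, plus
 * one so that very small sizes still make progress. */
static size_t
grow_gently(size_t size)
{
    size_t delta = size / 4 + 1;

    return (size > SIZE_MAX - delta) ? SIZE_MAX : size + delta;
}
```

Starting from a 1000-byte buffer, doubling asks for 2000 bytes on the next try, while this gentler step asks for 1251. On a very large input the difference in memory requested per retry is correspondingly large.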
M locale.c
commit f7f17744dba2e167364f09739d164c7f62a75b56
Author: Karl Williamson <[email protected]>
Date: Wed May 18 13:18:01 2016 -0600
locale.c: Add some debugging statements
M locale.c
commit e0c750a24be340515c407dd1ffb90adc8e16affa
Author: Karl Williamson <[email protected]>
Date: Wed May 18 13:17:25 2016 -0600
locale.c: Minor cleanup
This replaces an expression with what I think is an easier to understand
macro, and eliminates a couple of temporary variables that just
cluttered things up.
M locale.c
commit b5e170845830f04f818a72e04a6ba2e8ac8fd290
Author: Karl Williamson <[email protected]>
Date: Sat May 14 18:23:02 2016 -0600
locale.c: Fix some debugging so will output during init
Because the command line options are currently parsed after the locale
initialization is done, an environment variable is read to allow
debugging of the function that is called to do the initialization.
However, any functions that it calls, prior to this commit, were unaware
of this and so did not output debugging. This commit fixes most of
them.
M locale.c
commit ea4b52b79529272e1817230f1c46046680cad665
Author: Karl Williamson <[email protected]>
Date: Tue Apr 12 12:49:36 2016 -0600
mv function from locale.c to mathoms.c
As a result of the previous commit, the function being moved is now just
a wrapper that is not called in core. In case someone is calling it, it
is retained, but moved to mathoms.c.
M embed.fnc
M embed.h
M locale.c
M mathoms.c
M proto.h
commit 0f664b45741ec7ece2f83a3fb9e7e6666d9ba446
Author: Karl Williamson <[email protected]>
Date: Tue May 17 20:50:55 2016 -0600
Do better locale collation in UTF-8 locales
strxfrm() works reasonably well on some platforms under UTF-8 locales.
It will assume that every string passed to it is in UTF-8. This commit
changes perl to make sure that strxfrm's expectations are met.
Likewise under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
string. And this commit makes sure of that. If the passed string
contains code points representable only in UTF-8, they are changed into
the highest collating code point that doesn't require UTF-8. This
provides seamless operation, as they end up collating after every
non-UTF-8 code point. If two transformed strings compare equal, perl
already uses the un-transformed versions to break ties, and there, these
faked-up strings will collate after everything else, and in code point
order amongst themselves.
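A sketch of the non-UTF-8 case, under several assumptions: the input is modelled as an array of code points (not how Perl stores strings), only single bytes are considered as candidates, and strcoll() on one-character strings stands in for whatever comparison locale.c really uses. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Find the single byte (1..255) that collates highest in the current
 * locale, by comparing one-character strings with strcoll(). */
static unsigned char
highest_collating_byte(void)
{
    char best[2] = { 1, 0 };
    char cand[2] = { 0, 0 };
    int c;

    for (c = 2; c < 256; c++) {
        cand[0] = (char) c;
        if (strcoll(cand, best) > 0) {
            best[0] = cand[0];
        }
    }
    return (unsigned char) best[0];
}

/* Build a byte string for strxfrm() from `n` code points.  Anything above
 * 0xFF cannot be represented without UTF-8, so it is replaced by the
 * highest collating byte.  `out` must have room for n + 1 bytes.
 * (Embedded NULs are a separate problem and are not handled here.) */
static void
downgrade_for_xfrm(const uint32_t *cps, size_t n, char *out)
{
    unsigned char top = highest_collating_byte();
    size_t i;

    assert(cps != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[i] = (char) (cps[i] > 0xFF ? top : cps[i]);
    }
    out[n] = '\0';
}
```

Because every replaced code point maps to the same byte, two such strings can transform identically; that is where the tie-breaking on the un-transformed strings described above comes in.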
M embed.fnc
M embed.h
M embedvar.h
M intrpvar.h
M lib/locale.t
M locale.c
M pod/perldelta.pod
M pod/perllocale.pod
M proto.h
M sv.c
commit 98c0717af374827c0c1f6b48a9345af3a1388945
Author: Karl Williamson <[email protected]>
Date: Tue Apr 12 13:51:48 2016 -0600
perllocale: Change headings so two aren't identical
Two html anchors in this pod were identical, which isn't a problem
unless you try to link to one of them, as the next commit does.
M pod/perllocale.pod
commit 865ce47b5b3f15d73f16f00e8cadb72e2a41fc37
Author: Karl Williamson <[email protected]>
Date: Tue Apr 12 11:21:40 2016 -0600
Change calculation of locale collation coefficients
Every time a new collation locale is set, two coefficients are calculated
that are used in predicting how much space is needed in the
transformation of a string by strxfrm(). The transformed string is
roughly linear with the length of the input string, so we are
calculating 'm' and 'b' such that
transformed_length = m * input_length + b
Space is allocated based on this prediction. If it is too small, the
strxfrm() will fail, and we will have to increase the allotted amount
and try again. It's better to get the prediction right to avoid
multiple, expensive strxfrm() calls.
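For illustration, 'm' and 'b' could be estimated from two sample strings of different lengths, as sketched below. This is not the calculation locale.c performs; the sample-based approach, the rounding choices, and all names are assumptions. It relies on strxfrm(NULL, s, 0) returning the length the transformation would need, which the C standard permits when the size argument is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for the coefficients of
 *     transformed_length = m * input_length + b            */
typedef struct {
    size_t m;   /* growth per input byte */
    size_t b;   /* fixed overhead */
} xfrm_coeffs;

/* Estimate m and b by measuring the transformed lengths of a shorter and
 * a longer sample string in the current locale. */
static xfrm_coeffs
estimate_coeffs(const char *shorter, const char *longer)
{
    xfrm_coeffs c = { 1, 0 };
    size_t x1 = strlen(shorter);
    size_t x2 = strlen(longer);
    size_t y1, y2;

    assert(x2 > x1);
    y1 = strxfrm(NULL, shorter, 0);   /* length only; nothing is written */
    y2 = strxfrm(NULL, longer, 0);

    if (y2 > y1) {
        /* Round the slope up so the prediction errs on the high side */
        c.m = (y2 - y1 + (x2 - x1) - 1) / (x2 - x1);
    }

    /* b = y1 - m * x1, clamped at zero */
    c.b = (y1 > c.m * x1) ? y1 - c.m * x1 : 0;
    return c;
}

/* Predicted buffer size, including room for the trailing NUL. */
static size_t
predict_size(xfrm_coeffs c, size_t input_len)
{
    return c.m * input_len + c.b + 1;
}
```

In the C locale, where strxfrm() is effectively a copy, this yields m = 1 and b = 0. Other locales typically give a larger slope.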
Prior to this commit, the calculation was not rigorous, and failed on
some platforms that don't have a fully conforming strxfrm().
This commit changes perl to not panic if a locale has apparently defective
collation, but instead to silently switch to C-locale collation. It
could be argued that a warning should additionally be raised.
This commit fixes [perl #121734].
M locale.c
M pod/perldelta.pod
commit 3dc247eea6bdd7faf3324dac8fdc3a153ccaef83
Author: Karl Williamson <[email protected]>
Date: Mon Apr 11 19:11:07 2016 -0600
locale.c: Change algorithm for strxfrm() trials
It's kind of guesswork deciding how big a buffer to give to strxfrm().
If you give it too small a one, it will fail. Prior to this commit, the
buffer size was doubled and then strxfrm() was called again, looping
until it worked, or we used too much memory.
Each time a new locale is made, we try to minimize the necessity of
doing this by calculating numbers 'm' and 'b' that can be plugged into
the equation
mx + b
where 'x' is the size of the string passed to strxfrm(). strxfrm() is
roughly linear with respect to its input's length, so this generally
works without us having to do many loops to get a large enough size.
But on many systems, strxfrm(), in failing, returns how much space you
should have given it. On such systems, we can just use that number on
the 2nd try and not have to keep guessing. This commit changes to do
that.
But on other systems this doesn't work. So the original method is
retained if we determine that there are problems with strxfrm(), either
from previous experience, or because using the size returned from the
first trial didn't work.
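The two-stage strategy can be sketched roughly as follows. This is a simplified stand-in, not the locale.c implementation: the function name, the cap on retries, and the plain malloc()/free() calls are assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Transform `src` for collation, starting from a guessed buffer size.
 * Returns a malloc'd string that the caller must free, or NULL. */
static char *
xfrm_with_retry(const char *src, size_t guess)
{
    size_t size = guess ? guess : 1;
    int trust_return = 1;   /* assume strxfrm() reports the needed length */
    int tries;

    assert(src != NULL);

    for (tries = 0; tries < 16; tries++) {
        char *buf = malloc(size);
        size_t needed;

        if (buf == NULL) {
            return NULL;
        }

        needed = strxfrm(buf, src, size);
        if (needed < size) {
            return buf;             /* the result fit, NUL included */
        }
        free(buf);

        if (trust_return) {
            size = needed + 1;      /* use the reported length once */
            trust_return = 0;       /* if that fails too, stop trusting it */
        } else {
            size *= 2;              /* the original method: keep doubling */
        }
    }
    return NULL;                    /* gave up */
}
```

On a well-behaved strxfrm() this needs at most two calls: one that fails and reports the required length, and one that succeeds.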
M embedvar.h
M intrpvar.h
M locale.c
commit 811bfb90f56d922c9378389ec40c45d7fb1e8e02
Author: Karl Williamson <[email protected]>
Date: Sat Apr 9 20:40:48 2016 -0600
locale.c: Free over-allocated space early
We may over malloc some space in buffers to strxfrm(). This frees it
now instead of waiting for the whole block to be freed sometime later.
This can be a significant amount of memory if the input string to
strxfrm() is long.
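In plain C terms the idea looks like the sketch below, which uses realloc() directly. Perl has its own allocation wrappers, so the function name and the use of realloc() here are only stand-ins.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Give back the unused tail of an over-allocated buffer.  `used` is the
 * number of bytes actually needed, including the trailing NUL.  If the
 * shrink fails, the original (larger) buffer is still valid and is kept. */
static char *
shrink_to_fit(char *buf, size_t used)
{
    char *trimmed;

    assert(buf != NULL && used > 0);
    trimmed = realloc(buf, used);
    return trimmed ? trimmed : buf;
}
```

A caller that allocated generously for strxfrm() would pass the returned length plus one as `used` once the call has succeeded.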
M locale.c
commit 95e2db4d7bdb04167c0b816519b0e716c8c6c43b
Author: Karl Williamson <[email protected]>
Date: Sat Apr 9 20:36:01 2016 -0600
locale.c: White-space only
Outdent and reflow because the previous commit removed an enclosing
block.
M locale.c
commit e5bd1c7aadf571974fd3922b79568f73f622f106
Author: Karl Williamson <[email protected]>
Date: Sat Apr 9 15:52:05 2016 -0600
XXX pod, left in generality Change mem_collxfrm() algorithm for embedded
NULs
One of the problems in implementing Perl is that the C library routines
forbid embedded NUL characters, which Perl accepts. This is true for
the case of strxfrm() which handles collation under locale.
The best solution as far as functionality goes, would be for Perl to
write its own strxfrm replacement which would handle the specific needs
of Perl. But that is not going to happen because of the huge complexity
in handling it across many platforms. We would have to know the
location and format of the locale definition files for every such
platform. Some might follow POSIX guidelines, some might not.
strxfrm creates a transformation of its input into a new string
consisting of weight bytes. In the typical but general case, a 3
character NUL-terminated input string 'A B C 00' (spaces added for
readability) gets transformed into something like:
A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00
where the superscripted characters are weights for the corresponding
input characters. Superscript 1 represents the primary sorting key; 2,
the secondary, etc, for as many levels as the locale definition gives.
The 01 byte is likely to be the separator between levels, but not
necessarily, and there could be some other mechanisms used on various
platforms.
To handle embedded NULs, the simplest thing would be to just remove them
before passing in to strxfrm(). Then they would be entirely ignored,
which might not be what you want. You might want them to have some
weight at the tertiary level, for example. It also causes problems
because strxfrm is very context sensitive. The locale definition can
define weights for specific sequences of any length (and the weights can
be multi-byte), and by removing a NUL, two characters now become
adjacent that weren't in the input, and they could now form one of those
special sequences and thus throw things off.
Another way to handle NULs, that seemingly ignores them, but actually
doesn't, is the mechanism in use prior to this commit. The input string
is split at the NULs, and the substrings are independently passed to
strxfrm, and the results concatenated together. This doesn't work
either. In our example 'A B C 00', suppose B is a NUL, and should have
some weight at the tertiary level. What we want is:
A¹ C¹ 01 A² C² 01 A³ B³ C³ 00
But that's not at all what you get. Instead it is:
A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00
The primary weight of C comes immediately after the tertiary weight of A,
but more importantly, a NUL, instead of being ignored at the primary
levels, is significant at all levels, so that "a\0c" would sort before
"ab".
Still another possibility is to replace the NUL with some other
character before passing it to strxfrm. That was my original plan, to
replace each NUL with the character that this code determines has the
lowest collation order for the current locale. On strings that don't
contain that character, the results would be as good as it gets for that
locale. That character is likely to be ignored at higher weight levels,
but have some small non-ignored weight at the lowest ones. And
hopefully the character would rarely be encountered in practice. When
it does happen, it and NUL would sort identically; hardly the end of the
world. If the entire strings sorted identically, the NUL-containing one
would come out before the other one, since the original Perl strings are
used as a tie breaker. However, testing showed a problem with this. If
that other character is part of a sequence that has special weighting,
the results won't be correct. With gcc, U+00B4 ACUTE ACCENT is the
lowest collating character in many UTF-8 locales. It combines in
Romanian and Vietnamese with some other characters to change weights,
and hence changing NULs into it screws things up.
What I have finally settled on is a modification of this final
approach, where the possible NUL replacements are limited to just
characters that are controls in the locale. NULs are replaced by the
lowest collating control. It would really be a defective locale if this
control combined with some other character to form a special sequence.
Often the character will be a 01, START OF HEADING. In the very
unlikely case that there are absolutely no controls in the locale, 01 is
used, because SOMETHING has to be.
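A minimal sketch of this approach follows. It is not the code added to locale.c: the function names are hypothetical, strcoll() on one-character strings is used here as a simple way to compare collation order, and only the ASCII-range controls are examined.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return the control character in 1..127 that collates lowest in the
 * current locale, or 0x01 if the locale reports no controls at all. */
static unsigned char
lowest_collating_control(void)
{
    char best[2] = { 0, 0 };
    char cand[2] = { 0, 0 };
    int c;

    for (c = 1; c < 128; c++) {
        if (!iscntrl(c)) {
            continue;
        }
        cand[0] = (char) c;
        if (best[0] == 0 || strcoll(cand, best) < 0) {
            best[0] = cand[0];
        }
    }
    return best[0] ? (unsigned char) best[0] : 0x01;
}

/* Replace every embedded NUL in the first `len` bytes of `s`, so that
 * the result can be passed to strxfrm() as an ordinary C string. */
static void
replace_nuls(char *s, size_t len, unsigned char replacement)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < len; i++) {
        if (s[i] == '\0') {
            s[i] = (char) replacement;
        }
    }
}
```

In the C locale the lowest collating control is 01, so the three bytes 'a', NUL, 'c' become 'a', 01, 'c' before being handed to strxfrm().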
The code added by this commit is mostly utf8-ready. A few commits from
now will make Perl properly work with UTF-8 (if the platform supports
it). But until that time, this isn't a full implementation; it only
looks for the lowest-sorting control that is invariant, where the
UTF8ness doesn't matter.
M embed.fnc
M embedvar.h
M intrpvar.h
M lib/locale.t
M locale.c
M pod/perldelta.pod
M pod/perllocale.pod
M proto.h
commit f43d6f7b36558c0818c7bcc3ef00b21704bc9363
Author: Karl Williamson <[email protected]>
Date: Tue May 17 21:53:53 2016 -0600
locale.c: Add, move, clarify comments
This moves a large block of comments to before a block, outdents it, and
adds to it, plus adding another comment
M locale.c
commit 66b48b5d307f53df9ca8e631c6146864e12797c1
Author: Karl Williamson <[email protected]>
Date: Mon May 16 15:19:14 2016 -0600
Keep track of if collation locale is UTF-8 or not
This will be used in future commits
M embedvar.h
M intrpvar.h
M locale.c
M sv.c
commit 022c3cea46bc737580fc97e28ff875dc7628afc6
Author: Karl Williamson <[email protected]>
Date: Mon May 16 15:15:26 2016 -0600
locale.c: Don't use special locale collation for C locale
We can skip all the locale collation calculations if the locale we are
in is C or POSIX.
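The check itself can be as simple as the sketch below, which queries the current LC_COLLATE locale name. The function name is hypothetical, and locale.c may detect this differently.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Return nonzero if the current collation locale is C or POSIX, in which
 * case plain byte comparison already gives the right order and the
 * strxfrm() machinery can be skipped. */
static int
collation_is_c_locale(void)
{
    const char *name = setlocale(LC_COLLATE, NULL);   /* query only */

    return name != NULL
        && (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0);
}
```

A program that has not called setlocale() starts in the C locale, so this returns true until the locale is changed.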
M locale.c
commit 968e5483b78567e1dc71642588edad96c8614efe
Author: Karl Williamson <[email protected]>
Date: Fri May 13 11:51:55 2016 -0600
lib/locale.t: Don't calculate value unless needed
M lib/locale.t
-----------------------------------------------------------------------
--
Perl5 Master Repository