On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
>
>
> Pádraig Brady wrote:
>> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>>> Within in the past few years, use of ranges in RE's has become
>>> unreliable due to some locale changes sorting their native character
>>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>>
>>> There seems to be a problem in when a user has set their system to use
>>> Unicode, it is no longer using the locale specific character set
>>> (iso-8859-x, or others).
> ----
> To clarify my above statement:
>
>
> There seems to be a problem in when a user has set their system to use
> Unicode: It is no longer using the locale specific character set (iso-8859-x,
> or others) -- ***or*** *their* *orderings*. I.e. Unicode defines a collation
> order -- I don't know that they others do ('C' does, but I don't know about
> other locale-specific character sets).
>
>
>> It's not specific to "unicode". Sorting in a iso-8859-1 charset
>> results in locale ordering:
> ----
> Can you cite a source specifying the sort/collation order of the
> iso-8859-1 charset that would prove that it is not-conforming to the collation
> specification for that charset?
>
> I.e. If there is no official source, then the order with that charset
> is "undefined", and while it may not be desirable, returning a<A<b<B, would
> not be "an error".

It's a charset. Of course the order is defined. Try: man iso-8859-1

The relative ordering can be trivially inferred from the command I presented.
But to be explicit:

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
a
A
á
b

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=C sort | iconv -f iso-8859-1
A
a
b
á

>
>
>
>
>>> http://unicode.org/charts/case/chart_Latin.htm.
>>
>> http://unicode.org/charts/case/chart_Latin.html
> ---
> ^^Correct^^ (typho)
>
>>> Temporarily ignoring accents, only talking about lower and upper
>>> case letters, ...
>>
>> Well case comparison is a complicated area.
> ----
> A bit, but it's mostly just wrong in the gnu library concerning unicode, and,
> as you are pointing out -- the 'C' encoding as well.
> the 'C' locale was the original charset used by the 'C' language -- only 8
> bits wide.
>
> So how can it sort characters beyond the lower 256?
> This would seem to be meaningless and bugs output.

http://www.pixelbeat.org/docs/utf8_programming.html

> Is it?... When the case comparison ordering is specified in a
> standard, it makes it fairly clear that one is either compliant with the
> standard or not.
>
> In this case, the Gnu sort/collation lib is not Unicode/UTF-8 compliant.
>
> What happens in other charsets may or may not be covered under some
> other standard -- e.g. the 'C'/ascii ordering is specified. But I don't know
> if others have relevant standards or not.
>
>>
>> For the special case of discounting accented chars etc.
>> you can use an attribute of the well designed UTF-8.
> ---
> This is not exactly the point -- the point is that the core sort
> DOESN'T use that ordering. That's the bug I am reporting.

Well you can't generally exclude accents.

>
> In reporting this, I'm trying to keep the argument 'simple' and focus on
> the problem of widely used ranges in the first 256 code-points of
> Unicode.
>
> Unicode gives a fairly extensive algorithm for handling accents,
> but I didn't want to complicate the discussion by "going there". Please
> focus this bug on the lower 128 code points, as full unicode compliance
> with the full collation algorithm that is specified is likely to be a
> larger task. HOWEVER, fixing the sorting/collation order of the lower
> 127 code points, is, comparatively a small task that conceivably could be
> fixed in the next release.

The lower 128 code points (0-127) = ASCII.
If your input data is ASCII, just use LC_ALL=C.

>> Enabling traditional byte comparison on (normalized) UTF-8 data
>> will result in data sorted in Unicode code point order:
>> A b a á => A a b á
>
> But you are missing the point (as well as raising an interesting
> 'feature'(?bug?)).
>
> How is it that 'C' collation collates characters that are outside the ascii
> range?

Well, whether C should be a "unicode" or "ascii" charset
is a whole different kettle of fish. I was just pointing out
(as per the link above) that UTF-8 is well designed,
so that it works with many traditional single-byte functions.

> I.e. -- you can't interpret input data as 'unicode' in the 'C' locale.
> So how does this work in the 'C' local? AND more importantly -- it SHOULD
> work when charset is unicode (UTF-8)... and does not.
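
As a rough sketch of the byte-comparison point (the sample characters and the od call are illustrative choices, not taken from the test program in this thread): in the C locale, sort never decodes the input. It compares raw bytes, and because UTF-8 assigns lexicographically larger byte sequences to higher code points, the result comes out in code point order.

```shell
#!/bin/sh
# Sketch: byte-wise sorting of UTF-8 text in the C locale.
# Nothing is decoded; sort compares bytes, and UTF-8's design makes
# that equivalent to ordering by code point.

# Illustrative sample: ASCII digits and letters plus three Roman
# numeral characters (U+2167, U+2176, U+2172).
sorted=$(printf '%s\n' A a B b 2 1 Ⅷ ⅶ ⅲ | LC_ALL=C sort)
printf '%s\n' "$sorted"

# Show the raw bytes that the comparison sees for U+2167.
# Prints the three bytes e2 85 a7.
printf 'Ⅷ' | od -An -tx1
```

Every non-ASCII character here starts with byte 0xe2, which is larger than any ASCII byte, so those characters land after the ASCII ones and are then ordered among themselves by their trailing bytes.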
> Test prog:
> ---------------
> #!/bin/bash
> set -m
> # vals to test:
> declare -a vals=( A a B b X x Y y Z z Ⅷ Ⅴ Ⅲ Ⅰ Ⅿ Ⅽ ⅶ ⅼ ⅲ )
> COLLATE_ORDER=C
>
> function isatty {
>     local fd=${1:-1} ;
>     0<&$fd tty -s
> }
>
> function ord {
>     local nl="";
>     isatty && nl="\n"
>     printf "%d$nl" "'$1"
> }
>
> function background_print {
>     readarray -t inp
>     for ch in "${inp[@]}"; {
>         printf "%s (U+%x)\n" "$ch" "$(ord "$ch")"
>     }
> }
>
>
> printf "%s\n" "${vals[@]}" |
>     LC_COLLATE=$COLLATE_ORDER sort |
>     background_print
>
> ------------------------------------
>
> Note, that the above produces:
>
> /tmp/stest
> Ⅷ (U+2167)
> Ⅴ (U+2164)
> Ⅲ (U+2162)
> Ⅰ (U+2160)
> Ⅿ (U+216f)
> Ⅽ (U+216d)
> ⅶ (U+2176)
> ⅼ (U+217c)
> ⅲ (U+2172)
> a (U+61)
> A (U+41)
> b (U+62)
> B (U+42)
> x (U+78)
> X (U+58)
> y (U+79)
> Y (U+59)
> z (U+7a)
> Z (U+5a)
>
> NOT the output you showed...Seems there's a bug in the C collation order?

Note that C doesn't use a collation order; it's simple byte comparison.
Seems there may be a bug in your script?
Also ensure that LC_ALL is not set, as it will override LC_COLLATE.

$ printf "%s\n" A a B b 2 1 Ⅷ ⅶ ⅲ | LC_COLLATE=C sort
1
2
A
B
a
b
Ⅷ
ⅲ
ⅶ

>
> Changing collation order to UTF-8:
>
> Same thing:
> /tmp/stest
> Ⅷ (U+2167)
> Ⅴ (U+2164)
> Ⅲ (U+2162)
> Ⅰ (U+2160)
> Ⅿ (U+216f)
> Ⅽ (U+216d)
> ⅶ (U+2176)
> ⅼ (U+217c)
> ⅲ (U+2172)
> a (U+61)
> A (U+41)
> b (U+62)
> B (U+42)
> x (U+78)
> X (U+58)
> y (U+79)
> Y (U+59)
> z (U+7a)
> Z (U+5a)
>
>
>>> I would assert this is a serious bug that should be addressed ASAP...
>>
>> As for the question in the subject for handling ranges in REs,
>> there has been recent work in changing as you suggest:
>>
>> http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105
> ----
>
> Recent? ?
> The most recent posts on that thread look to be from June of last year.
> I.e. a year ago.
>
> I'm trying to stay focused on specific problems -- UTF-8 ordering is defined.
> the gnu library doesn't follow it.
>
> Major problem with so many progs relying on the lib!...

cheers,
Pádraig.
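
For reference, a minimal sketch of the LC_ALL point made earlier in this message. The helper name effective_collate is hypothetical, and en_US.UTF-8 is only a placeholder that need not be installed. The helper follows the POSIX precedence (a non-empty LC_ALL first, then LC_COLLATE, then LANG), and the demo shows that LC_ALL=C wins whatever LC_COLLATE says.

```shell
#!/bin/sh
# Hypothetical helper: report which variable decides the collation
# locale. POSIX precedence: non-empty LC_ALL, then LC_COLLATE, then LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf 'LC_ALL=%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf 'LC_COLLATE=%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf 'LANG=%s\n' "$LANG"
    else
        printf 'default (C/POSIX)\n'
    fi
}

# What governs collation in the current environment?
effective_collate

# Demo: LC_ALL=C overrides LC_COLLATE, so this is a plain byte sort
# (A B a b) even though LC_COLLATE names another locale.
demo=$(printf '%s\n' b A a B | LC_COLLATE=en_US.UTF-8 LC_ALL=C sort)
printf '%s\n' "$demo"
```

Running effective_collate before a test script like the one quoted above is a quick way to rule out a stray LC_ALL as the cause of unexpected ordering.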