Hi Gerd,

I tried various approaches to fixing "Find" when the fixed-length Mdr17 (and maybe also Mdr12) prefix contains characters that have sort "expand" entries, but I couldn't make it work. I could document these attempts in Sort.java if you feel this is worthwhile.
New patch attached. For cp1252 it leaves "ß" as its own PRIMARY after "s". I also moved æ, Æ etc. to be PRIMARIES, on the grounds that their behaviour will be the same as that of "ß", and made cp1254 consistent, as it had similar partial fixes.

The main reason for the patch is to fix all the other sort/cp*.txt files that had the line " < #", where the "#" was taken as the start of a comment, resulting in "#" being ignored in collation.

With the Display patch (sent previously, but also attached here), SrtDisplay can reproduce the resource/sort file from the binary SRT section.

Ticker
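To illustrate the "#" problem, here is a minimal sketch of how a reader that strips "#" comments would lose the character. It is not the actual mkgmap SrtTextReader code; the class SortLineSketch and its methods are hypothetical names, and the rule "four hex digits name a code point" is assumed from the way the patch writes 0023.

```java
import java.util.ArrayList;
import java.util.List;

public class SortLineSketch {
    // Assumed comment rule: everything from the first '#' to end of line is dropped.
    static String stripComment(String line) {
        int i = line.indexOf('#');
        return (i >= 0 ? line.substring(0, i) : line).trim();
    }

    // Assumed token rule: four hex digits name a code point,
    // anything else stands for its own first character.
    static int tokenToChar(String tok) {
        if (tok.matches("[0-9a-fA-F]{4}"))
            return Integer.parseInt(tok, 16);
        return tok.codePointAt(0);
    }

    // Returns the character introduced by a "< x" line, or an empty list
    // if nothing is left once the comment has been removed.
    static List<Integer> primaries(String line) {
        List<Integer> out = new ArrayList<>();
        String s = stripComment(line);
        if (!s.startsWith("<"))
            return out;
        String rest = s.substring(1).trim();
        if (!rest.isEmpty())
            out.add(tokenToChar(rest));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(primaries(" < #"));    // prints []   : '#' swallowed as a comment
        System.out.println(primaries(" < 0023")); // prints [35] : '#' given by its code point
    }
}
```

Writing the character as 0023 avoids the comment marker altogether, which is the approach the patch takes in each cp*.txt file.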
Index: resources/sort/README =================================================================== --- resources/sort/README (revision 4856) +++ resources/sort/README (working copy) @@ -35,22 +35,24 @@ I believe that these are arbitary identifiers. Here is a registry of values we are using. If you make a variation on a code-page sort-order then give it a different id2 value. +It is believed that having sorts with the same id1/id2 but different data loaded +on the same device will give unexpected results -code-page id1 id2 +code-page id1 description -1250 12 1 -1251 8 1 -1252 7 2 -1253 13 1 -1254 14 1 -1255 15 1 -1256 16 1 -1257 17 1 -1258 18 1 -874 11 1 -932 9 1 -936 5 1 -949 10 1 +1250 12 Central European sort +1251 8 Cyrillic sort +1252 7 Western European sort +1253 13 Greek sort +1254 14 Turkish sort +1255 15 Hebrew sort +1256 16?9 Arabic sort cp1256.txt has id1=9, original version of this doc said 16 +1257 17 Latin Baltic sort +1258 18 Vietnamese sort +874 11 Thai. 8-bit not implemented +932 9 Japanese. Shift JIS not implemented. Note id1=9 used by 1256 +936 5 Simplified Chinese not implemented +949 10 Korean. Unified Hangui not implemented -65001 19 4 -0 0 0 +65001 19 Unicode sort +0 0 ASCII 7-bit sort Index: resources/sort/cp0.txt =================================================================== --- resources/sort/cp0.txt (revision 4856) +++ resources/sort/cp0.txt (working copy) @@ -1,9 +1,11 @@ codepage 0 id1 0 -id2 1 +# 10-Jan-2022 Increment id2/version. 
Fix '#' to 0023 +id2 2 description "ASCII 7-bit sort" characters + =0008=000e=000f=0010=0011=0012=0013=0014=0015=0016=0017=0018=0019=001a=001b=001c=001d=001e=001f=007f,0001,0002,0003,0004,0005,0006,0007 < 0009 < 000a @@ -32,7 +34,7 @@ < / < \ < & - < # + < 0023 < % < ` < ^ @@ -79,3 +81,5 @@ < x,X < y,Y < z,Z + +# ends Index: resources/sort/cp1250.txt =================================================================== --- resources/sort/cp1250.txt (revision 4856) +++ resources/sort/cp1250.txt (working copy) @@ -1,9 +1,11 @@ codepage 1250 id1 12 -id2 1 +# 10-Jan-2022 Increment id2/version. Fix '#' to 0023 +id2 2 description "Central European sort" characters + =0008=000e=000f=0010=0011=0012=0013=0014=0015=0016=0017=0018=0019=001a=001b=001c=001d=001e=001f=007f=00ad,0001,0002,0003,0004,0005,0006,0007 < 0009 < 000a @@ -45,7 +47,7 @@ < / < \ < & - < # + < 0023 < % < ‰ < † @@ -120,3 +122,5 @@ expand ˛ to § 0020 expand ß to s s expand ™ to T M + +# ends Index: resources/sort/cp1251.txt =================================================================== --- resources/sort/cp1251.txt (revision 4856) +++ resources/sort/cp1251.txt (working copy) @@ -1,9 +1,11 @@ codepage 1251 id1 8 -id2 1 +# 10-Jan-2022 Increment id2/version. Fix '#' to 0023 +id2 2 description "Cyrillic sort" characters + =0008=000e=000f=0010=0011=0012=0013=0014=0015=0016=0017=0018=0019=001a=001b=001c=001d=001e=001f=007f=00ad,0001,0002,0003,0004,0005,0006,0007 < 0009 < 000a @@ -45,7 +47,7 @@ < / < \ < & - < # + < 0023 < % < ‰ < † @@ -152,7 +154,8 @@ < э,Э < ю,Ю < я,Я - expand … to . . . expand № to N o expand ™ to T M + +# ends Index: resources/sort/cp1252.txt =================================================================== --- resources/sort/cp1252.txt (revision 4856) +++ resources/sort/cp1252.txt (working copy) @@ -1,9 +1,7 @@ - - -# This must be first before any 'code' lines. codepage 1252 id1 7 -id2 2 +# 10-Jan-2022 Increment id2/version. Add comment about expansions. 
Move AE/ae/OE/oe +id2 3 description "Western European sort" characters @@ -96,7 +94,8 @@ < 7 < 8 < 9 - < a,A,ª ; á,Á ; à,À ; â, ; å,Å ; ä,Ä ; ã,à ; æ,Æ + < a,A,ª ; á,Á ; à,À ; â, ; å,Å ; ä,Ä ; ã,à + < æ,Æ < b,B < c,C ; ç,Ç < d,D ; ð,Ð @@ -111,7 +110,8 @@ < l,L < m,M < n,N ; ñ,Ñ - < o,O,º ; ó,Ó ; ò,Ò ; ô,Ô ; ö,Ö ; õ,Õ ; ø,Ø ; œ,Œ + < o,O,º ; ó,Ó ; ò,Ò ; ô,Ô ; ö,Ö ; õ,Õ ; ø,Ø + < œ,Œ < p,P < q,Q < r,R @@ -131,3 +131,14 @@ expand ½ to 1 / 2 expand ¾ to 3 / 4 expand ™ to T M +# expand cause "Find" problems with some Garmin devices. City search (and probably others) don't work +# when the composite char is in the short, fixed-length, MDR prefix that is PRIMARY unique. +# Disabling the following and putting the char, as PRIMARY, after its related char, improves matters. +# Leave the above because no method of inputting them anyway and unlikely at start of names. +#expand ß to s s +#expand Æ to A E +#expand æ to a e +#expand Œ to O E +#expand œ to o e + +# ends Index: resources/sort/cp1253.txt =================================================================== --- resources/sort/cp1253.txt (revision 4856) +++ resources/sort/cp1253.txt (working copy) @@ -1,6 +1,7 @@ codepage 1253 id1 13 -id2 1 +# 10-Jan-2022 Increment id2/version. Fix '#' to 0023 +id2 2 description "Greek sort" characters @@ -47,7 +48,7 @@ < / < \ < & - < # + < 0023 < % < ‰ < † @@ -140,3 +141,5 @@ expand … to . . . expand ½ to 1 / 2 expand ™ to T M + +# ends Index: resources/sort/cp1254.txt =================================================================== --- resources/sort/cp1254.txt (revision 4856) +++ resources/sort/cp1254.txt (working copy) @@ -1,10 +1,12 @@ codepage 1254 id1 14 -id2 1 +# 10-Jan-2022 Increment id2/version. Fix '#' to 0023. 
Move AE/ae/OE/oe/ß +id2 2 description "Turkish sort" characters -= 0008=000e=000f=0010=0011=0012=0013=0014=0015=0016=0017=0018=0019=001a=001b=001c=001d=001e=001f=007f=00ad,0001,0002,0003,0004,0005,0006,0007 + +=0008=000e=000f=0010=0011=0012=0013=0014=0015=0016=0017=0018=0019=001a=001b=001c=001d=001e=001f=007f=00ad,0001,0002,0003,0004,0005,0006,0007 < 0009 < 000a < 000b @@ -47,7 +49,7 @@ < / < \ < & - < # + < 0023 < % < ‰ < † @@ -92,7 +94,8 @@ < 7 < 8 < 9 - < a,A,ª ; á,Á ; à,À ; â, ; å,Å ; ä,Ä ; ã,à ; æ,Æ + < a,A,ª ; á,Á ; à,À ; â, ; å,Å ; ä,Ä ; ã,à + < æ,Æ < b,B < c,C ; ç,Ç < d,D @@ -108,11 +111,13 @@ < l,L < m,M < n,N ; ñ,Ñ - < o,O,º ; ó,Ó ; ò,Ò ; ô,Ô ; ö,Ö ; õ,Õ ; ø,Ø ; œ,Œ + < o,O,º ; ó,Ó ; ò,Ò ; ô,Ô ; ö,Ö ; õ,Õ ; ø,Ø + < œ,Œ < p,P < q,Q < r,R < s,S ; š,Š ; ş,Ş + < ß < t,T < u,U ; ú,Ú ; ù,Ù ; û,Û ; ü,Ü < v,V @@ -125,5 +130,12 @@ expand ¼ to 1 / 4 expand ½ to 1 / 2 expand ¾ to 3 / 4 -expand ß to s s expand ™ to T M +# see comment in ./cp1252.txt +#expand ß to s s +#expand Æ to A E +#expand æ to a e +#expand Œ to O E +#expand œ to o e + +# ends Index: resources/sort/cp1255.txt =================================================================== --- resources/sort/cp1255.txt (revision 4856) +++ resources/sort/cp1255.txt (working copy) @@ -1,6 +1,7 @@ codepage 1255 id1 15 -id2 1 +# 10-Jan-2022 Increment id2/version. Fix '#' to 0023 +id2 2 description "Hebrew sort" characters @@ -49,7 +50,7 @@ < / < \ < & - < # + < 0023 < % < ‰ < † @@ -157,3 +158,5 @@ expand װ to ו ו expand ױ to ו י expand ײ to י י + +# ends Index: resources/sort/cp1256.txt =================================================================== --- resources/sort/cp1256.txt (revision 4856) +++ resources/sort/cp1256.txt (working copy) @@ -1,176 +1,176 @@ - codepage 1256 id1 9 -id2 1 +# 10-Jan-2022 Increment id2/version. 
Fix '#' to 0023 +id2 2 description "Arabic sort" characters =0008=000e=000f=0010=0011=0012=0013=0014=0015=0016=0017=0018=0019=001a=001b=001c=001d=001e=001f=007f=200c=200d=00ad=ـ=200e=200f,0001,0002,0003,0004,0005,0006,0007 ; 064b ; 064c ; 064d ; 064e ; 064f ; 0650 ; 0651 ; 0652 -< 0009 -< 000a -< 000b -< 000c -< 000d -< 0020,00a0 -< _ -< - -< – -< — -< 002c -< ، -< 003b -< ؛ -< : -< ! -< ? -< ؟ -< . -< · -< ' -< ‘ -< ’ -< ‚ -< ‹ -< › -< " -< “ -< ” -< „ -< « -< » -< ( -< ) -< [ -< ] -< { -< } -< @ -< * -< / -< \ -< & -< # -< % -< ‰ -< † -< ‡ -< • -< ` -< ´ -< ^ -< ¯ -< ¨ -< ¸ -< § -< ¶ -< © -< ® -< ˆ -< ° -< + -< ± -< ÷ -< × -< 003c -< 003d -< > -< ¬ -< | -< ¦ -< ~ -< ¤ -< ¢ -< $ -< £ -< ¥ -< € -< 0 -< 1,¹ -< 2,² -< 3,³ -< 4 -< 5 -< 6 -< 7 -< 8 -< 9 -< a,A ; à ; â -< b,B -< c,C ; ç -< d,D -< e,E ; é ; è ; ê ; ë -< f,F -< ƒ -< g,G -< h,H -< i,I ; î ; ï -< j,J -< k,K -< l,L -< m,M -< n,N -< o,O ; ô -< p,P -< q,Q -< r,R -< s,S -< t,T -< u,U ; ù ; û ; ü -< v,V -< w,W -< x,X -< y,Y -< z,Z -< µ -< ء -< آ -< أ -< ؤ -< إ -< ئ -< ا -< ب -< پ -< ة -< ت -< ث -< ٹ -< ج -< چ -< ح -< خ -< د -< ذ -< ڈ -< ر -< ز -< ڑ -< ژ -< س -< ش -< ص -< ض -< ط -< ظ -< ع -< غ -< ف -< ق -< ك -< ک -< گ -< ل -< م -< ن -< ں -< ه -< ھ -< ہ -< و -< ى -< ي -< ے + < 0009 + < 000a + < 000b + < 000c + < 000d + < 0020,00a0 + < _ + < - + < – + < — + < 002c + < ، + < 003b + < ؛ + < : + < ! + < ? + < ؟ + < . 
+ < · + < ' + < ‘ + < ’ + < ‚ + < ‹ + < › + < " + < “ + < ” + < „ + < « + < » + < ( + < ) + < [ + < ] + < { + < } + < @ + < * + < / + < \ + < & + < 0023 + < % + < ‰ + < † + < ‡ + < • + < ` + < ´ + < ^ + < ¯ + < ¨ + < ¸ + < § + < ¶ + < © + < ® + < ˆ + < ° + < + + < ± + < ÷ + < × + < 003c + < 003d + < > + < ¬ + < | + < ¦ + < ~ + < ¤ + < ¢ + < $ + < £ + < ¥ + < € + < 0 + < 1,¹ + < 2,² + < 3,³ + < 4 + < 5 + < 6 + < 7 + < 8 + < 9 + < a,A ; à ; â + < b,B + < c,C ; ç + < d,D + < e,E ; é ; è ; ê ; ë + < f,F + < ƒ + < g,G + < h,H + < i,I ; î ; ï + < j,J + < k,K + < l,L + < m,M + < n,N + < o,O ; ô + < p,P + < q,Q + < r,R + < s,S + < t,T + < u,U ; ù ; û ; ü + < v,V + < w,W + < x,X + < y,Y + < z,Z + < µ + < ء + < آ + < أ + < ؤ + < إ + < ئ + < ا + < ب + < پ + < ة + < ت + < ث + < ٹ + < ج + < چ + < ح + < خ + < د + < ذ + < ڈ + < ر + < ز + < ڑ + < ژ + < س + < ش + < ص + < ض + < ط + < ظ + < ع + < غ + < ف + < ق + < ك + < ک + < گ + < ل + < م + < ن + < ں + < ه + < ھ + < ہ + < و + < ى + < ي + < ے expand … to . . . expand ¼ to 1 / 4 @@ -179,3 +179,5 @@ expand œ to o e expand Œ to O E expand ™ to T M + +# ends Index: resources/sort/cp1257.txt =================================================================== --- resources/sort/cp1257.txt (revision 4856) +++ resources/sort/cp1257.txt (working copy) @@ -1,6 +1,7 @@ codepage 1257 id1 17 -id2 1 +# 10-Jan-2022 Increment id2/version. Fix '#' to 0023 +id2 2 description "Latin Baltic sort" characters @@ -46,7 +47,7 @@ < / < \ < & - < # + < 0023 < % < ‰ < † @@ -127,3 +128,5 @@ expand Æ to A E expand ß to s s expand ™ to T M + +# ends Index: resources/sort/cp1258.txt =================================================================== --- resources/sort/cp1258.txt (revision 4856) +++ resources/sort/cp1258.txt (working copy) @@ -1,6 +1,7 @@ codepage 1258 id1 18 -id2 1 +# 10-Jan-2022 Increment id2/version. 
Fix '#' to 0023 +id2 2 description "Vietnamese sort" characters @@ -48,7 +49,7 @@ < / < \ < & - < # + < 0023 < % < ‰ < † @@ -132,3 +133,5 @@ expand Œ to O E expand ß to s s expand ™ to T M + +# ends Index: resources/sort/cp65001.txt =================================================================== --- resources/sort/cp65001.txt (revision 4856) +++ resources/sort/cp65001.txt (working copy) @@ -1,3 +1,7 @@ +# use extra/src/uk/me/parabola/util/CollationRules.java to generate some of the tables. +# This uses https://www.unicode.org/Public/UCA/latest/allkeys.txt +# see https://www.mkgmap.org.uk/pipermail/mkgmap-dev/2021q4/033096.html + codepage 65001 id1 19 id2 4 @@ -11133,3 +11137,5 @@ expand ㍕ to れ む expand ㍖ to れ ん と こ ん expand ㍗ to ゎ っ と + +# ends
Index: src/test/display/SrtDisplay.java =================================================================== --- src/test/display/SrtDisplay.java (revision 580) +++ src/test/display/SrtDisplay.java (working copy) @@ -57,6 +57,11 @@ private final Map<Integer, Integer> offsetToBlock = new HashMap<>(); + private String srtDescription; + private int codepage; + private int id1; + private int id2; + protected void print() { readCommonHeader(); readFileHeader(); @@ -119,9 +124,9 @@ d.setTitle("Description"); - String s = d.zstringValue("Description: %s"); + srtDescription = d.zstringValue("Description: %s"); - long remain = description.getLen() - s.length() - 1; + long remain = description.getLen() - srtDescription.length() - 1; d.rawValue((int) remain); d.print(outStream); @@ -138,10 +143,10 @@ d.setSectStart(start); reader.position(start); int len = d.charValue("sub header len %d"); - d.charValue("id1 %d"); - d.charValue("id2 %d"); + id1 = d.charValue("id1 %d"); + id2 = d.charValue("id2 %d"); - int codepage = d.charValue("codepage %d"); + codepage = d.charValue("codepage %d"); if (codepage == 65001) isUnicode = true; Charset charset = Sort.charsetFromCodepage(codepage); @@ -206,38 +211,82 @@ d.setTitle("------- Summary of ordering --------"); Formatter chars = new Formatter(); - Formatter comment = new Formatter(); + //Formatter comment = new Formatter(); + + // reproduce header like mkgmap resource/sort/cp*.txt entries + chars.format("\n\n\n"); + chars.format("# Compare this with resource/sort/cp%d.txt.\n\n", codepage); + chars.format("codepage %d\n", codepage); + chars.format("id1 %d\n", id1); + chars.format("id2 %d\n", id2); + chars.format("description \"%s\"\n\n", srtDescription); + chars.format("characters\n\n"); + CharPosition last = new CharPosition(0); - last.first = -1; + //last.first = -1; + last.first = 0; // start first line with zero/ignore sortOrder for (CharPosition cp : charmap) { - if (cp.expands) + if (cp.expands > 0) continue; + int unicodeChar = 
toUnicode(cp.val); + if (unicodeChar < 0) // no character defined for this position + continue; if (cp.first != last.first) { //chars.format(" # %s\n[%d] < ", comment, cp.first); - chars.format("\n< "); - comment = new Formatter(); + chars.format("\n < "); + //comment = new Formatter(); } else if (cp.second != last.second) { chars.format(" ; "); - comment.format(" ; "); + //comment.format(" ; "); } else if (cp.third != last.third) { chars.format(","); - comment.format(","); + //comment.format(","); } else { chars.format("="); - comment.format("="); + //comment.format("="); } last = cp; - chars.format("%s", fmtChar(toUnicode(cp.val))); - comment.format("U+%04x", cp.val); + chars.format("%s", fmtChar(unicodeChar)); + //comment.format("U+%04x", cp.val); } chars.format("\n"); for (CharPosition cp : charmap) { - if (cp.expands) - continue; - chars.format("%4s %s\n", fmtChar(toUnicode(cp.val)), cp); + if (cp.expands > 0) { + chars.format("expand %s to ", fmtChar(toUnicode(cp.val))); + for (int i = 0; i <= cp.expands; ++i) { + CharPosition ch = expansions.get(cp.first + i - 1); + // need to search for best char with this first/primary. Doesn't actually matter + // apart from the cosmetics of the sort/cp*.txt expand list because the secondary + // and tertiary binary sortOrders are chosen to avoid matching existing real chars. + // see mkgmap/srt/SrtTextReader.java for more info + if (ch.second > 7) + ch.second -= 7; + ch.third = ch.third >= 5 ? 
2 : 1; + int charValue = -1; + for (CharPosition scanCp : charmap) { + if (scanCp.expands > 0) + continue; + if (scanCp.first == ch.first) { + if (scanCp.second == ch.second && + scanCp.third == ch.third) { + charValue = scanCp.val; + break; + } else if (charValue < 0) { + charValue = scanCp.val; + } + } + } + if (charValue >= 0) + charValue = toUnicode(charValue); + if (charValue >= 0) + chars.format(" %c", charValue); + } + chars.format("\n"); + } } + chars.format("\n# ends\n", codepage); d.item().addText(chars.toString()); d.print(outStream); @@ -286,7 +335,11 @@ StringBuilder sb = new StringBuilder(); Formatter fmt = new Formatter(sb); fmt.format("0x%02x ", charValue); - fmt.format("(%c) ", toUnicode(charValue)); + int unicodeChar = toUnicode(charValue); + if (unicodeChar < 0) // no character defined for this position + fmt.format("NaC "); + else + fmt.format("(%c) ", unicodeChar); if ((flags & 0x1) != 0) sb.append("Letter "); if ((flags & 0x2) != 0) @@ -297,8 +350,8 @@ } else { // This is an expansion, it sorts as two or more characters (eg ß sorts near ss). // The pos is an index into srt5. - c.expands = true; - expansion(sb, c.first, (flags >> 4) & 0xf); + c.expands = (flags >> 4) & 0xf; + expansion(sb, c.first, c.expands); } item.addText(sb.toString()); @@ -373,7 +426,7 @@ CharBuffer chars = decoder.decode(b); return chars.charAt(0); } catch (CharacterCodingException e) { - return '?'; + return -1; } } @@ -472,7 +525,7 @@ private int first; private int second; private int third; - private boolean expands; + private int expands; public CharPosition(int charValue) { this.val = charValue;
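The SrtDisplay change from returning '?' to returning -1 in toUnicode can be sketched in isolation as below. This is only an illustration under stated assumptions: DecodeSketch and fmt are hypothetical names, and US-ASCII is used instead of a real map code page simply because it is guaranteed to be present in every JVM and rejects bytes above 0x7f.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeSketch {
    // Decode one byte; return -1 rather than '?' when the code page has no
    // character at this position, so callers can tell it from a real '?'.
    static int toUnicode(CharsetDecoder decoder, int val) {
        ByteBuffer b = ByteBuffer.wrap(new byte[] {(byte) val});
        try {
            return decoder.decode(b).charAt(0);
        } catch (CharacterCodingException e) {
            return -1;
        }
    }

    // Format in the style of the patched display code: "NaC" for an undefined position.
    static String fmt(CharsetDecoder decoder, int val) {
        int u = toUnicode(decoder, val);
        return u < 0 ? "NaC" : String.format("(%c)", u);
    }

    public static void main(String[] args) {
        CharsetDecoder d = StandardCharsets.US_ASCII.newDecoder();
        System.out.println(fmt(d, 0x41)); // prints (A)
        System.out.println(fmt(d, 0x80)); // prints NaC
    }
}
```

With a sentinel value, the summary loop can skip undefined positions instead of emitting a spurious "?" into the regenerated cp*.txt listing.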