Re: [mkgmap-dev] Format6Encoder/Decoder

Ticker Berkin Wed, 01 Dec 2021 02:23:11 -0800

Hi Gerd

I'll investigate index problems and Format6 a bit next week to see what
MapInstall does and if there is a correct way to represent the strings
that Mdr places, or to say what encoding they are in.


Fixing, extending and simplifying the Format6Decoder is worthwhile,
along with the CodeFunctionsTest changes, so I've made another patch
that rewords the option descriptions and stops the Format6Encoder
producing lower case, but leaves the mechanisms in place. It will
encode the extra chars "`{|}".
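
For anyone following the thread without the source open, here is a
minimal sketch of the shift scheme that the patch below keeps in place.
It is not the mkgmap code: the class name Format6ShiftSketch and its
methods are made up for illustration, the tables are trimmed copies of
those in the patch, and the symbol branch assumes shiftedSymbol() writes
SYMBOL_SHIFT followed by the table index, which the decoder implies but
this patch does not show.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a cut-down model of the Format6 shift codes. */
class Format6ShiftSketch {
    static final int LOWERCASE_SHIFT = 0x1b; // next code indexes LOWERCASE
    static final int SYMBOL_SHIFT = 0x1c;    // next code indexes SYMBOLS

    // Trimmed copies of the patch tables (printable entries only).
    static final String SYMBOLS = "@!\"#$%&'()*+,-./";                 // 0x00-0x0F
    static final String LOWERCASE = "`abcdefghijklmnopqrstuvwxyz{|}~"; // 0x00-0x1E

    /** Returns the 6-bit codes for the text, before packing into bytes. */
    static List<Integer> codes(String text) {
        List<Integer> out = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                out.add(0x00);
            } else if (c >= 'A' && c <= 'Z') {
                out.add(c - 'A' + 1);
            } else if (c >= '0' && c <= '9') {
                out.add(c - '0' + 0x20);
            } else if (SYMBOLS.indexOf(c) >= 0) {
                out.add(SYMBOL_SHIFT);
                out.add(SYMBOLS.indexOf(c));
            } else if (LOWERCASE.indexOf(c) >= 0) {
                out.add(LOWERCASE_SHIFT);
                out.add(LOWERCASE.indexOf(c));
            }
            // Anything else is dropped in this sketch.
        }
        return out;
    }

    /** Reverses codes(); handles only the codes that codes() can produce. */
    static String decode(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        int pendingShift = 0; // 0 means no shift is pending
        for (int b : codes) {
            if (pendingShift == SYMBOL_SHIFT) {
                sb.append(SYMBOLS.charAt(b));
                pendingShift = 0;
            } else if (pendingShift == LOWERCASE_SHIFT) {
                sb.append(LOWERCASE.charAt(b));
                pendingShift = 0;
            } else if (b == LOWERCASE_SHIFT || b == SYMBOL_SHIFT) {
                pendingShift = b;
            } else if (b >= 0x20 && b <= 0x29) {
                sb.append((char) ('0' + b - 0x20));
            } else if (b == 0x00) {
                sb.append(' ');
            } else {
                sb.append((char) ('A' + b - 1));
            }
        }
        return sb.toString();
    }

    /** Bytes needed for the codes at 6 bits each (terminator not counted). */
    static int packedBytes(String text) {
        return (codes(text).size() * 6 + 7) / 8;
    }

    public static void main(String[] args) {
        System.out.println(codes("Ab@"));
        System.out.println(decode(codes("Ab@ 7{")));
        System.out.println(packedBytes("AUGSBURGER STRASSE"));
        System.out.println(packedBytes("Augsburger Strasse"));
    }
}
```

Under these assumptions "Ab@" gives the codes 0x01, 0x1b 0x02, 0x1c 0x00,
and a label such as "@Road" starts with the 0x1c shift code, which is
the case Gerd reports trouble with. The two packedBytes() calls show the
cost mentioned in the class comment: every lower-case letter takes two
codes, i.e. 12 bits, so the mixed-case name needs 25 bytes against 14
for the upper-case one.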

Ticker

On Tue, 2021-11-30 at 15:15 +0000, Gerd Petermann wrote:
> Hi Ticker,
> 
> with r4821 I can still reproduce search problems with --code-page=0
> when road names start with symbols ("@Road" or "#Street") (in
> MapSource). I guess the shift character is the problem.
> With your patch it is much more likely that the first 4 characters
> contain such a shift byte.
> Or maybe the sort order for these is wrong?
> 
> I did not try to find a solution for this and I have no Garmin map
> that uses 6-bit encoding. Possibly we can just skip writing Mdr17, as
> we do with Unicode.
> 
> A few things are really confusing regarding the --code-page and
> --charset options.
> 1) if you specify e.g. --code-page=cp1252 or --code-page=ms932 mkgmap
> will silently change that to cp0. See getCodePage() in CommandArgs.
> 2) the option --charset is deprecated but still evaluated. Something
> like --charset=cp1252 --code-page=1254 probably causes trouble.
> 
> Gerd
> 
> ________________________________________
> From: mkgmap-dev <mkgmap-dev-boun...@lists.mkgmap.org.uk> on behalf
> of Gerd Petermann <gpetermann_muenc...@hotmail.com>
> Sent: Tuesday, 30 November 2021 14:21
> To: Development list for mkgmap
> Subject: Re: [mkgmap-dev] Format6Encoder/Decoder
> 
> Hi Ticker,
> 
> seems you are partly right. I created a map with cp0 and --lower-case
> and the search for road names starting with "augs" doesn't return
> Augsburger Strasse on the device.
> Without --lower-case this works as expected. I've reverted r4820 for
> now, but maybe the combination never works well. Needs more testing.
> 
> Gerd
> 
> ________________________________________
> From: mkgmap-dev <mkgmap-dev-boun...@lists.mkgmap.org.uk> on behalf
> of Ticker Berkin <rwb-mkg...@jagit.co.uk>
> Sent: Tuesday, 30 November 2021 12:15
> To: Development list for mkgmap
> Subject: Re: [mkgmap-dev] Format6Encoder/Decoder
> 
> Hi Gerd
> 
> Building a map with --code-page=0 --index --gmapsupp --gmapi
> (exactly same behaviour without --code-page=0)
> 
> Then, using raw mode editor to look at various components.
> 
> The LBL section of tiles, gmapsupp.img and gmapi looks encoded - I
> can't find any recognisable strings.
> 
> However, the gmapsupp contains the 4-char prefixes (as per Mdr17) with
> standard ascii encoding, the gmapi 00OSMMAP.MDR contains full names of
> everything, again as ascii (probably Mdr15) and .typ file is also
> ascii.
> 
> I can't find any ascii in the overview map (except the expected subfile
> names and copyright), but my test area might not have anything named at
> this level.
> 
> It is possible that MDR does this intentionally; avoiding the
> "compressed" format that Mdr15.java / MdrDisplay mentions. This
> compressed format might simply be Format6.
> 
> Ticker
> 
> On Tue, 2021-11-30 at 10:20 +0000, Gerd Petermann wrote:
> > Hi Ticker,
> > 
> > I also don't like that we still use e.g.  Charset.forName("ascii")
> > instead of StandardCharsets.US_ASCII, esp. because "ascii" is also
> > used as name for our own resource files.
> > 
> > In what situation do you see wrong encoding of Mdr15/Mdr17?
> > 
> > Gerd
> 
> 
> _______________________________________________
> mkgmap-dev mailing list
> mkgmap-dev@lists.mkgmap.org.uk
> https://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev

Index: doc/options.txt
===================================================================
--- doc/options.txt	(revision 4823)
+++ doc/options.txt	(working copy)
@@ -145,11 +145,10 @@
 === Label options ===
 
 ;--code-page=number
-:     Specify which international character set is to be used. Only 8 bit
-character sets are supported so you have to specify which code page you
-want to use.
-It is entirely dependent on the device firmware which code pages are
-supported.
+:     Specify which character set is to be used.
+The default is --code-page=0 which uses a 6-bit encoding of ASCII characters.
+Mkgmap supports many 8-bit code-pages and Unicode. Other, multi-byte, code-pages might work.
+Which code pages are supported depends entirely on the device firmware.
 
 ;--latin1
 : 	This is equivalent to --code-page=1252.
@@ -159,8 +158,10 @@
 Unicode maps produced by mkgmap.
 
 ;--lower-case
-: 	Allow labels to contain lower case letters.  Note that many
-Garmin devices are not able to display lower case letters at an angle.
+: 	Allow labels to contain lower case letters.
+Note that some older Garmin devices are not able to display lower case letters at an angle.
+Also note that newer devices InitCap text, so, even without this option, text appears as mixed case.
+This option is ignored when using the default code-page=0.
 
 === Address search options ===
 
Index: resources/help/en/options
===================================================================
--- resources/help/en/options	(revision 4823)
+++ resources/help/en/options	(working copy)
@@ -145,10 +145,10 @@
 === Label options ===
 
 --code-page=number
-    Specify which international character set is to be used. Only 8 bit
-    character sets are supported so you have to specify which code page you
-    want to use. It is entirely dependent on the device firmware which code
-    pages are supported.
+    Specify which character set is to be used. The default is --code-page=0
+    which uses a 6-bit encoding of ASCII characters. Mkgmap supports many 8-bit
+    code-pages and Unicode. Other, multi-byte, code-pages might work. Which
+    code pages are supported depends entirely on the device firmware.
 
 --latin1
     This is equivalent to --code-page=1252.
@@ -158,8 +158,10 @@
     support Unicode maps produced by mkgmap.
 
 --lower-case
-    Allow labels to contain lower case letters. Note that many Garmin devices
-    are not able to display lower case letters at an angle.
+    Allow labels to contain lower case letters. Note that some older Garmin
+    devices are not able to display lower case letters at an angle. Also note
+    that newer devices InitCap text, so, even without this option, text appears
+    as mixed case. This option is ignored when using the default code-page=0.
 
 === Address search options ===
 
Index: src/uk/me/parabola/imgfmt/app/labelenc/BaseEncoder.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/labelenc/BaseEncoder.java	(revision 4823)
+++ src/uk/me/parabola/imgfmt/app/labelenc/BaseEncoder.java	(working copy)
@@ -28,7 +28,7 @@
  * @author Steve Ratcliffe
  */
 public class BaseEncoder {
-	private static final Logger log = Logger.getLogger(BaseEncoder.class);
+	protected static final Logger log = Logger.getLogger(BaseEncoder.class);
 
 	public static final EncodedText NO_TEXT = new EncodedText(null, 0, null);
 
Index: src/uk/me/parabola/imgfmt/app/labelenc/Format6Decoder.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/labelenc/Format6Decoder.java	(revision 4823)
+++ src/uk/me/parabola/imgfmt/app/labelenc/Format6Decoder.java	(working copy)
@@ -83,22 +83,10 @@
 		if (symbol) {
 			symbol = false;
 			c = Format6Encoder.SYMBOLS.charAt(b);
-		}
-		else if(lowerCaseOrSeparator) {
+		} else if (lowerCaseOrSeparator) {
 			lowerCaseOrSeparator = false;
-			if(b == 0x2b || b == 0x2c) {
-				c = (char)(b - 0x10); // "thin" separator
-			}
-			else if(Character.isLetter(b)) {
-				// lower case letter
-				c = Character.toLowerCase(Format6Encoder.LETTERS.charAt(b));
-			}
-			else {
-				// not a letter so just use as is (could be a digit)
-				c = Format6Encoder.LETTERS.charAt(b);
-			}
-		}
-		else {
+			c = Format6Encoder.LOWERCASE.charAt(b);
+		} else {
 			switch(b) {
 			case 0x1B:
 				// next char is lower case or a separator
@@ -110,13 +98,6 @@
 				symbol = true;
 				return;
 
-			case 0x1D:
-			case 0x1E:
-			case 0x1F:
-				// these are separators - use as is
-				c = (char)b;
-				break;
-
 			default:
 				c = Format6Encoder.LETTERS.charAt(b);
 				break;
Index: src/uk/me/parabola/imgfmt/app/labelenc/Format6Encoder.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/labelenc/Format6Encoder.java	(revision 4823)
+++ src/uk/me/parabola/imgfmt/app/labelenc/Format6Encoder.java	(working copy)
@@ -17,12 +17,11 @@
 package uk.me.parabola.imgfmt.app.labelenc;
 
 import java.text.Normalizer;
-import java.util.Locale;
 
 /**
- * Format according to the '6 bit' .img format.  The text is first upper
- * cased.  Any letter with a diacritic or accent is replaced with its base
- * letter.
+ * Format according to the '6 bit' .img format.
+ * Any letter with a diacritic or accent is replaced with its base letter.
+ * Characters from other alphabets are transliterated if resources/chars/ascii/ data exists.
  *
  * For example Körnerstraße would become KORNERSTRASSE,
  * Řípovská would become RIPOVSKA etc.
@@ -32,25 +31,44 @@
  *
  * @author Steve Ratcliffe
  * @see <a href="http://garmin-img.sf.net">Garmin IMG File Format</a>
+ *
+ * Although Format6 supports lower-case, the default is to force upper case and ignore
+ * requests to change to mixed-case.
+ * The main reason for this is that indexed searching for streets has been found not to
+ * work in some cases.
+ * Another reason is that each lower-case letter needs 12 bits, so, with typical OSM data,
+ * almost any other code-page will be more compact.
  */
 public class Format6Encoder extends BaseEncoder implements CharacterEncoder {
 
-	// This is 0x1b is the source document, but the accompanying code uses
-	// the value 0x1c, which seems to work.
-	private static final int SYMBOL_SHIFT = 0x1c;
+	// Following are swapped in the above John Mechalas document, but this is what works:
+	private static final int LOWERCASE_SHIFT = 0x1b;
+	private static final int SYMBOL_SHIFT    = 0x1c;
 
 	public static final String LETTERS =
 		" ABCDEFGHIJKLMNO" +	// 0x00-0x0F
-		"PQRSTUVWXYZxx   " +	// 0x10-0x1F
-		"0123456789\u0001\u0002\u0003\u0004\u0005\u0006";	// 0x20-0x2F
+		"PQRSTUVWXYZxx\u001d\u001e\u001f" +	// 0x10-0x1F  xx are above SHIFTs. prefix/suffix indicators
+		"0123456789\u0001\u0002\u0003\u0004\u0005\u0006";	// 0x20-0x2F  digits + shields
 
 	public static final String SYMBOLS =
 		"@!\"#$%&'()*+,-./" +	// 0x00-0x0F
-		"xxxxxxxxxx:;<=>?" +	// 0x10-0x1F
-		"xxxxxxxxxxx[\\]^_";	// 0x20-0x2F
+		"          :;<=>?" +	// 0x10-0x1F
+		"°          [\\]^_";	// 0x20-0x2F
+	//   ^ looks like degree (\u00b0) on MapSource/eTrex. Won't happen as transliterated to "deg"
+	//   0123456789abcdef
+	public static final String LOWERCASE =
+		"`abcdefghijklmno" +	// 0x00-0x0F  back-tick
+		"pqrstuvwxyz{|}~ " +	// 0x10-0x1F
+		"           \u001b\u001c   ";     // 0x20-0x2F  more prefix/suffix indicators
 
-	private final Transliterator transliterator = new TableTransliterator("ascii");
+	private final Transliterator transliterator;
 
+	public Format6Encoder() {
+		transliterator = new TableTransliterator("ascii");
+		transliterator.forceUppercase(true);  // default to upper case
+		super.setUpperCase(true);
+	}
+
 	/**
 	 * Encode the text into the 6 bit format.  See the class level notes.
 	 *
@@ -62,7 +80,7 @@
 		if (text == null || text.isEmpty())
 			return NO_TEXT;
 		String normalisedText = Normalizer.normalize(text, Normalizer.Form.NFC);
-		String s = transliterator.transliterate(normalisedText).toUpperCase(Locale.ENGLISH);
+		String s = transliterator.transliterate(normalisedText);  // it does the upper if forceUpper
 
 		// Allocate more than enough space on average for the label.
 		// if you overdo it then it will waste a lot of space , but
@@ -78,8 +96,8 @@
 				put6(buf, off++, c - 'A' + 1);
 			} else if (c >= '0' && c <= '9') {
 				put6(buf, off++, c - '0' + 0x20);
-			} else if (c == 0x1b || c == 0x1c) {
-				put6(buf, off++, 0x1b);
+			} else if (c == 0x1b || c == 0x1c) {  // shiftedLowerCase() does same thing
+				put6(buf, off++, LOWERCASE_SHIFT);
 				put6(buf, off++, c + 0x10);
 			} else if (c >= 0x1d && c <= 0x1f) {
 				put6(buf, off++, c);
@@ -86,8 +104,14 @@
 			} else if (c >= 1 && c <= 6) {
 				// Highway shields
 				put6(buf, off++, 0x29 + c);
+			} else if (c >= 'a' && c <= 'z') {
+				put6(buf, off++, LOWERCASE_SHIFT);
+				put6(buf, off++, c - 'a' + 1);
 			} else {
+				int rememberOff = off;
 				off = shiftedSymbol(buf, off, c);
+				if (off == rememberOff)
+					off = shiftedLowerCase(buf, off, c);
 			}
 		}
 
@@ -119,6 +143,16 @@
 		return off;
 	}
 
+	private int shiftedLowerCase(byte[] buf, int startOffset, char c) {
+		int off = startOffset;
+		int ind = LOWERCASE.indexOf(c);
+		if (ind >= 0) {
+			put6(buf, off++, LOWERCASE_SHIFT);
+			put6(buf, off++, ind);
+		}
+		return off;
+	}
+
 	/**
 	 * Each character is packed into 6 bits.  This keeps track of everything so
 	 * that the character can be put into the right place in the byte array.
@@ -149,4 +183,11 @@
 
 		return buf;
 	}
+
+	@Override
+	public void setUpperCase(boolean upperCase) {
+		// Ignore requests to allow mixed case
+		//super.setUpperCase(upperCase);
+		//transliterator.forceUppercase(upperCase);
+	}
 }
Index: test/uk/me/parabola/imgfmt/app/labelenc/CodeFunctionsTest.java
===================================================================
--- test/uk/me/parabola/imgfmt/app/labelenc/CodeFunctionsTest.java	(revision 4823)
+++ test/uk/me/parabola/imgfmt/app/labelenc/CodeFunctionsTest.java	(working copy)
@@ -28,6 +28,7 @@
 		assertEquals("code page", 0, functions.getCodepage());
 		assertEquals("encoding type", 6, functions.getEncodingType());
 		CharacterEncoder enc = functions.getEncoder();
+		((BaseEncoder) enc).setUpperCase(true);
 
 		EncodedText etext = enc.encodeText("hello world");
 		byte[] ctext = etext.getCtext();
@@ -58,8 +59,13 @@
 		CodeFunctions functions = CodeFunctions.createEncoderForLBL(6, 0);
 
 		CharacterEncoder encoder = functions.getEncoder();
+		((BaseEncoder) encoder).setUpperCase(true);
+
 		// Twülpstedt contains u + "COMBINING DIAERESIS" (0x75 + 0x308)
-		EncodedText text = encoder.encodeText("Körnerstraße, Twülpstedt, Velkomezeříčská, Skólavörðustigur");
+		String tstStr = "Körnerstraße, Twülpstedt, Velkomezeříčská, Skólavörðustigur";
+		//              "12345678901234567890123456789012345678901234567890123456789"
+		assertEquals("tstStr length", 60, tstStr.length()); // check COMBINING DIAERESIS not already combined
+		EncodedText text = encoder.encodeText(tstStr);
 
 		CharacterDecoder decoder = functions.getDecoder();
 		byte[] ctext = text.getCtext();
@@ -70,6 +76,7 @@
 		}
 		String result = decoder.getText().getText();
 		assertEquals("transliterated text", "KORNERSTRASSE, TWULPSTEDT, VELKOMEZERICSKA, SKOLAVORDUSTIGUR", result);
+		assertEquals("result length", 60, result.length()); // ß changed to ss, COMBINING DIAERESIS combined by normalisation
 	}
 
 	/**
@@ -81,8 +88,11 @@
 		CodeFunctions functions = CodeFunctions.createEncoderForLBL("latin1");
 
 		CharacterEncoder encoder = functions.getEncoder();
+		((BaseEncoder) encoder).setUpperCase(false);
 		// Twülpstedt contains u + "COMBINING DIAERESIS" (0x75 + 0x308)
-		EncodedText text = encoder.encodeText("Körnerstraße, Twülpstedt, Velkomezeříčská, Skólavörðustigur");
+		String tstStr = "Körnerstraße, Twülpstedt, Velkomezeříčská, Skólavörðustigur";
+		assertEquals("tstStr length", 60, tstStr.length()); // check COMBINING DIAERESIS not already combined
+		EncodedText text = encoder.encodeText(tstStr);
 
 		CharacterDecoder decoder = functions.getDecoder();
 		byte[] ctext = text.getCtext();
@@ -95,6 +105,7 @@
 		String result = decoder.getText().getText();
 		// Twülpstedt now contains LATIN SMALL LETTER U WITH DIAERESIS (u+00fc)
 		assertEquals("transliterated text", "Körnerstraße, Twülpstedt, Velkomezerícská, Skólavörðustigur", result);
+		assertEquals("result length", 59, result.length()); // COMBINING DIAERESIS combined by normalisation
 	}
 
 	/**
@@ -112,6 +123,7 @@
 		}
 
 		CharacterEncoder encoder = functions.getEncoder();
+		((BaseEncoder) encoder).setUpperCase(false);
 		EncodedText text = encoder.encodeText(sb.toString());
 
 		// This encoder appends a null byte.
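
The put6() comment in the patch says each character is packed into
6 bits. A minimal sketch of such packing follows, assuming the codes are
written most significant bit first; that bit order and the names
SixBitPackSketch and pack() are assumptions for illustration, not taken
from the mkgmap source.

```java
/** Illustrative only: pack 6-bit codes into bytes, MSB first (assumed order). */
class SixBitPackSketch {
    static byte[] pack(int[] codes) {
        // Round the total bit count up to whole bytes.
        byte[] buf = new byte[(codes.length * 6 + 7) / 8];
        for (int i = 0; i < codes.length; i++) {
            int bitOff = i * 6;
            int byteOff = bitOff / 8;
            int shift = bitOff % 8; // always 0, 2, 4 or 6
            // Position the 6 bits inside a 16-bit window that starts at byteOff.
            int window = (codes[i] & 0x3f) << (10 - shift);
            buf[byteOff] |= (byte) (window >> 8);
            if (shift > 2) {
                // The code straddles a byte boundary; store the spilled low bits.
                buf[byteOff + 1] |= (byte) window;
            }
        }
        return buf;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : pack(new int[] {1, 2, 3, 4})) {
            sb.append(String.format("%02X ", b));
        }
        System.out.println(sb.toString().trim());
    }
}
```

Four codes fill exactly three bytes, so under this bit order the codes
1, 2, 3, 4 ("ABCD") pack to 04 20 C4.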

