Hi Gerd
 
For any code page except Japanese/cp932, AnyCharsetEncoder takes
anything that can't be represented, tries to find a reasonable ASCII
representation or "?", then writes this straight to the output, one
byte per character. This is a big assumption for far-eastern charsets
and will most likely generate garbage, possibly with invalid
shift-in/out sequences...
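
To make the "big assumption" concrete, here is a minimal sketch (not mkgmap code) that checks whether a charset encodes printable ASCII as single, identical bytes, which is what writing `(byte) s.charAt(i)` relies on. The class and method names are made up for illustration, and UTF-16BE is used only as a stand-in that every JVM provides; it is not a far-eastern code page and the check says nothing about shift states.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiTransparency {
    /** True if every printable ASCII char encodes to exactly one byte of the same value. */
    static boolean isAsciiTransparent(Charset cs) {
        for (char c = 0x20; c < 0x7f; c++) {
            try {
                // A fresh encoder per character, so each one is encoded from the initial state.
                ByteBuffer bytes = cs.newEncoder().encode(CharBuffer.wrap(String.valueOf(c)));
                if (bytes.remaining() != 1 || bytes.get(0) != (byte) c)
                    return false;
            } catch (CharacterCodingException e) {
                return false; // this character cannot be encoded at all
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Raw ASCII bytes are safe here ...
        System.out.println(isAsciiTransparent(StandardCharsets.ISO_8859_1));
        // ... but not here, where "?" needs two bytes.
        System.out.println(isAsciiTransparent(StandardCharsets.UTF_16BE));
    }
}
```

Passing this check is necessary but not sufficient: a stateful encoding could still be in a shifted state at the point where the raw byte is written.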

SparseTransliterator is a very strange special case, without any
explanation. Doing a bit of searching, it was submitted as a change
because a user had a map that needed to be in Japanese/cp932 and that
also contained Latin characters. The characters with macrons couldn't
be encoded. Many others could. The rest of Unicode that can't be
encoded resulted in garbage.

Your patch fixes the "rest of Unicode" problem for cp932. It doesn't
make use of the 'latin1' transliterator's ability to provide reasonable
replacement chars that can be encoded. It also doesn't deal with
possible problems for other (non-European) charsets.

I've attached cs932-V3.patch that addresses both of these issues.

SparseTransliterator.java can then be removed.
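
The general idea (transliterate, then ask the target charset whether the result can really be encoded, and only fall back to "?" if not) can be sketched in isolation roughly as follows. This is a simplified illustration, not the patch itself: the tiny TRANSLIT map is a hypothetical stand-in for the 'latin1' tables, the class and method names are made up, and the sketch substitutes at the character level with `canEncode` rather than working on the encoder's output buffer as AnyCharsetEncoder does.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class FallbackEncodeSketch {
    // Hypothetical transliteration table, keyed by Unicode code point.
    static final Map<Integer, String> TRANSLIT = Map.of(
            0x014D, "o",        // o with macron -> o
            0x01D6, "\u00FC");  // u with diaeresis and macron -> u with diaeresis

    /** Returns text in which every code point is known to be encodable in cs. */
    static String makeEncodable(String text, Charset cs) {
        CharsetEncoder enc = cs.newEncoder();
        StringBuilder sb = new StringBuilder();
        // Iterate by code point so a surrogate pair is treated as one character.
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (enc.canEncode(s)) {
                sb.append(s);
                return;
            }
            String t = TRANSLIT.get(cp);
            // The transliteration is itself checked against the target charset.
            sb.append(t != null && enc.canEncode(t) ? t : "?");
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "T\u014Dky\u014D";
        byte[] bytes = makeEncodable(name, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(bytes.length + " bytes");
    }
}
```

Encoding the cleaned string in one go leaves any shift sequences to the charset encoder. With a charset such as "ms932" the same method should apply, provided the runtime ships that charset.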

Ticker

On Wed, 2021-11-17 at 18:00 +0000, Gerd Petermann wrote:
> Hi Ticker,
> 
> > For some other character sets the result could be invalid or
> > garbage.
> OK, I assumed that '?' is always at the same position, might be wrong
> with that.
> SparseTransliterator is only used for cs932.
> 
> Gerd

Index: src/uk/me/parabola/imgfmt/app/labelenc/AnyCharsetEncoder.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/labelenc/AnyCharsetEncoder.java	(revision 4817)
+++ src/uk/me/parabola/imgfmt/app/labelenc/AnyCharsetEncoder.java	(working copy)
@@ -80,8 +80,9 @@
 				if (result.length() == 1) {
 					s0 = String.valueOf(charBuffer.get());
 				} else {
-					// Don't know under what circumstances this will be called and may not be the
-					// correct thing to do when it does happen.
+					// Probably a UTF-16 surrogate pair (represents a single unicode point)
+					// Handle as the general case, but the translitorator will need a extra code and
+					// tables to handle these.
 					StringBuilder sb = new StringBuilder();
 					for (int i = 0; i < result.length(); i++)
 						sb.append(charBuffer.get());
@@ -90,18 +91,27 @@
 				}
 
 				String s = transliterator.transliterate(s0);
-
-				// Make sure that there is enough space for the transliterated string
-				while (outBuf.limit() < outBuf.position() + s.length())
-					outBuf = reallocBuf(outBuf);
-
-				if (s.equals(s0)) {
-					// string is still unmappable
-					outBuf.put(encoder.replacement()); //typically '?'
+				if (s.equals(s0)) { // string is still unmappable
+					// however, at moment, TableTranslitorator returns a "?" so won't happen
+					// Make sure that there is enough space for the replacement
+					while (outBuf.limit() < outBuf.position() + encoder.replacement().length)
+						outBuf = reallocBuf(outBuf);
+					outBuf.put(encoder.replacement()); // typically '?'
 				} else {
-					for (int i = 0; i < s.length(); i++) {
-						outBuf.put((byte) s.charAt(i));
-					}
+					// some transliteration has happened so see if this can be encoded
+					CharBuffer translitBuffer = CharBuffer.wrap(s);
+					do {
+						result = encoder.encode(translitBuffer, outBuf, true);
+						if (result.isUnmappable()) { // some can't be encoded
+							while (outBuf.limit() < outBuf.position() + encoder.replacement().length)
+								outBuf = reallocBuf(outBuf);
+							outBuf.put(encoder.replacement());
+							break;  // and give up with this inner part
+						} else if (result == CoderResult.OVERFLOW) {
+							// Ran out of space in the output
+							outBuf = reallocBuf(outBuf);
+						}
+					} while (result != CoderResult.UNDERFLOW);
 				}
 
 			} else if (result == CoderResult.OVERFLOW) {
Index: src/uk/me/parabola/imgfmt/app/labelenc/CodeFunctions.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/labelenc/CodeFunctions.java	(revision 4817)
+++ src/uk/me/parabola/imgfmt/app/labelenc/CodeFunctions.java	(working copy)
@@ -101,10 +101,17 @@
 		case "cp932":
 		case "ms932":
 			funcs.setEncodingType(ENCODING_FORMAT10);
-			funcs.setEncoder(new AnyCharsetEncoder("ms932", new SparseTransliterator("nomacron")));
+			//funcs.setEncoder(new AnyCharsetEncoder("ms932", new SparseTransliterator("nomacron")));
+			// Note: above was a strange special case - it just removed macrons because it was known that
+			// cp932 didn't support them.
+			// However, there are many other unicode values that cp932 doesnt support so use more
+			// intelligent login in AnyCharSetCoder to find more complete tables of mapping from latin1
+			funcs.setEncoder(new AnyCharsetEncoder("ms932", new TableTransliterator("latin1")));
 			funcs.setDecoder(new AnyCharsetDecoder("ms932"));
 			funcs.setCodepage(932);
 			break;
+	//	case "other far-eastern code pages":
+			// probably ENCODING_FORMAT10 if variable width and generally like above
 		default:
 			funcs.setEncodingType(ENCODING_FORMAT9);
 			funcs.setDecoder(new AnyCharsetDecoder(charset));
_______________________________________________
mkgmap-dev mailing list
mkgmap-dev@lists.mkgmap.org.uk
https://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
