Re: [ntfs-3g-devel] Is there a way to ignore "Invalid or incomplete multibyte or wide character"

Erik Larsson Thu, 07 Apr 2016 02:04:00 -0700

Hi,

On 2016-04-06 19:22, Jean-Pierre André wrote:

Erik Larsson wrote:

You are very right, but the upside is that listing the directory at
least works (with the exception of the files with the bad filenames) as
opposed to aborting with error as soon as a bad filename is encountered.


So we are more error-tolerant with this patch... I think this is a good
thing given that chkdsk doesn't appear to make any efforts at repairing
this filename (it doesn't think there is any corruption on this
particular volume... tested with WinXP's chkdsk and Win8's).

Manufacturing a fake UTF-8 file name as a handle just to be able to
access these corrupted UTF-16 filenames seems overly complex for this
case... taking into account possible name collisions and such.


I agree, this is a slippery road, and your proposal
will save time dealing with rare issues.

I have a proposal that would enable accessing these broken files in
ntfs-3g and the progs. The proposal involves encoding broken surrogate
UTF-16 units into their own separate 3-byte UTF-8 sequences. This is
sometimes referred to by the acronym WTF-8 (see:
https://en.wikipedia.org/wiki/UTF-8#WTF-8 ).

The effect is that these files aren't ignored as in the previous
proposed patch but are included in the listing and can be looked up as
any other file since encoding broken UTF-16 to WTF-8 and then back to
broken UTF-16 is lossless, though the UTF-8 byte sequences returned to
user aren't fully Unicode compliant.
However I think this is the best we can do without starting to
manufacture fake file names for these entries with all that complexity.
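
As a rough illustration of the lossless round trip described above, here
is a minimal C sketch. It is not code from the patch: the helper names
wtf8_encode_surrogate and wtf8_decode_3byte are hypothetical, and the
sketch only handles a single unpaired surrogate unit, using the same
3-byte bit layout as ordinary UTF-8.

```c
#include <stdint.h>

/* Hypothetical helper (not part of ntfs-3g): encode one unpaired
 * surrogate unit (0xD800..0xDFFF) as a 3-byte UTF-8-style sequence. */
static void wtf8_encode_surrogate(uint16_t c, unsigned char out[3])
{
	out[0] = (unsigned char)(0xe0 | (c >> 12));         /* 1110xxxx */
	out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f)); /* 10xxxxxx */
	out[2] = (unsigned char)(0x80 | (c & 0x3f));        /* 10xxxxxx */
}

/* Hypothetical helper: decode a 3-byte sequence back into one UTF-16
 * unit. No validation is done here; a real decoder must check the
 * lead and continuation bytes first. */
static uint16_t wtf8_decode_3byte(const unsigned char in[3])
{
	return (uint16_t)(((in[0] & 0x0f) << 12)
			| ((in[1] & 0x3f) << 6)
			| (in[2] & 0x3f));
}
```

Feeding the unit 0xDE5C through wtf8_encode_surrogate and then through
wtf8_decode_3byte gives back 0xDE5C, which is the property that lets a
name returned by readdir be looked up again.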


Please review the attached patch.

Best regards,

- Erik

On 2016-04-06 18:14, Jean-Pierre André wrote:

Hi Erik,

Your patch will help for examining the directory, but
IMHO you will not be able to read, delete or rename
the bad file, because you will have to enter a utf8
name which will not translate to the bad Unicode for
accessing the file. Even if you use wildcards, ntfs-3g
only gets requests with utf8 names.

When accessing the directory, you will however get the
inode number to retrieve the contents using ntfscat.

Regards

Jean-Pierre

Erik Larsson wrote:

Hi,

Attached to this email is a patch which does just what I suggested...
emitting a log message but proceeding normally and ignoring the entry
when a bad filename is encountered during readdir. This fixes the
problem for me.

Jean-Pierre, please review and decide whether this is a good idea.

Best regards,

- Erik

On 2016-04-06 17:27, Erik Larsson wrote:

Hi,

I looked into this image and noticed that there are 4 filenames in
/WINDOWS/system32 that cannot be decoded.

One example is the MFT entry 30661 with the filename (as UTF-16
units): 0xDE5C 0xDC93 0x002E 0x006C 0x006F 0x0067
The filename ends with '.log' but the first two UTF-16 units are where
Unicode decoding blows up. 0xDE5C is the low value of a surrogate pair
according to Wikipedia (range: 0xDC00-0xDFFF). We are expecting the
high value (0xD800-0xDBFF) to come first.
It is then followed by another low value of a surrogate pair, 0xDC93.
This is clearly a corruption... a surrogate pair should consist of a
high value followed by a low value.
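
The pairing rule above can be checked mechanically. The following C
sketch is only an illustration, not ntfs-3g code: the function name
first_unpaired_surrogate is hypothetical, and the input is assumed to
be UTF-16 units already converted to host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the index of the first unpaired
 * surrogate in s[0..len-1], or -1 if every surrogate is part of a
 * proper high+low pair. */
static long first_unpaired_surrogate(const uint16_t *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		uint16_t c = s[i];

		if (c >= 0xD800 && c <= 0xDBFF) {
			/* High value: the next unit must be a low value. */
			if (i + 1 < len && s[i + 1] >= 0xDC00
					&& s[i + 1] <= 0xDFFF)
				i++;	/* valid pair, skip the low half */
			else
				return (long)i;
		} else if (c >= 0xDC00 && c <= 0xDFFF) {
			/* Low value with no high value before it. */
			return (long)i;
		}
	}
	return -1;
}
```

For the filename of MFT entry 30661 (0xDE5C 0xDC93 0x002E 0x006C 0x006F
0x0067) this reports index 0, since the very first unit is a low value
with nothing before it.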

I have no idea how this file was created... if Windows did this, then
we might need to be able to cope with such corruption better (e.g.
ignoring the entry during readdir and just emit a log message).

Best regards,

- Erik

On 2016-04-06 13:06, Richard W.M. Jones wrote:

The reporter kindly gave me permission to distribute the metadata
file.  I've put it up here:

   http://oirase.annexia.org/tmp/bz1301593/

   $ md5sum ntfsclone_sda2.xz
   6cadc64de3196311c8159dc12f84484c  ntfsclone_sda2.xz

Rich.

From 62783df3936d7d60d4c2629d3d6253137911054e Mon Sep 17 00:00:00 2001
From: Erik Larsson <mec...@users.sourceforge.net>
Date: Thu, 7 Apr 2016 10:51:09 +0200
Subject: [PATCH] unistr.c: Enable encoding broken UTF-16 into broken UTF-8,
 A.K.A. WTF-8.

Windows filenames may contain invalid UTF-16 sequences (specifically
broken surrogate pairs), which cannot be converted to UTF-8 if we do
strict conversion.

This patch enables encoding broken UTF-16 into similarly broken UTF-8 by
encoding any surrogate character that doesn't have a match into a separate
3-byte UTF-8 sequence.

This is "sort of" valid UTF-8, but not valid Unicode since the code
points used for surrogate pair encoding are not supposed to occur in a
valid Unicode string... but on the other hand the source UTF-16 data is
also broken, so we aren't really making things any worse.

This format is sometimes referred to as WTF-8 (Wobbly Translation
Format, 8-bit encoding) and is a common solution to represent broken
UTF-16 as UTF-8.

It is a lossless round-trip conversion, i.e. converting from broken
UTF-16 to "WTF-8" and back to UTF-16 yields the same broken UTF-16
sequence. Because of this property it enables accessing these files
by filename through ntfs-3g and the ntfsprogs (e.g. ls -la works as
expected).

To disable this behavior you can pass the preprocessor/compiler flag
'-DNTFS_3G_ALLOW_BROKEN_SURROGATES=0' when building ntfs-3g.
---
 libntfs-3g/unistr.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/libntfs-3g/unistr.c b/libntfs-3g/unistr.c
index 7f278cd..2009f06 100644
--- a/libntfs-3g/unistr.c
+++ b/libntfs-3g/unistr.c
@@ -61,6 +61,11 @@
 
 #define NOREVBOM 0  /* JPA rejecting U+FFFE and U+FFFF, open to debate */
 
+#ifndef NTFS_3G_ALLOW_BROKEN_SURROGATES
+/* Erik allowing broken UTF-16 surrogate pairs by default, open to debate. */
+#define NTFS_3G_ALLOW_BROKEN_SURROGATES 1
+#endif /* !defined(NTFS_3G_ALLOW_BROKEN_SURROGATES) */
+
 /*
  * IMPORTANT
  * =========
@@ -462,8 +467,22 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
 			if ((c >= 0xdc00) && (c < 0xe000)) {
 				surrog = FALSE;
 				count += 4;
-			} else 
+			} else {
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+				/* The first UTF-16 unit of a surrogate pair has
+				 * a value between 0xd800 and 0xdc00. It can be
+				 * encoded as an individual UTF-8 sequence if we
+				 * cannot combine it with the next UTF-16 unit
+				 * as a surrogate pair. */
+				surrog = FALSE;
+				count += 3;
+
+				--i;
+				continue;
+#else
 				goto fail;
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
+			}
 		} else
 			if (c < 0x80)
 				count++;
@@ -473,6 +492,10 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
 				count += 3;
 			else if (c < 0xdc00)
 				surrog = TRUE;
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+			else if (c < 0xe000)
+				count += 3;
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
 #if NOREVBOM
 			else if ((c >= 0xe000) && (c < 0xfffe))
 #else
@@ -548,8 +571,24 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
 				*t++ = 0x80 + ((c >> 6) & 15) + ((halfpair & 3) << 4);
 				*t++ = 0x80 + (c & 63);
 				halfpair = 0;
-			} else 
+			} else {
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+				/* The first UTF-16 unit of a surrogate pair has
+				 * a value between 0xd800 and 0xdc00. It can be
+				 * encoded as an individual UTF-8 sequence if we
+				 * cannot combine it with the next UTF-16 unit
+				 * as a surrogate pair. */
+				*t++ = 0xe0 | (halfpair >> 12);
+				*t++ = 0x80 | ((halfpair >> 6) & 0x3f);
+				*t++ = 0x80 | (halfpair & 0x3f);
+				halfpair = 0;
+
+				--i;
+				continue;
+#else
 				goto fail;
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
+			}
 		} else if (c < 0x80) {
 			*t++ = c;
 	    	} else {
@@ -562,6 +601,13 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
 		        	*t++ = 0x80 | (c & 0x3f);
 			} else if (c < 0xdc00)
 				halfpair = c;
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+			else if (c < 0xe000) {
+				*t++ = 0xe0 | (c >> 12);
+				*t++ = 0x80 | ((c >> 6) & 0x3f);
+				*t++ = 0x80 | (c & 0x3f);
+			}
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
 			else if (c >= 0xe000) {
 				*t++ = 0xe0 | (c >> 12);
 				*t++ = 0x80 | ((c >> 6) & 0x3f);
@@ -693,10 +739,16 @@ static int utf8_to_unicode(u32 *wc, const char *s)
 			/* Check valid ranges */
 #if NOREVBOM
 			if (((*wc >= 0x800) && (*wc <= 0xD7FF))
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+			  || ((*wc >= 0xD800) && (*wc <= 0xDFFF))
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
 			  || ((*wc >= 0xe000) && (*wc <= 0xFFFD)))
 				return 3;
 #else
 			if (((*wc >= 0x800) && (*wc <= 0xD7FF))
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+			  || ((*wc >= 0xD800) && (*wc <= 0xDFFF))
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
 			  || ((*wc >= 0xe000) && (*wc <= 0xFFFF)))
 				return 3;
 #endif
-- 
2.4.9 (Apple Git-60)

------------------------------------------------------------------------------

_______________________________________________
ntfs-3g-devel mailing list
ntfs-3g-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ntfs-3g-devel
