Hi,
On 2016-04-06 19:22, Jean-Pierre André wrote:
Erik Larsson wrote:
You are very right, but the upside is that listing the directory at
least works (with the exception of the files with the bad filenames) as
opposed to aborting with error as soon as a bad filename is encountered.
So we are more error-tolerant with this patch... I think this is a good
thing given that chkdsk doesn't appear to make any efforts at repairing
this filename (it doesn't think there is any corruption on this
particular volume... tested with WinXP's chkdsk and Win8's).
Manufacturing a fake UTF-8 file name as a handle just to be able to
access these corrupted UTF-16 filenames seems overly complex for this
case... taking into account possible name collisions and such.
I agree, this is a slippery road, and your proposal
will save time dealing with rare issues.
I have a proposal that would enable accessing these broken files in
ntfs-3g and the progs. The proposal involves encoding broken surrogate
UTF-16 units into their own separate 3-byte UTF-8 sequences. This is
sometimes referred to by the acronym WTF-8 (see:
https://en.wikipedia.org/wiki/UTF-8#WTF-8 ).
The effect is that these files aren't ignored as in the previous
proposed patch but are included in the listing and can be looked up as
any other file since encoding broken UTF-16 to WTF-8 and then back to
broken UTF-16 is lossless, though the UTF-8 byte sequences returned to
the user aren't fully Unicode compliant.
However I think this is the best we can do without starting to
manufacture fake file names for these entries with all that complexity.
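As a rough illustration of the lossless round trip described above, here is a minimal sketch in C. The function names are made up for this example and are not taken from unistr.c; it only handles a single 16-bit unit, using the two unpaired units 0xDE5C and 0xDC93 quoted further down in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one 16-bit value (here an unpaired
 * surrogate) as a 3-byte UTF-8-style sequence. Only meaningful for
 * values in the range 0x0800..0xFFFF. */
static void encode_3byte(uint16_t c, unsigned char out[3])
{
	out[0] = (unsigned char)(0xE0 | (c >> 12));
	out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
	out[2] = (unsigned char)(0x80 | (c & 0x3F));
}

/* Hypothetical helper: decode a 3-byte sequence back to its 16-bit
 * value WITHOUT rejecting the surrogate range 0xD800..0xDFFF.
 * Returns 0 on success, -1 if the bytes are not a well-formed,
 * non-overlong 3-byte sequence. */
static int decode_3byte(const unsigned char in[3], uint16_t *c)
{
	if ((in[0] & 0xF0) != 0xE0 || (in[1] & 0xC0) != 0x80 ||
	    (in[2] & 0xC0) != 0x80)
		return -1;
	*c = (uint16_t)(((in[0] & 0x0F) << 12) | ((in[1] & 0x3F) << 6) |
			(in[2] & 0x3F));
	if (*c < 0x0800)
		return -1; /* overlong encoding */
	return 0;
}

/* Round-trip the two unpaired low surrogates from the bad filename.
 * Returns 1 if both units survive unchanged. */
int wtf8_roundtrip_demo(void)
{
	static const uint16_t units[] = { 0xDE5C, 0xDC93 };
	size_t i;

	for (i = 0; i < 2; i++) {
		unsigned char buf[3];
		uint16_t back = 0;

		encode_3byte(units[i], buf);
		assert(decode_3byte(buf, &back) == 0);
		assert(back == units[i]);
	}
	return 1;
}
```

A strict UTF-8 decoder would refuse these byte sequences, which is why the decoding side has to be relaxed as well for lookups by name to work.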
Please review the attached patch.
Best regards,
- Erik
On 2016-04-06 18:14, Jean-Pierre André wrote:
Hi Erik,
Your patch will help for examining the directory, but
IMHO you will not be able to read, delete or rename
the bad file, because you will have to enter a utf8
name which will not translate to the bad Unicode for
accessing the file. Even if you use wildcards, ntfs-3g
only gets requests with utf8 names.
When accessing the directory, you will however get the
inode number to retrieve the contents using ntfscat.
Regards
Jean-Pierre
Erik Larsson wrote:
Hi,
Attached to this email is a patch which does just what I suggested...
emitting a log message but proceeding normally and ignoring the entry
when a bad filename is encountered during readdir. This fixes the
problem for me.
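The skip-and-log idea can be sketched roughly as follows. This is not the attached patch: the entry structure, the function names and the log text are all hypothetical, and the well-formedness check stands in for whatever strict conversion the real readdir path performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical directory entry, for illustration only. */
struct entry {
	unsigned long mref;	/* MFT record number */
	const uint16_t *name;	/* UTF-16 filename */
	size_t len;		/* length in UTF-16 units */
};

/* Return 1 if every surrogate in the name is part of a high/low
 * pair, 0 otherwise (stand-in for a strict conversion failing). */
static int utf16_is_well_formed(const uint16_t *s, size_t len)
{
	int pending_high = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		int is_high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
		int is_low = s[i] >= 0xDC00 && s[i] <= 0xDFFF;

		if (pending_high) {
			if (!is_low)
				return 0;
			pending_high = 0;
		} else if (is_low) {
			return 0;
		} else if (is_high) {
			pending_high = 1;
		}
	}
	return !pending_high;
}

/* Walk the entries; log and skip bad names instead of aborting.
 * Returns the number of entries that would be listed. */
static size_t list_entries(const struct entry *e, size_t n)
{
	size_t i, listed = 0;

	for (i = 0; i < n; i++) {
		if (!utf16_is_well_formed(e[i].name, e[i].len)) {
			fprintf(stderr, "Skipping entry %lu: bad filename\n",
				e[i].mref);
			continue;
		}
		listed++;
	}
	return listed;
}

/* One good name and the bad name from MFT entry 30661. */
int skip_bad_names_demo(void)
{
	static const uint16_t good[] = { 'a', '.', 'l', 'o', 'g' };
	static const uint16_t bad[] = {
		0xDE5C, 0xDC93, 0x002E, 0x006C, 0x006F, 0x0067
	};
	const struct entry dir[] = {
		{ 100, good, 5 },	/* made-up record number */
		{ 30661, bad, 6 },
	};
	size_t listed = list_entries(dir, 2);

	assert(listed == 1);
	return (int)listed;
}
```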
Jean-Pierre, please review and decide whether this is a good idea.
Best regards,
- Erik
On 2016-04-06 17:27, Erik Larsson wrote:
Hi,
I looked into this image and noticed that there are 4 filenames in
/WINDOWS/system32 that cannot be decoded.
One example is the MFT entry 30661 with the filename (as UTF-16
units): 0xDE5C 0xDC93 0x002E 0x006C 0x006F 0x0067
The filename ends with '.log' but the first two UTF-16 units are where
Unicode decoding blows up. 0xDE5C is the low value of a surrogate pair
according to Wikipedia (range: 0xDC00-0xDFFF). We are expecting the
high value (0xD800-0xDBFF) to come first.
It is then followed by another low value of a surrogate pair, 0xDC93.
This is clearly a corruption... a surrogate pair should consist of a
high value followed by a low value.
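The pairing rule can be checked mechanically. The following is a small sketch (the function names are invented for this example and do not exist in libntfs-3g) that reports the index of the first unit violating the high-then-low rule, applied to the filename above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first unpaired surrogate in a UTF-16
 * string, or -1 if every surrogate belongs to a well-formed
 * high/low pair. */
static int first_unpaired_surrogate(const uint16_t *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		uint16_t c = s[i];

		if (c >= 0xD800 && c <= 0xDBFF) {
			/* High surrogate: a low one must follow. */
			if (i + 1 < len && s[i + 1] >= 0xDC00 &&
			    s[i + 1] <= 0xDFFF)
				i++;	/* skip the low half */
			else
				return (int)i;
		} else if (c >= 0xDC00 && c <= 0xDFFF) {
			/* Low surrogate without a preceding high one. */
			return (int)i;
		}
	}
	return -1;
}

/* Apply the check to the filename of MFT entry 30661 and to a
 * well-formed pair for comparison. */
int first_unpaired_surrogate_demo(void)
{
	static const uint16_t name[] = {
		0xDE5C, 0xDC93, 0x002E, 0x006C, 0x006F, 0x0067
	};
	static const uint16_t valid_pair[] = { 0xD83D, 0xDE00 };

	/* A proper high/low pair is accepted. */
	assert(first_unpaired_surrogate(valid_pair, 2) == -1);
	/* The very first unit of the bad name is already a lone low. */
	return first_unpaired_surrogate(name, 6);
}
```

For the name in question the check trips at index 0, since 0xDE5C is a low surrogate with nothing before it.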
I have no idea how this file was created... if Windows did this, then
we might need to be able to cope with such corruption better (e.g.
ignoring the entry during readdir and just emitting a log message).
Best regards,
- Erik
On 2016-04-06 13:06, Richard W.M. Jones wrote:
The reporter kindly gave me permission to distribute the metadata
file. I've put it up here:
http://oirase.annexia.org/tmp/bz1301593/
$ md5sum ntfsclone_sda2.xz
6cadc64de3196311c8159dc12f84484c ntfsclone_sda2.xz
Rich.
From 62783df3936d7d60d4c2629d3d6253137911054e Mon Sep 17 00:00:00 2001
From: Erik Larsson <mec...@users.sourceforge.net>
Date: Thu, 7 Apr 2016 10:51:09 +0200
Subject: [PATCH] unistr.c: Enable encoding broken UTF-16 into broken UTF-8,
A.K.A. WTF-8.
Windows filenames may contain invalid UTF-16 sequences (specifically
broken surrogate pairs), which cannot be converted to UTF-8 if we do
strict conversion.
This patch enables encoding broken UTF-16 into similarly broken UTF-8 by
encoding any surrogate character that doesn't have a match into a separate
3-byte UTF-8 sequence.
This is "sort of" valid UTF-8, but not valid Unicode since the code
points used for surrogate pair encoding are not supposed to occur in a
valid Unicode string... but on the other hand the source UTF-16 data is
also broken, so we aren't really making things any worse.
This format is sometimes referred to as WTF-8 (Wobbly Translation
Format, 8-bit encoding) and is a common solution to represent broken
UTF-16 as UTF-8.
It is a lossless round-trip conversion, i.e. converting from broken
UTF-16 to "WTF-8" and back to UTF-16 yields the same broken UTF-16
sequence. Because of this property it enables accessing these files
by filename through ntfs-3g and the ntfsprogs (e.g. ls -la works as
expected).
To disable this behavior you can pass the preprocessor/compiler flag
'-DNTFS_3G_ALLOW_BROKEN_SURROGATES=0' when building ntfs-3g.
---
libntfs-3g/unistr.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 54 insertions(+), 2 deletions(-)
diff --git a/libntfs-3g/unistr.c b/libntfs-3g/unistr.c
index 7f278cd..2009f06 100644
--- a/libntfs-3g/unistr.c
+++ b/libntfs-3g/unistr.c
@@ -61,6 +61,11 @@
#define NOREVBOM 0 /* JPA rejecting U+FFFE and U+FFFF, open to debate */
+#ifndef NTFS_3G_ALLOW_BROKEN_SURROGATES
+/* Erik allowing broken UTF-16 surrogate pairs by default, open to debate. */
+#define NTFS_3G_ALLOW_BROKEN_SURROGATES 1
+#endif /* !defined(NTFS_3G_ALLOW_BROKEN_SURROGATES) */
+
/*
* IMPORTANT
* =========
@@ -462,8 +467,22 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
if ((c >= 0xdc00) && (c < 0xe000)) {
surrog = FALSE;
count += 4;
- } else
+ } else {
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+ /* The first UTF-16 unit of a surrogate pair has
+ * a value between 0xd800 and 0xdc00. It can be
+ * encoded as an individual UTF-8 sequence if we
+ * cannot combine it with the next UTF-16 unit
+ * as a surrogate pair. */
+ surrog = FALSE;
+ count += 3;
+
+ --i;
+ continue;
+#else
goto fail;
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
+ }
} else
if (c < 0x80)
count++;
@@ -473,6 +492,10 @@ static int utf16_to_utf8_size(const ntfschar *ins, const int ins_len, int outs_l
count += 3;
else if (c < 0xdc00)
surrog = TRUE;
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+ else if (c < 0xe000)
+ count += 3;
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
#if NOREVBOM
else if ((c >= 0xe000) && (c < 0xfffe))
#else
@@ -548,8 +571,24 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
*t++ = 0x80 + ((c >> 6) & 15) + ((halfpair & 3) << 4);
*t++ = 0x80 + (c & 63);
halfpair = 0;
- } else
+ } else {
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+ /* The first UTF-16 unit of a surrogate pair has
+ * a value between 0xd800 and 0xdc00. It can be
+ * encoded as an individual UTF-8 sequence if we
+ * cannot combine it with the next UTF-16 unit
+ * as a surrogate pair. */
+ *t++ = 0xe0 | (halfpair >> 12);
+ *t++ = 0x80 | ((halfpair >> 6) & 0x3f);
+ *t++ = 0x80 | (halfpair & 0x3f);
+ halfpair = 0;
+
+ --i;
+ continue;
+#else
goto fail;
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
+ }
} else if (c < 0x80) {
*t++ = c;
} else {
@@ -562,6 +601,13 @@ static int ntfs_utf16_to_utf8(const ntfschar *ins, const int ins_len,
*t++ = 0x80 | (c & 0x3f);
} else if (c < 0xdc00)
halfpair = c;
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+ else if (c < 0xe000) {
+ *t++ = 0xe0 | (c >> 12);
+ *t++ = 0x80 | ((c >> 6) & 0x3f);
+ *t++ = 0x80 | (c & 0x3f);
+ }
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
else if (c >= 0xe000) {
*t++ = 0xe0 | (c >> 12);
*t++ = 0x80 | ((c >> 6) & 0x3f);
@@ -693,10 +739,16 @@ static int utf8_to_unicode(u32 *wc, const char *s)
/* Check valid ranges */
#if NOREVBOM
if (((*wc >= 0x800) && (*wc <= 0xD7FF))
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+ || ((*wc >= 0xD800) && (*wc <= 0xDFFF))
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
|| ((*wc >= 0xe000) && (*wc <= 0xFFFD)))
return 3;
#else
if (((*wc >= 0x800) && (*wc <= 0xD7FF))
+#if NTFS_3G_ALLOW_BROKEN_SURROGATES
+ || ((*wc >= 0xD800) && (*wc <= 0xDFFF))
+#endif /* NTFS_3G_ALLOW_BROKEN_SURROGATES */
|| ((*wc >= 0xe000) && (*wc <= 0xFFFF)))
return 3;
#endif
--
2.4.9 (Apple Git-60)
------------------------------------------------------------------------------
_______________________________________________
ntfs-3g-devel mailing list
ntfs-3g-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ntfs-3g-devel