Hi Erik,

I support the basic idea of your proposal. I have
tried to test it, but it appears to fail when the bad
word does not belong to a pair.

When you have to translate an isolated 0xd800, the
computation of the UTF-8 size fails. I have not yet
investigated further.
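
A minimal sketch of a size computation that tolerates an
isolated surrogate is shown below. It is an illustration in
Python, not the ntfs-3g code: the name utf8_size and the
allow_broken flag are hypothetical. An unpaired surrogate is
counted as a 3-byte sequence and a valid pair as 4 bytes.

```python
def utf8_size(units, allow_broken=True):
    """Count the UTF-8 bytes needed for a sequence of UTF-16 code units.

    Hypothetical helper: unpaired surrogates are counted as 3 bytes
    when allow_broken is true, and rejected otherwise.
    """
    size = 0
    i = 0
    while i < len(units):
        u = units[i]
        if u < 0x80:
            size += 1
        elif u < 0x800:
            size += 2
        elif (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
              and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Valid high + low pair: one 4-byte sequence, two units consumed.
            size += 4
            i += 1
        elif 0xD800 <= u <= 0xDFFF and not allow_broken:
            raise ValueError("unpaired surrogate 0x%04X at index %d" % (u, i))
        else:
            # Ordinary BMP character, or a tolerated unpaired surrogate.
            size += 3
        i += 1
    return size
```

With this sketch, utf8_size([0xD800]) gives 3 and
utf8_size([0xD800, 0xDC00]) gives 4.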

Back to the original problem, I have run a test
with all possible Unicode characters without finding
a defect. I suspect the original system was not using
a UTF-8 locale.
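
An exhaustive test of this kind can be sketched as follows.
This is not the test that was actually run: Python's built-in
codecs stand in for the ntfs-3g conversion routines, and the
name roundtrip_failures is hypothetical. Every code point
except the surrogate range is converted from UTF-16 to UTF-8
and back, and any mismatch is recorded.

```python
def roundtrip_failures(first=0, last=0x10FFFF):
    """Return the code points that do not survive UTF-16 -> UTF-8 -> UTF-16."""
    failures = []
    for cp in range(first, last + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters on their own
        utf16 = chr(cp).encode("utf-16-le")
        utf8 = utf16.decode("utf-16-le").encode("utf-8")
        back = utf8.decode("utf-8").encode("utf-16-le")
        if back != utf16:
            failures.append(cp)
    return failures
```

An empty list from roundtrip_failures() means no character
was altered by the conversion.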

Also, would you mind shortening NTFS_3G_ALLOW_BROKEN_SURROGATES
to ALLOW_BROKEN_SURROGATES? IMHO having long (and
repeated) conditions hides the actual code...

Erik Larsson wrote:
> Hi,
>
> On 2016-04-06 19:22, Jean-Pierre André wrote:
>> Erik Larsson wrote:
>>> You are very right, but the upside is that listing the directory at
>>> least works (with the exception of the files with the bad filenames) as
>>> opposed to aborting with error as soon as a bad filename is encountered.
>>>
>>> So we are more error-tolerant with this patch... I think this is a good
>>> thing given that chkdsk doesn't appear to make any efforts at repairing
>>> this filename (it doesn't think there is any corruption on this
>>> particular volume... tested with WinXP's chkdsk and Win8's).
>>>
>>> Manufacturing a fake UTF-8 file name as a handle just to be able to
>>> access these corrupted UTF-16 filenames seems overly complex for this
>>> case... taking into account possible name collisions and such.
>>
>> I agree, this is a slippery road, and your proposal
>> will save time dealing with rare issues.
>
> I have a proposal that would enable accessing these broken files in
> ntfs-3g and the progs. The proposal involves encoding broken surrogate
> UTF-16 units into their own separate 3-byte UTF-8 sequences. This is
> sometimes referred to by the acronym WTF-8 (see:
> https://en.wikipedia.org/wiki/UTF-8#WTF-8 ).
>
> The effect is that these files aren't ignored as in the previous
> proposed patch but are included in the listing and can be looked up as
> any other file since encoding broken UTF-16 to WTF-8 and then back to
> broken UTF-16 is lossless, though the UTF-8 byte sequences returned to
> the user aren't fully Unicode compliant.
> However I think this is the best we can do without starting to
> manufacture fake file names for these entries with all that complexity.
>
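
The encoding described above can be sketched as follows. This
is an illustration in Python, not the attached patch: the names
wtf8_encode and wtf8_decode are hypothetical, and the decoder
assumes well-formed input. A valid surrogate pair becomes one
4-byte sequence, while an unpaired surrogate gets its own
3-byte sequence, so the original UTF-16 units can be recovered.

```python
def wtf8_encode(units):
    """Encode UTF-16 code units to bytes, keeping unpaired surrogates."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        if (0xD800 <= cp <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Valid pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1
        i += 1
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            # Also taken by unpaired surrogates (0xD800-0xDFFF).
            out += bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes(out)


def wtf8_decode(data):
    """Decode bytes produced by wtf8_encode back to UTF-16 code units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 1
        elif b < 0xE0:
            cp, n = b & 0x1F, 2
        elif b < 0xF0:
            cp, n = b & 0x0F, 3
        else:
            cp, n = b & 0x07, 4
        for c in data[i + 1:i + n]:
            cp = (cp << 6) | (c & 0x3F)
        i += n
        if cp >= 0x10000:
            cp -= 0x10000
            units += [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
        else:
            units.append(cp)
    return units
```

For the filename units 0xDE5C 0xDC93 0x002E 0x006C 0x006F 0x0067,
this yields the bytes ED B9 9C ED B2 93 2E 6C 6F 67, which decode
back to the same six units.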
> Please review the attached patch.
>
> Best regards,
>
> - Erik
>
>>> On 2016-04-06 18:14, Jean-Pierre André wrote:
>>>> Hi Erik,
>>>>
>>>> Your patch will help for examining the directory, but
>>>> IMHO you will not be able to read, delete or rename
>>>> the bad file, because you will have to enter a utf8
>>>> name which will not translate to the bad Unicode for
>>>> accessing the file. Even if you use wildcards, ntfs-3g
>>>> only gets requests with utf8 names.
>>>>
>>>> When accessing the directory, you will however get the
>>>> inode number to retrieve the contents using ntfscat.
>>>>
>>>> Regards
>>>>
>>>> Jean-Pierre
>>>>
>>>> Erik Larsson wrote:
>>>>> Hi,
>>>>>
>>>>> Attached to this email is a patch which does just what I suggested...
>>>>> emitting a log message but proceeding normally and ignoring the entry
>>>>> when a bad filename is encountered during readdir. This fixes the
>>>>> problem for me.
>>>>>
>>>>> Jean-Pierre, please review and decide whether this is a good idea.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> - Erik
>>>>>
>>>>> On 2016-04-06 17:27, Erik Larsson wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I looked into this image and noticed that there are 4 filenames in
>>>>>> /WINDOWS/system32 that cannot be decoded.
>>>>>>
>>>>>> One example is the MFT entry 30661 with the filename (as UTF-16
>>>>>> units): 0xDE5C 0xDC93 0x002E 0x006C 0x006F 0x0067
>>>>>> The filename ends with '.log' but the first two UTF-16 units are where
>>>>>> Unicode decoding blows up. 0xDE5C is the low value of a surrogate pair
>>>>>> according to Wikipedia (range: 0xDC00-0xDFFF). We are expecting the
>>>>>> high value (0xD800-0xDBFF) to come first.
>>>>>> It is then followed by another low value of a surrogate pair, 0xDC93.
>>>>>> This is clearly a corruption... a surrogate pair should consist of a
>>>>>> high value followed by a low value.
>>>>>>
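
The check described above can be sketched as follows. This is
an illustration in Python, not code from ntfs-3g, and the name
unpaired_surrogates is hypothetical. It returns the index of
every UTF-16 unit that is a surrogate without a proper partner.

```python
def unpaired_surrogates(units):
    """Return the indexes of surrogate units that are not part of a valid pair."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High value: valid only when a low value follows immediately.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low value with no preceding high value.
            bad.append(i)
        i += 1
    return bad
```

For the units 0xDE5C 0xDC93 0x002E 0x006C 0x006F 0x0067 it
reports indexes 0 and 1, the two leading low values.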
>>>>>> I have no idea how this file was created... if Windows did this, then
>>>>>> we might need to be able to cope with such corruption better (e.g.
>>>>>> ignoring the entry during readdir and just emitting a log message).
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>> On 2016-04-06 13:06, Richard W.M. Jones wrote:
>>>>>>> The reporter kindly gave me permission to distribute the metadata
>>>>>>> file.  I've put it up here:
>>>>>>>
>>>>>>>    http://oirase.annexia.org/tmp/bz1301593/
>>>>>>>
>>>>>>>    $ md5sum ntfsclone_sda2.xz
>>>>>>>    6cadc64de3196311c8159dc12f84484c  ntfsclone_sda2.xz
>>>>>>>
>>>>>>> Rich.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>



------------------------------------------------------------------------------
_______________________________________________
ntfs-3g-devel mailing list
ntfs-3g-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ntfs-3g-devel