Hi Erik,

I support the basis of your proposal. I have tried to test it, but it appears to fail when the bad word does not belong to a pair.
When you have to translate an isolated 0xd800, the computation of the UTF-8 size fails. I have not yet investigated further.

Back to the original problem: I have run a test with all possible Unicode characters without finding a defect. I suspect the original system was not using a UTF-8 locale.

Also, would you mind shortening NTFS_3G_ALLOW_BROKEN_SURROGATES to ALLOW_BROKEN_SURROGATES? IMHO having long (and repeated) conditions hides the actual code...

Erik Larsson wrote:
> Hi,
>
> On 2016-04-06 19:22, Jean-Pierre André wrote:
>> Erik Larsson wrote:
>>> You are very right, but the upside is that listing the directory at
>>> least works (with the exception of the files with the bad filenames) as
>>> opposed to aborting with error as soon as a bad filename is encountered.
>>>
>>> So we are more error-tolerant with this patch... I think this is a good
>>> thing given that chkdsk doesn't appear to make any efforts at repairing
>>> this filename (it doesn't think there is any corruption on this
>>> particular volume... tested with WinXP's chkdsk and Win8's).
>>>
>>> Manufacturing a fake UTF-8 file name as a handle just to be able to
>>> access these corrupted UTF-16 filenames seems overly complex for this
>>> case... taking into account possible name collisions and such.
>>
>> I agree, this is a slippery road, and your proposal
>> will save time dealing with rare issues.
>
> I have a proposal that would enable accessing these broken files in
> ntfs-3g and the progs. The proposal involves encoding broken surrogate
> UTF-16 units into their own separate 3-byte UTF-8 sequences. This is
> sometimes referred to by the acronym WTF-8 (see:
> https://en.wikipedia.org/wiki/UTF-8#WTF-8 ).
>
> The effect is that these files aren't ignored as in the previous
> proposed patch but are included in the listing and can be looked up as
> any other file since encoding broken UTF-16 to WTF-8 and then back to
> broken UTF-16 is lossless, though the UTF-8 byte sequences returned to
> user aren't fully Unicode compliant.
> However I think this is the best we can do without starting to
> manufacture fake file names for these entries with all that complexity.
>
> Please review the attached patch.
>
> Best regards,
>
> - Erik
>
>>> On 2016-04-06 18:14, Jean-Pierre André wrote:
>>>> Hi Erik,
>>>>
>>>> Your patch will help for examining the directory, but
>>>> IMHO you will not be able to read, delete or rename
>>>> the bad file, because you will have to enter a utf8
>>>> name which will not translate to the bad Unicode for
>>>> accessing the file. Even if you use wildcards, ntfs-3g
>>>> only gets requests with utf8 names.
>>>>
>>>> When accessing the directory, you will however get the
>>>> inode number to retrieve the contents using ntfscat.
>>>>
>>>> Regards
>>>>
>>>> Jean-Pierre
>>>>
>>>> Erik Larsson wrote:
>>>>> Hi,
>>>>>
>>>>> Attached to this email is a patch which does just what I suggested...
>>>>> emitting a log message but proceeding normally and ignoring the entry
>>>>> when a bad filename is encountered during readdir. This fixes the
>>>>> problem for me.
>>>>>
>>>>> Jean-Pierre, please review and decide whether this is a good idea.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> - Erik
>>>>>
>>>>> On 2016-04-06 17:27, Erik Larsson wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I looked into this image and noticed that there are 4 filenames in
>>>>>> /WINDOWS/system32 that cannot be decoded.
>>>>>>
>>>>>> One example is the MFT entry 30661 with the filename (as UTF-16
>>>>>> units): 0xDE5C 0xDC93 0x002E 0x006C 0x006F 0x0067
>>>>>> The filename ends with '.log' but the first two UTF-16 units are where
>>>>>> Unicode decoding blows up.
>>>>>> 0xDE5C is the low value of a surrogate pair
>>>>>> according to Wikipedia (range: 0xDC00-0xDFFF). We are expecting the
>>>>>> high value (0xD800-0xDBFF) to come first.
>>>>>> It is then followed by another low value of a surrogate pair, 0xDC93.
>>>>>> This is clearly a corruption... a surrogate pair should consist of a
>>>>>> high value followed by a low value.
>>>>>>
>>>>>> I have no idea how this file was created... if Windows did this, then
>>>>>> we might need to be able to cope with such corruption better (e.g.
>>>>>> ignoring the entry during readdir and just emit a log message).
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>> On 2016-04-06 13:06, Richard W.M. Jones wrote:
>>>>>>> The reporter kindly gave me permission to distribute the metadata
>>>>>>> file. I've put it up here:
>>>>>>>
>>>>>>> http://oirase.annexia.org/tmp/bz1301593/
>>>>>>>
>>>>>>> $ md5sum ntfsclone_sda2.xz
>>>>>>> 6cadc64de3196311c8159dc12f84484c ntfsclone_sda2.xz
>>>>>>>
>>>>>>> Rich.
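
For illustration only, here is a minimal sketch of the WTF-8 idea discussed in this thread: a valid high/low pair becomes one 4-byte sequence, and any unpaired surrogate (an isolated 0xd800, or the two low values 0xDE5C 0xDC93 from MFT entry 30661) becomes its own 3-byte sequence. The names wtf8_size and wtf8_encode and the helpers are hypothetical; they are not taken from the attached patch or from ntfs-3g. The point of the sketch is that the size pass and the encode pass share one decoding step, so they cannot disagree about an isolated surrogate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not ntfs-3g functions. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Decode the unit(s) at in[i]. A high surrogate followed by a low one
 * is combined into a single code point (2 units consumed). Anything
 * else, including an unpaired surrogate, is passed through unchanged
 * (1 unit consumed).
 */
static size_t next_value(const uint16_t *in, size_t len, size_t i,
                         uint32_t *cp)
{
    uint16_t u = in[i];

    if (is_high_surrogate(u) && i + 1 < len && is_low_surrogate(in[i + 1])) {
        *cp = 0x10000 + ((uint32_t)(u - 0xD800) << 10)
                      + (uint32_t)(in[i + 1] - 0xDC00);
        return 2;
    }
    *cp = u;
    return 1;
}

/* Number of bytes needed for one value. */
static size_t encoded_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;       /* covers unpaired surrogates as well */
    return 4;
}

/* Size pass: bytes needed for the whole name, without a terminator. */
size_t wtf8_size(const uint16_t *in, size_t len)
{
    size_t i = 0, total = 0;
    uint32_t cp;

    while (i < len) {
        i += next_value(in, len, i, &cp);
        total += encoded_len(cp);
    }
    return total;
}

/*
 * Encode pass: out must have room for wtf8_size(in, len) bytes.
 * Returns the number of bytes written.
 */
size_t wtf8_encode(const uint16_t *in, size_t len, unsigned char *out)
{
    size_t i = 0, o = 0;
    uint32_t cp;

    assert(out != NULL);
    while (i < len) {
        i += next_value(in, len, i, &cp);
        switch (encoded_len(cp)) {
        case 1:
            out[o++] = (unsigned char)cp;
            break;
        case 2:
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        case 3:
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        default:
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        }
    }
    return o;
}
```

With this sketch, the six units of the bad filename quoted above need 10 bytes (two 3-byte sequences plus ".log"), and an isolated 0xd800 needs 3 bytes. Decoding back to UTF-16 is not shown.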