On Monday 08 May 2017 15:13:28 Vladimir 'phcoder' Serbinenko wrote:
> On Mon, Apr 10, 2017, 23:17 Pali Rohár <pali.ro...@gmail.com> wrote:
> > -read_string (const grub_uint8_t *raw, grub_size_t sz, char *outbuf)
> > +read_string (const grub_uint8_t *raw, grub_size_t sz, char *outbuf, int normalize_utf8)
>
> Normalize isn't the right word. And it's not utf-8 but latin1 (called
> compressed utf-16 by udf docs).
> Are you sure you handle utf-16 case correctly? What is the expected
> behavior in those cases? Ideally you may want to just parse raw
> string in caller

Hi! I looked at the OSTA UDF spec again and found the reason for my
confusion: libblkid implements 8-bit OSTA compressed Unicode incorrectly,
and I had just tried to mimic libblkid in GRUB. libblkid handles 16-bit
OSTA compressed Unicode as UTF-16BE and 8-bit OSTA compressed Unicode as
UTF-8.

The UDF 2.01 specification says:

====
For a CompressionID of 8 or 16, the value of the CompressionID shall
specify the number of BitsPerCharacter for the d-characters defined in
the CharacterBitStream field. Each sequence of CompressionID bits in the
CharacterBitStream field shall represent an OSTA Compressed Unicode
d-character. The bits of the character being encoded shall be added to
the CharacterBitStream from most- to least-significant-bit. The bits
shall be added to the CharacterBitStream starting from the most
significant bit of the current byte being encoded into. The value of the
OSTA Compressed Unicode d-character interpreted as a Uint16 defines the
value of the corresponding d-character in the Unicode 2.0 standard.
====

So an 8-bit OSTA compressed Unicode buffer contains a sequence of Unicode
code points, one per 8 bits, which is effectively equivalent to Latin1
(ISO-8859-1) encoding. And a 16-bit OSTA compressed Unicode buffer
contains a sequence of Unicode code points, one per 16 bits, in big
endian, which is probably only UCS-2 and not full UTF-16.

So the problem is with 8-bit OSTA compressed Unicode when it contains
bytes outside the ASCII range (the only range where Latin1 and UTF-8
agree), as those bytes are decoded differently under Latin1 and UTF-8.
(Please correct me if I'm wrong here.)

For now, rather scratch/suspend this patch of mine until we decide what
to do with it, given the different/wrong implementation of string reading
in libblkid from util-linux.

--
Pali Rohár
pali.ro...@gmail.com
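
To illustrate the reading of the spec described above, here is a minimal
sketch in C of converting an OSTA compressed Unicode bit stream to UTF-8.
It is not the patch and not GRUB's read_string: the names osta_to_utf8
and put_utf8 are hypothetical, the CompressionID is assumed to be passed
in separately from the character data, and 16-bit input is assumed to be
plain UCS-2 (surrogates are not combined).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point below 0x10000 as UTF-8.
   Returns the number of bytes written (1 to 3). */
static size_t
put_utf8 (uint16_t cp, char *out)
{
  if (cp < 0x80)
    {
      out[0] = (char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (char) (0xC0 | (cp >> 6));
      out[1] = (char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (char) (0xE0 | (cp >> 12));
  out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (char) (0x80 | (cp & 0x3F));
  return 3;
}

/* Hypothetical helper (an assumption for illustration only).
   comp_id is the CompressionID (8 or 16); data/len is the
   CharacterBitStream without the CompressionID byte.
   Writes a NUL-terminated UTF-8 string to out.
   Returns the UTF-8 length, or -1 on a bad CompressionID
   or when the output buffer is too small. */
static long
osta_to_utf8 (unsigned comp_id, const uint8_t *data, size_t len,
              char *out, size_t outsz)
{
  size_t step, i, o = 0;

  if (outsz == 0)
    return -1;
  if (comp_id == 8)
    step = 1;                   /* one code point per byte (Latin1) */
  else if (comp_id == 16)
    step = 2;                   /* one code point per big-endian Uint16 */
  else
    return -1;

  for (i = 0; i + step <= len; i += step)
    {
      uint16_t cp = (step == 1)
        ? data[i]
        : (uint16_t) ((data[i] << 8) | data[i + 1]);

      /* Reserve room for up to 3 UTF-8 bytes plus the final NUL. */
      if (o + 3 >= outsz)
        return -1;
      o += put_utf8 (cp, out + o);
    }
  out[o] = '\0';
  return (long) o;
}
```

With this reading, the 8-bit byte 0xE9 and the 16-bit unit 0x00E9 both
decode to U+00E9 and come out as the UTF-8 bytes 0xC3 0xA9, whereas
treating the 8-bit buffer as UTF-8 would see a lone 0xE9 as an invalid
sequence.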
_______________________________________________ Grub-devel mailing list Grub-devel@gnu.org https://lists.gnu.org/mailman/listinfo/grub-devel