Alistair, are you aware of the following article and codebase?

https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43

Due to the size of the full DB it is doubtful it would become a standard part of Pharo though.

Sven

> On 13 Jul 2018, at 19:46, Alistair Grant <[email protected]> wrote:
>
> Hi Sven & Everyone,
>
> I need to convert an UTF8 encoded decomposed stream (Mac OS file
> names) in to a composed string, e.g.:
>
> string: 'test-äöü-äöü'
> code points: #(116 101 115 116 45 228 246 252 45 97 776 111 776 117 776)
> utf8 encoding: #[116 101 115 116 45 195 164 195 182 195 188 45 97 204
> 136 111 204 136 117 204 136]
>
> In the above string, the first group of 3 accented characters are the
> same as the second group of 3, but are encoded differently - code
> points (228 246 252) vs (97 776 111 776 117 776).
>
> Reading the utf8 encoded stream should result in:
>
> string: 'test-äöü-äöü'
> code points: #(116 101 115 116 45 228 246 252 45 228 246 252)
> utf8 encoding: #[116 101 115 116 45 195 164 195 182 195 188 45 195 164
> 195 182 195 188]
>
> My current thought is to write a ZnUnicodeComposingReadStream which
> would wrap a ZnCharacterReadStream and return the composed characters.
>
> What do you think?
>
> Thanks!
> Alistair
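
For reference, the composition asked for in the quoted message corresponds to Unicode NFC normalization. The following is a minimal sketch in Python, using the standard library's unicodedata module as a stand-in rather than the Pharo code from the article. It only checks the byte and code point values from the example above; the variable names are illustrative.

```python
import unicodedata

# UTF-8 bytes from the example: a precomposed group (195 164 ...) followed
# by a decomposed group (97 204 136 ...), as found in Mac OS file names.
raw = bytes([116, 101, 115, 116, 45, 195, 164, 195, 182, 195, 188, 45,
             97, 204, 136, 111, 204, 136, 117, 204, 136])

# Plain UTF-8 decoding keeps the combining diaeresis (776) as a separate code point.
decoded = raw.decode("utf-8")

# NFC normalization composes each base letter with its following combining mark.
composed = unicodedata.normalize("NFC", decoded)

decoded_code_points = [ord(ch) for ch in decoded]
composed_code_points = [ord(ch) for ch in composed]
composed_utf8 = list(composed.encode("utf-8"))

print(decoded_code_points)
print(composed_code_points)
print(composed_utf8)
```

This works on a whole string at once. A composing read stream would additionally have to hold back each base character until it has seen whether a combining mark follows, since the mark arrives after the base in the decomposed form.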
