I realize I got things mixed up a bit: Uconv is a program akin to Iconv. What we interface with is libicu.

Max

On 13 Jul 2018, at 22:50, Max Leske wrote:

Hi Alistair,

*nix systems usually come with the iconv[1] command line program that implements the normalization and denormalization algorithms, or Uconv [2] (from ICU [3]), a library that does the same thing. These algorithms include a lot of black magic and I recommend not getting your hands dirty with them. With the FFI interface Pharo has today it shouldn't be too hard to call out to Uconv (although I'm not saying it's trivial; I've written a VM plugin that we use at work to interface with Uconv and you do have to know how encodings and iconv work) or execute iconv directly.

I can send you a copy of the plugin code if you want. Actually, I may put it on GitHub if there's interest.

Cheers,
Max

[1] https://linux.die.net/man/1/iconv
[2] https://en.wikipedia.org/wiki/Uconv
[3] http://site.icu-project.org/


On 13 Jul 2018, at 20:22, Alistair Grant wrote:

Hi Sven,

Thanks very much for your quick reply...

On Fri, 13 Jul 2018 at 19:59, Sven Van Caekenberghe <s...@stfx.eu> wrote:

Alistair, are you aware of the following (article/codebase) ?

  
https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43

Due to the size of the full DB it is doubtful it would become a standard part of Pharo though.

Sven

I hadn't seen this.  I'll read it next (although I think it will take
me longer than 17 minutes :-)).

But a quick, partial answer is that I was planning on only supporting
the composition and decomposition tables that are already included in
the main image as part of CombinedChar (see the Composition and
Decomposition class variables).

Thanks again,
Alistair


On 13 Jul 2018, at 19:46, Alistair Grant <akgrant0...@gmail.com> wrote:

Hi Sven & Everyone,

I need to convert a UTF-8 encoded decomposed stream (Mac OS file
names) into a composed string, e.g.:

string: 'test-äöü-äöü'
code points: #(116 101 115 116 45 228 246 252 45 97 776 111 776 117 776)
utf8 encoding: #[116 101 115 116 45 195 164 195 182 195 188 45 97 204
136 111 204 136 117 204 136]

In the above string, the first group of 3 accented characters are the
same as the second group of 3, but are encoded differently - code
points (228 246 252) vs (97 776 111 776 117 776).

Reading the utf8 encoded stream should result in:

string: 'test-äöü-äöü'
code points: #(116 101 115 116 45 228 246 252 45 228 246 252)
utf8 encoding: #[116 101 115 116 45 195 164 195 182 195 188 45 195 164
195 182 195 188]
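
As a cross-check of the code points and bytes above, here is a minimal sketch in Python (not Pharo; chosen only because its standard library includes the Unicode normalization tables). It assumes that "composed" here means Unicode normalization form NFC, which the thread does not state explicitly.

```python
import unicodedata

# Decomposed input, built from the code points listed in the example above.
decomposed = "".join(map(chr, [116, 101, 115, 116, 45, 228, 246, 252, 45,
                               97, 776, 111, 776, 117, 776]))

# NFC: canonical decomposition followed by canonical composition.
# (Assumption: NFC is the composed form wanted here.)
composed = unicodedata.normalize("NFC", decomposed)

code_points = [ord(c) for c in composed]
utf8_bytes = list(composed.encode("utf-8"))

print(code_points)
print(utf8_bytes)
```

The two printed lists can be compared directly against the expected code points and utf8 encoding given above.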

My current thought is to write a ZnUnicodeComposingReadStream which
would wrap a ZnCharacterReadStream and return the composed characters.
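
The wrapping idea can be sketched roughly as follows, again in Python rather than Pharo. The name composing_reader is hypothetical and this is not the ZnUnicodeComposingReadStream itself. It assumes that buffering one base character plus the combining marks that follow it is enough, which holds for the file name example but not for compositions between two starters (e.g. Hangul jamo).

```python
import unicodedata
from typing import Iterable, Iterator


def composing_reader(chars: Iterable[str]) -> Iterator[str]:
    """Yield composed characters from a stream of already decoded characters.

    Hypothetical helper: buffers a base character and the combining marks
    that follow it, then composes the buffered run with NFC.
    Limitation: starter + starter compositions are not handled.
    """
    pending = ""
    for ch in chars:
        # Combining class 0 means ch starts a new run, so flush the old one.
        if pending and unicodedata.combining(ch) == 0:
            yield from unicodedata.normalize("NFC", pending)
            pending = ""
        pending += ch
    # Flush whatever is left at the end of the stream.
    if pending:
        yield from unicodedata.normalize("NFC", pending)


# Usage: the decomposed UTF-8 bytes from the example, decoded and then composed.
raw = bytes([116, 101, 115, 116, 45, 195, 164, 195, 182, 195, 188, 45,
             97, 204, 136, 111, 204, 136, 117, 204, 136])
result = [ord(c) for c in composing_reader(raw.decode("utf-8"))]
print(result)
```

The wrapper only needs one run of lookahead, so it can sit on top of any character stream without reading the whole input first.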

What do you think?

Thanks!
Alistair


