Re: [Pharo-dev] ZnUnicodeComposingReadStream?

Max Leske Fri, 13 Jul 2018 23:21:46 -0700

I realize I got things mixed up a bit: Uconv is a program akin to Iconv. What we interface with is libicu.

Max


On 13 Jul 2018, at 22:50, Max Leske wrote:

Hi Alistair,
*nix systems usually come with the iconv[1] command line program that implements the normalization and denormalization algorithms, or Uconv[2] (from ICU [3]), a library that does the same thing. These algorithms include a lot of black magic and I recommend not getting your hands dirty with them. With the FFI interface Pharo has today it shouldn't be too hard to call out to Uconv (although I'm not saying it's trivial; I've written a VM plugin that we use at work to interface with Uconv and you do have to know how encodings and iconv work) or execute iconv directly.
I can send you a copy of the plugin code if you want. Actually, I may put it on github if there's interest.
Cheers,
Max

[1] https://linux.die.net/man/1/iconv
[2] https://en.wikipedia.org/wiki/Uconv
[3] http://site.icu-project.org/


On 13 Jul 2018, at 20:22, Alistair Grant wrote:
Hi Sven,

Thanks very much for your quick reply...
On Fri, 13 Jul 2018 at 19:59, Sven Van Caekenberghe <[email protected]> wrote:
Alistair, are you aware of the following (article/codebase)?

https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43

Due to the size of the full DB it is doubtful it would become a standard part of Pharo though.

Sven

I hadn't seen this.  I'll read it next (although I think it will take
me longer than 17 minutes :-)).

But a quick, partial answer is that I was planning on only supporting
the composition and decomposition tables that are already included in
the main image as part of CombinedChar (see the Composition and
Decomposition class variables).

Thanks again,
Alistair
On 13 Jul 2018, at 19:46, Alistair Grant <[email protected]> wrote:
Hi Sven & Everyone,

I need to convert a UTF-8 encoded decomposed stream (Mac OS file
names) into a composed string, e.g.:

string: 'test-äöü-äöü'
code points: #(116 101 115 116 45 228 246 252 45 97 776 111 776 117 776)
utf8 encoding: #[116 101 115 116 45 195 164 195 182 195 188 45 97 204
136 111 204 136 117 204 136]
In the above string, the first group of 3 accented characters are the
same as the second group of 3, but are encoded differently - code
points (228 246 252) vs (97 776 111 776 117 776).

Reading the utf8 encoded stream should result in:

string: 'test-äöü-äöü'
code points: #(116 101 115 116 45 228 246 252 45 228 246 252)
utf8 encoding: #[116 101 115 116 45 195 164 195 182 195 188 45 195 164
195 182 195 188]
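
For illustration only, the expected result above can be checked outside Pharo. This is a minimal sketch in Python (not Pharo code, and not part of the proposed stream class), assuming that "composed" here means Unicode Normalization Form C (NFC); it uses the standard library's unicodedata module and the byte values from the example:

```python
import unicodedata

# UTF-8 bytes from the example: the second group of umlauts is decomposed
# (base letter followed by U+0308 COMBINING DIAERESIS, encoded as 204 136).
decomposed_bytes = bytes([116, 101, 115, 116, 45,
                          195, 164, 195, 182, 195, 188, 45,
                          97, 204, 136, 111, 204, 136, 117, 204, 136])

decomposed = decomposed_bytes.decode("utf-8")

# Assumption: the desired composed form is NFC.
composed = unicodedata.normalize("NFC", decomposed)

code_points = [ord(ch) for ch in composed]
utf8 = list(composed.encode("utf-8"))

print(code_points)
print(utf8)
```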

My current thought is to write a ZnUnicodeComposingReadStream which
would wrap a ZnCharacterReadStream and return the composed characters.
What do you think?
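
To make the wrapping idea concrete, here is a rough sketch in Python rather than Pharo. The class name ComposingReader and its next() method are hypothetical and only mirror the shape of the proposal: wrap a character stream, buffer a base character plus any combining marks that follow it, and hand back the composed result. It assumes NFC is the target form and treats any character with a non-zero canonical combining class as a combining mark, which is a simplification (it does not handle compositions between two characters of combining class 0, such as Hangul jamo).

```python
import io
import unicodedata


class ComposingReader:
    """Hypothetical wrapper: reads characters from a text stream and
    returns them one at a time in composed (NFC) form."""

    def __init__(self, stream):
        self._stream = stream
        self._pending = ""   # one character read ahead but not yet consumed
        self._out = []       # composed characters waiting to be returned

    def _read_char(self):
        # Return the read-ahead character first, if there is one.
        if self._pending:
            ch, self._pending = self._pending, ""
            return ch
        return self._stream.read(1)  # "" at end of stream

    def next(self):
        """Return the next composed character, or None at end of stream."""
        if not self._out:
            first = self._read_char()
            if first == "":
                return None
            cluster = [first]
            while True:
                ch = self._read_char()
                if ch == "":
                    break
                if unicodedata.combining(ch) == 0:
                    # Not a combining mark: keep it for the next call.
                    self._pending = ch
                    break
                cluster.append(ch)
            self._out = list(unicodedata.normalize("NFC", "".join(cluster)))
        return self._out.pop(0)


# Usage: the decomposed sample string from the message above.
source = io.StringIO("test-\u00e4\u00f6\u00fc-a\u0308o\u0308u\u0308")
reader = ComposingReader(source)
result = "".join(iter(reader.next, None))
print([ord(ch) for ch in result])
```

Buffering one cluster at a time keeps the wrapper streaming: it never needs more than one character of lookahead from the underlying stream.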

Thanks!
Alistair
