Re: Lazy man's UTF8

Michael B. Allen Wed, 18 Sep 2002 23:35:28 -0700

On Wed, 18 Sep 2002 22:26:40 +0100 (BST)
Robert de Bath <[email protected]> wrote:

> On Wed, 18 Sep 2002, Michael B. Allen wrote:
> 
> > Perhaps encdec. The interface is a little nicer for common situations.
> >
> >   http://freshmeat.net/projects/encdec/
> 
> Nice, still not as 'smooth' as I'd hoped and I don't think there's any
> support for 'character' counting as opposed to 'display cell' counting.

The encdec package is not a "unicode conversion library". The other common
misconception is to think it's a set of serialization primitives like XDR.
It can be used very effectively in that way, but that's incidental.

Encdec's real function is to pick apart arbitrary binary file formats and
network messages. I have used it extensively to decode and *encode* MS SMB,
MS Word 97, MS Structured Storage compound documents (encodes DIRENTs as RB
trees!), MS Enhanced Metafiles (EMF), TI COFF images into PalmOS PDB files,
... etc.

The point is that when doing this sort of thing you never know what you're
going to run into. MS formats in particular will have a UCS-2LE pascal-ish
string and then a cp1250 string right next to it. There might be some field
that's supposed to hold N *characters* (not bytes) encoded in some array.
Yes, this is somewhat rare, but it does happen, and in my experience it is
not very common to limit by display positions when doing this kind of work
either. At the time my reasoning was that it is safer to model the concept
of a string as a sequence of characters (see sig), and I still believe that
would be ideal if it did not incur an unacceptable performance penalty.
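
To make the character-count case concrete, here is a minimal sketch in C of
decoding one such field: a count of characters followed by that many UCS-2LE
code units, converted to UTF-8. The name dec_ucs2le_pstr, the 16-bit
little-endian length prefix and the return convention are all assumptions
made for illustration; this is not encdec's API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of encdec. Decodes a string stored as a
 * 16-bit little-endian *character* count followed by that many UCS-2LE
 * code units, writing UTF-8 plus a terminating '\0' into dst.
 * Returns the number of source bytes consumed, or 0 if src is too short
 * or dst is too small. UCS-2 has no surrogate pairs, so each code unit
 * is treated as one character. */
size_t
dec_ucs2le_pstr(const unsigned char *src, size_t sn, char *dst, size_t dn)
{
    size_t count, i, di = 0;

    assert(src != NULL && dst != NULL);

    if (sn < 2)
        return 0;
    count = (size_t)src[0] | ((size_t)src[1] << 8);
    if (sn - 2 < count * 2)          /* the limit is characters, not bytes */
        return 0;

    for (i = 0; i < count; i++) {
        const unsigned char *p = src + 2 + i * 2;
        unsigned int c = (unsigned int)p[0] | ((unsigned int)p[1] << 8);
        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : 3;

        if (dn < di + need + 1)      /* keep room for the '\0' */
            return 0;
        if (need == 1) {
            dst[di++] = (char)c;
        } else if (need == 2) {
            dst[di++] = (char)(0xC0 | (c >> 6));
            dst[di++] = (char)(0x80 | (c & 0x3F));
        } else {
            dst[di++] = (char)(0xE0 | (c >> 12));
            dst[di++] = (char)(0x80 | ((c >> 6) & 0x3F));
            dst[di++] = (char)(0x80 | (c & 0x3F));
        }
    }
    if (dn < di + 1)
        return 0;
    dst[di] = '\0';
    return 2 + count * 2;
}
```

Note that the same N characters occupy a fixed 2*N bytes in the source but
anywhere from N to 3*N bytes in the destination, which is why a byte limit
alone cannot express "N characters".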

The encdec string interface is designed to be as open-ended as possible.
You can limit by source bytes, destination bytes, and character count. You
can use -1 for all of them and stop at '\0', or use all of the limits, or
some and not others. I might change the cn limit to a pn in a future
version, but portability is more of an issue at the moment since that would
require __STDC_ISO_10646__.
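
As a rough illustration of how the three limits combine, the following C
sketch copies UTF-8 to UTF-8 under a source byte limit, a destination byte
limit and a character count, where a negative value means "no limit". The
function utf8_copy_limited, its parameter order and its return values are
hypothetical and only modelled on the description above; encdec's real
signatures may well differ.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the UTF-8 sequence introduced by lead byte b,
 * or 0 if b cannot start a sequence. */
static int
utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Hypothetical function, not encdec's API. Copies UTF-8 from src to dst
 * and stops at whichever comes first: a '\0' in src, sn source bytes,
 * dn destination bytes (including the '\0' written), or cn characters.
 * A negative limit means "no limit". A multi-byte character is never
 * split. Returns the bytes written excluding the '\0', or -1 if the
 * input is malformed or dn is 0. */
long
utf8_copy_limited(const char *src, long sn, char *dst, long dn, long cn)
{
    long si = 0, di = 0, chars = 0;

    assert(src != NULL && dst != NULL);
    if (dn == 0)
        return -1;                        /* no room for the terminator */

    for (;;) {
        int len, k;

        if (sn >= 0 && si >= sn) break;   /* source byte limit */
        if (cn >= 0 && chars >= cn) break; /* character limit */
        if (src[si] == '\0') break;       /* end of string */

        len = utf8_seq_len((unsigned char)src[si]);
        if (len == 0) { dst[di] = '\0'; return -1; }
        if (sn >= 0 && si + len > sn) break;      /* char cut by sn */
        if (dn >= 0 && di + len + 1 > dn) break;  /* char would not fit */

        /* Validate continuation bytes before copying anything. */
        for (k = 1; k < len; k++) {
            if (((unsigned char)src[si + k] & 0xC0) != 0x80) {
                dst[di] = '\0';
                return -1;
            }
        }
        for (k = 0; k < len; k++)
            dst[di + k] = src[si + k];
        si += len;
        di += len;
        chars++;
    }
    dst[di] = '\0';
    return di;
}
```

Passing -1 for sn and cn with a real dn gives the familiar "copy until '\0'
without overflowing the buffer" behaviour; supplying only cn gives the
"exactly N characters" behaviour needed for the fields described earlier. A
display-position limit would additionally need something like wcwidth(),
which is where the dependence on wchar_t holding ISO 10646 values comes in.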

Is libiconv capable of doing wchar_t, UCS-4, and UTF-8 operations on
Windows? I couldn't even build it (although I didn't try very hard).

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and more importantly to tasks that have not
yet been conceived. 
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
