[power-pro] unicode multi-word code points

entropyreduction Wed, 26 Aug 2009 09:10:56 -0700

Thinking out loud:

The problem of multi-word characters comes in two parts: combining 
character sequences and surrogate pairs.


combining character sequences:

  can give a wrong length (one visible character counted as several), 
  and are one of the main ways you can end up with the same string 
  represented several ways, so would cause comparisons to fail.  
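
A small illustration of the point above, in Python rather than in the plugin's own code: the same visible character written once as a precomposed code point and once as a combining sequence. The variable names are made up for the example.

```python
# The same visible text, "é", stored two different ways.
composed = "\u00e9"      # single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# A naive comparison sees two different strings.
same_text = composed == decomposed

# Lengths in code points differ for the same visible character.
lengths = (len(composed), len(decomposed))

# So do the byte counts once the text is stored as UTF-16.
utf16_bytes = (
    len(composed.encode("utf-16-le")),
    len(decomposed.encode("utf-16-le")),
)

print(same_text, lengths, utf16_bytes)
```

Both the length and the comparison results depend on which representation happened to be stored, which is the failure described here.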

  Normalisation form C would probably eliminate most combining 
  character sequences.  I could silently normalise all unicode 
  strings before storing them, providing I could locate either
  normaliz.dll or icu.dll (both would have to be dynamically loaded,
  because both might be absent).  
   
  But that means string stored might be different from string
  user thinks they've stored.  Maybe not good.

  Or I could perhaps provide a normalise service, up to
  user to apply it or not.
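
A rough sketch of the opt-in variant, assuming Python's standard `unicodedata` module as a stand-in for whatever normaliz.dll or ICU would provide. The names `normalise` and `equal_normalised` are hypothetical, not existing plugin services.

```python
import unicodedata


def normalise(text, form="NFC"):
    """Hypothetical opt-in service: return a normalised copy of text."""
    return unicodedata.normalize(form, text)


def equal_normalised(a, b):
    """Compare two strings after normalising both to form C."""
    return normalise(a) == normalise(b)


stored = "e\u0301"                 # what the user actually stored
as_nfc = normalise(stored)         # precomposed form, stored is untouched

# NFC only helps where a precomposed character exists:
# "q" + COMBINING TILDE has none, so it stays a two code point sequence.
still_combining = normalise("q\u0303")

print(len(stored), len(as_nfc), len(still_combining))
```

Because the service returns a copy, the stored string stays exactly what the user entered, and the last case shows why form C removes most combining sequences rather than all of them.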

surrogate pairs:

  Entering them in number form not a serious problem.

  But they break current length service.  And 
  get_char/set_char, maybe comparisons.  Not at all sure
  what to do about them, except go whole hog and rewrite 
  unicode plugin using icu as infrastructure.
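
For what a surrogate-aware length and get_char might involve short of a full rewrite, here is a minimal sketch in Python working on raw UTF-16 code units. The function names are hypothetical and this is not how the current plugin services are written; it only shows the pairing logic.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of 16-bit ints."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def walk(units):
    """Yield code points, joining each valid surrogate pair into one value."""
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u          # BMP character, or an unpaired surrogate as-is
            i += 1


def code_point_length(units):
    """Length in code points rather than in 16-bit words."""
    return sum(1 for _ in walk(units))


def get_code_point(units, index):
    """Code point at a code point index (not a word index)."""
    for pos, cp in enumerate(walk(units)):
        if pos == index:
            return cp
    raise IndexError(index)


units = utf16_units("a\U0001D11Eb")   # U+1D11E needs a surrogate pair
print(len(units), code_point_length(units), hex(get_code_point(units, 1)))
```

Indexing this way is linear in the string length, which is one reason handing the whole job to ICU may still be the more attractive route.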
