On Tue, Feb 22, 2011 at 07:41:12PM +0100, Branko Čibej wrote:
> On 22.02.2011 18:17, Julian Foad wrote:
> >> Proposed Support Library
> >> ========================
> >>
> >> Assumptions
> >> -----------
> >>
> >> The main assumption is that we'll keep using APR for character set
> > s/character set/character encoding/.
> >
> >> conversion, meaning that the recoding solution to choose would not
> >> need to provide any other functionality than recoding.
> > s/recoding/converting between NFD and NFC UTF8 encodings/.
>
> Actually -- you have to go all the way and support complete
> normalization, even if your normalization targets are only NFC and NFD.
> That's because there isn't a sane way to detect whether a string is
> normalized or not -- "sane" in the sense that it should take about as
> long to discover that as to just normalize it.
To put it differently, the only way to figure out whether a given UTF-8 sequence is valid (or, by extension, uses NFC and/or NFD) is to parse the entire sequence.
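As a rough illustration of that point, here is a minimal Python sketch. It uses Python's standard unicodedata module purely as a stand-in, not the C/APR code the proposal is about, and the helper names is_valid_utf8 and ensure_nfc are made up for this example. Both the validity check and the normalization check have to look at the input before they can answer, so in the general case the check buys little over simply normalizing.

```python
import unicodedata


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: validity is only known after decoding everything."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def ensure_nfc(text: str) -> str:
    """Hypothetical helper: normalize unconditionally instead of checking first.

    A separate "is it already NFC?" test would have to scan the string too,
    so this sketch skips the test and just normalizes.
    """
    return unicodedata.normalize("NFC", text)


# The same name in decomposed (NFD) and precomposed (NFC) form.
nfd_name = "Branko C\u030cibej"   # 'C' followed by COMBINING CARON
nfc_name = "Branko \u010cibej"    # LATIN CAPITAL LETTER C WITH CARON

# The two spellings differ code point by code point until normalized.
same_before = nfd_name == nfc_name
same_after = ensure_nfc(nfd_name) == ensure_nfc(nfc_name)
```

Python 3.8 and later also provide unicodedata.is_normalized(form, text), which can answer quickly for some inputs but still has to examine the string in the general case.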