On Thursday, April 26, 2012 17:26:40 H. S. Teoh wrote:
> Currently, std.uni code (argh the pun!!) is hand-written with tables of
> which character belongs to which class, etc.. These hand-coded tables
> are error-prone and unnecessary. For example, think of computing the
> layout width of a UTF-8 stream. Why waste time decoding into dchar, and
> then doing all sorts of table lookups to compute the width? Instead,
> treat the stream as a byte stream, with certain sequences of bytes
> evaluating to length 2, others to length 1, and yet others to length 0.
>
> A lexer engine is perfectly suited for recognizing these kinds of
> sequences with optimal speed. The only difference from a real lexer is
> that instead of spitting out tokens, it keeps a running total (layout)
> length, which is output at the end.
>
> So what we should do is to write a tool that processes Unicode.txt (the
> official table of character properties from the Unicode standard) and
> generates lexer engines that compute various Unicode properties
> (grapheme count, layout length, etc.) for each of the UTF encodings.
>
> This way, we get optimal speed for these algorithms, plus we don't need
> to manually maintain tables and stuff, we just run the tool on
> Unicode.txt each time there's a new Unicode release, and the correct
> code will be generated automatically.
That's a fantastic idea! Of course, that leaves the job of implementing it... :)

- Jonathan M Davis
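
For illustration, here is a rough, hand-written sketch of the kind of byte-level scanner the quoted proposal describes: it walks a UTF-8 string as raw bytes and accumulates a running layout width without ever decoding to dchar. The function name (layoutWidth), the byte-range branches, and the widths are placeholders picked for the example, not real Unicode data; a generated engine would derive its transitions from the Unicode data files instead.

import std.stdio;

// Sketch only: classify each UTF-8 sequence by its lead byte and add a
// placeholder column width.  A generated engine would branch on byte
// ranges derived from the Unicode data files and handle invalid input.
size_t layoutWidth(const(char)[] s)
{
    size_t width = 0;
    size_t i = 0;
    while (i < s.length)
    {
        immutable b = cast(ubyte) s[i];
        if (b < 0x80)                     // 1-byte (ASCII) sequence
        {
            width += (b < 0x20) ? 0 : 1;  // placeholder: controls take no columns
            i += 1;
        }
        else if ((b & 0xE0) == 0xC0)      // 2-byte sequence
        {
            width += 1;                   // placeholder width
            i += 2;
        }
        else if ((b & 0xF0) == 0xE0)      // 3-byte sequence
        {
            width += 2;                   // placeholder: pretend these are all wide
            i += 3;
        }
        else                              // 4-byte sequence
        {
            width += 2;                   // placeholder width
            i += 4;
        }
    }
    return width;
}

void main()
{
    // With the placeholder rules above, plain ASCII gets one column per byte.
    writeln(layoutWidth("hello"));  // prints 5
}

A real generator would emit the whole set of byte-range branches (or a transition table) from the relevant Unicode data file for the property in question, e.g. EastAsianWidth.txt for layout width, rather than hand-coding them as above.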
