Hi,

> Thanks for your answers. Again, as I discovered Source-highlight very
> recently, I don't know if Unicode is an important feature for you or
> not... I read sometimes source code from Japanese or Chinese
> developers, and am French myself, so that's not unusual to store code
> or text files in Unicode (I mostly work with Visual Studio).
I would say that Unicode is an essential feature. In fact, I thought
Source-highlight was already Unicode-compliant, since this is 2010 and
it is hard to imagine an application that isn't.

> Unicode files (UTF-8 for example, which is widely used on the Internet)
> can store characters on 1 to 6 bytes. So of course it's very difficult
> to use (length() and so are difficult)

I think you are exaggerating the difficulty of dealing with
variable-length encodings such as UTF-8. In fact, almost every library
I know that deals with Unicode does so using the UTF-8 encoding. Sure,
finding the Nth character of a string is an O(n) operation instead of
O(1), but many other common operations such as strcpy() and strcat()
work exactly as they do with a fixed-length encoding: they only copy
bytes up to the terminating NUL and never need to know where one
character ends and the next begins.

> 1) First you have to know if the file is Unicode or not. They
> should have a header, described here:

Some of us use Source-highlight as a library, and therefore that
determination should be made by the main application. I suggest that
the core functions of Source-highlight be parameterised over the
encoding used. Almost everyone uses either a single-byte encoding
(thus non-Unicode) or Unicode in the form of UTF-8. There's also some
UTF-16 out there and even UTF-32 (aka UCS-4), but these are less
common. In fact, if Source-highlight were to support only single-byte
encodings and UTF-8, I would deem it Unicode-compliant.

> 2) The second thing is to convert the whole file to a "fixed bytes
> per character" format, so you can work with it. A wide char format
> (16 bits wchar) is a good choice most of the time.

Actually, a 16-bit wchar is a terrible choice, since Unicode code
points do not all fit in 16 bits: they range up to U+10FFFF, so a
fixed-width representation needs 32-bit units. Also, you don't
necessarily need to convert the whole file to a fixed-length encoding.
Why not simply work natively in UTF-8? It's really not as difficult as
you make it out to be...
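To make the point about length and indexing concrete, here is a
minimal sketch in plain C. The names utf8_length() and utf8_at() are
made up for illustration; they are not part of Source-highlight or of
any library I know, and the code assumes its input is valid,
NUL-terminated UTF-8. It relies on the fact that every continuation
byte in UTF-8 has the bit pattern 10xxxxxx, so counting characters
amounts to counting the bytes that do not match that pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: true if the byte starts a new code point,
   i.e. it is not a continuation byte of the form 10xxxxxx. */
static int utf8_is_lead(char c)
{
    return ((unsigned char)c & 0xC0) != 0x80;
}

/* Count code points (not bytes) in a valid, NUL-terminated UTF-8
   string. This is the "special length()": O(n), like strlen(). */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (utf8_is_lead(*s))
            ++count;
    }
    return count;
}

/* Return a pointer to the first byte of the n-th code point
   (0-based), or NULL if the string is shorter than that.
   This is the O(n) indexing operation mentioned above. */
const char *utf8_at(const char *s, size_t n)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (utf8_is_lead(*s)) {
            if (n == 0)
                return s;
            --n;
        }
    }
    return NULL;
}
```

Copying and concatenating need nothing special. For instance, after
strcpy(buf, "caf\xC3\xA9") followed by strcat(buf, "!"), strlen(buf)
reports 6 bytes while utf8_length(buf) reports 5 characters.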
> Don't know too much on the Linux side, but it's simply a matter of
> wcslen, wcscpy, wcscat instead of length(), strcpy(), strcat() with
> Visual Studio.

Using UTF-8, you will need a special length() function if you want to
count characters rather than bytes, but you can use the regular
strcpy() and strcat(). I don't use C++, but a quick Google search
tells me there are libraries out there that provide UTF-8 support for
C++.

Cheers,
Dario Teixeira
