As a bit of background, I am incorporating a somewhat customized version of Apache 2.0.x into a software product here. Our product installer allows the user to specify an install-to directory. On foreign-language systems (e.g. Japanese) this could be a directory name with multibyte characters in it.
Our I18N testing found problems when our Apache is installed into a directory whose name contains certain "bad" double-byte characters. For instance, suppose the root of the Apache tree is in a directory whose name ends in the Japanese character whose code is 0x835c (I won't try to enter the character here, since I'm writing this email on an English-language OS). In this case the httpd.conf file will contain a directive ServerRoot dirname, where dirname is the Japanese string ending in this character (and other directives in the config file will contain this string as well). Now, 0x5c is the ASCII code for '\', but in this case it is not a backslash; it is the second byte of a 2-byte Japanese character code. What happens in this situation is that Apache complains at startup with a message to the effect that ServerRoot must be a valid directory, and it fails to start.
Digging into the source code, I find a couple of underlying problems that account for this. The filename-parsing code in srclib/apr/file_io/win32/filepath.c scans filenames byte by byte looking for special characters like '/' or '\', but the code is not cognizant of multibyte characters. So it can be fooled into thinking a byte is a '\' separator when it is really the second byte of a multibyte character in the filename. Even if I correct this, I run into problems elsewhere. For example, the code in server/util.c which helps parse config files also has byte-by-byte processing which is not multibyte-aware. So it can be fooled into thinking the second byte of a 2-byte character is a line-continuation character, if that second byte is 0x5c.
There may be other places in Apache's source code that have internationalization problems as well; the above are just a couple I have found so far. What do folks think? Is this too much of an edge case to care about, or is it worth fixing? Obviously Apache handles multibyte *content* OK.
But it does seem to have some problems dealing with multibyte directory names and with multibyte characters in the config file. Is this a known issue? Is this something anyone has plans to fix? If I fix it, would apache.org be interested in taking back the above-mentioned source files with the fixes?
Thanks,
Rich Title
Rational Software
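To make the two failure modes concrete, here is a minimal sketch in portable C. It is not the actual code from filepath.c or util.c; all function names are hypothetical, and it assumes the path is encoded in Shift-JIS (code page 932), whose lead bytes fall in 0x81-0x9F and 0xE0-0xFC. A real fix on Windows would more likely ask the OS for the lead-byte ranges of the active code page than hard-code them.

```c
#include <stddef.h>

/* Assumption: Shift-JIS (code page 932) lead-byte ranges, hard-coded
 * here only to keep the sketch self-contained. */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive scan: every '/' or 0x5c byte counts as a separator, so the
 * trail byte of a character such as 0x835c is mistaken for '\'. */
static const char *last_sep_naive(const char *path)
{
    const char *last = NULL;
    const char *p;
    for (p = path; *p != '\0'; ++p) {
        if (*p == '/' || *p == '\\')
            last = p;
    }
    return last;
}

/* Multibyte-aware scan: a lead byte and its trail byte are skipped
 * as one unit, so the trail byte is never tested as a separator. */
static const char *last_sep_mbcs(const char *path)
{
    const char *last = NULL;
    const char *p = path;
    while (*p != '\0') {
        if (sjis_is_lead_byte((unsigned char)*p) && p[1] != '\0') {
            p += 2;
            continue;
        }
        if (*p == '/' || *p == '\\')
            last = p;
        ++p;
    }
    return last;
}

/* Same idea for config lines: returns 1 only if the final byte of the
 * line is a genuine single-byte '\'. Escaped backslashes and other
 * details of real config parsing are deliberately not modeled. */
static int line_is_continued_mbcs(const char *line)
{
    const char *p = line;
    int continued = 0;
    while (*p != '\0') {
        if (sjis_is_lead_byte((unsigned char)*p) && p[1] != '\0') {
            continued = 0;
            p += 2;
            continue;
        }
        continued = (*p == '\\');
        ++p;
    }
    return continued;
}
```

With a path whose last character is 0x835c, the naive scan reports the final byte as a separator, while the multibyte-aware scan reports the backslash before the directory name. The continuation check likewise treats a ServerRoot line ending in that character as complete rather than continued.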
