Re: [Dev-luatex] Filename encoding

Philipp Maximilian Stephani Wed, 08 Jan 2014 11:16:46 -0800

On Tue Jan 07 2014 at 11:02:51, Javier Múgica de Rivera <
javieraritz.riba...@gmail.com> wrote:


> >> utf8 -> w_char's     (Provide some dummy solution for values >2^16,
> >> e.g.  c & 0xFFFF)
> >>
> >
> > Don't use the dummy solution; Windows uses UTF-16.
> >
> >
> >> w_char's -> chars via wcstombs()
> >>
> >>
> > No, Windows uses UTF-16. This step is unnecessary and harmful.
>
> You know a good deal more than me about Windows internals. I thought
> that within w_char strings on Windows each character represented
> itself.
> wcstombs() is affected by the locale settings. According to the C
>
>
Yes, that's true, but one should be careful about locale-based
conversions of file names:
1. On Unix systems file names are byte sequences. If you represent file
names as byte sequences in your application (LuaTeX does this), then you
should not try to interpret the file names in any way because doing so
might prevent applications from being able to access certain files. OTOH,
if your application uses Unicode strings (e.g. Java, Python 3), you have to
do some encoding, and that is not trivial (e.g. Python jumps through some
hoops to make sure that file names that cannot be represented in Unicode,
e.g. invalid UTF-8 strings, are handled correctly). Interpretation of file
names is generally (but not necessarily) locale-dependent; the locale on
modern installations happens to default to UTF-8, but nothing stops
applications from creating files with names that are not valid UTF-8
strings. The GLib functions fulfill these requirements and are
interface-compatible with standard C functions, which is why I recommended
them.
2. On Windows file names are sequences of 16-bit values. For all intents
and purposes these are interpreted as UTF-16 strings, but in fact they can
be invalid (unpaired surrogates) as well. If your application uses byte
strings to store file names, it should use UTF-8 for them so that it can
access files with arbitrary but valid Unicode names. If your application
uses Unicode strings, they have to be converted to UTF-16. Invalid UTF-16
file names can be handled if your application stores code point sequences
instead of scalar value sequences. (Many libraries don't check this.)
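
To make the two cases above concrete, here is a minimal Python sketch. It is
only an illustration of the data-model issue, not LuaTeX or GLib code; the
file names are made up, and "surrogateescape" and "surrogatepass" are
Python's own error handlers, used here as one possible way to keep invalid
names round-trippable.

```python
# Case 1: a Unix file name that is not valid UTF-8 (it contains a lone 0xFF byte).
raw_name = b"report-\xff.tex"  # hypothetical name

# "surrogateescape" maps each undecodable byte to a lone surrogate in
# U+DC80..U+DCFF, so the original byte sequence can be recovered exactly.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

# Case 2: a Windows file name that contains an unpaired high surrogate.
win_name = "bad\ud800name"  # hypothetical name

# A strict UTF-16 encoder works on scalar values and rejects the surrogate.
try:
    win_name.encode("utf-16-le")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" treats the string as a code point sequence and keeps it.
units = win_name.encode("utf-16-le", errors="surrogatepass")

print(repr(as_text), round_tripped == raw_name, strict_ok, len(units) // 2)
```

The point is only that a Unicode-string data model needs some escape
mechanism like this, whereas a byte-string data model on Unix needs none.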


> > The opposite is true: Windows never uses locale information for filenames
> > (it always uses UTF-16 de facto), but the locale is used on Linux.
>
> I supposed it was used BOTH on Windows and Linux, but that on Linux it
> was never necessary due to it using UTF-8 naturally. I had noticed
> that the C run-time docs from Visual Studio do not mention anything
> about encodings for fopen or other filename-related functions, but
> supposed the char* filename was interpreted according to the locale
> settings.


This is true. Note the difference: it's the OS interface, not the OS
kernel, that does this interpretation. This interpretation is included for
compatibility with Windows 9x, but is broken because it doesn't provide
access to arbitrary Unicode strings. It should be avoided.
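
A small sketch of why a locale-based char* interface cannot reach every
file: a name only survives the conversion if every character exists in the
active code page. The code pages and names below are assumptions chosen for
illustration, and the real Windows conversion may substitute characters
instead of failing; the sketch only shows that no exact representation
exists.

```python
def representable(name, codepage):
    """Return True if every character of name exists in the given code page."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

# cp1252 (Western European) and cp1253 (Greek) stand in for a system code page.
print(representable("Múgica.tex", "cp1252"))      # Latin letters with accents fit
print(representable("Ω-notes.tex", "cp1252"))     # Greek omega does not
print(representable("Ω-notes.tex", "cp1253"))     # but it fits the Greek page
print(representable("Ω-notes.tex", "utf-16-le"))  # the wide interface always works
```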


> But... given that the string has to be passed to UTF-16 it
> must be interpreted somehow, isn't it?
>
>
Interpreted by whom? AFAIK neither the Windows nor the Linux kernels
interpret file names in any encoding-related way. (OS X might be more
complex because of its Mach/XNU/BSD/Carbon/Cocoa layering.) The difference
is that they use code units of different widths (8 bits vs. 16 bits).
Applications do have to interpret these strings in some meaningful way, and
in order to do so, they need to decide whether to use byte strings or
Unicode strings as their data model.
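
To illustrate the different code unit widths, and why the "c & 0xFFFF"
dummy solution quoted at the top is harmful, here is a short Python sketch.
The helper name utf16_units is made up for this example.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a character above U+FFFF

utf8_units = list(ch.encode("utf-8"))  # 8-bit code units (Unix file names)
utf16 = utf16_units(ch)                # 16-bit code units (Windows file names)
truncated = ord(ch) & 0xFFFF           # the "dummy solution"

print([hex(u) for u in utf8_units])    # four bytes
print([hex(u) for u in utf16])         # a surrogate pair
print(hex(truncated))                  # a single unit naming an unrelated character
```

The truncated value 0xD11E is a valid but completely different character,
so the resulting file name would silently refer to another file.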

_______________________________________________
dev-luatex mailing list
dev-luatex@ntg.nl
http://www.ntg.nl/mailman/listinfo/dev-luatex
