[abcusers] File names [was: reusable parser]

John Chambers Wed, 28 Apr 2004 09:34:38 -0700

Martin Tarenskeen writes:
| On Tue, 27 Apr 2004, Stephen Kellett wrote:
| > John Chambers wrote:
| > >OSX presents an interesting portability challenge: The  default  file
| > >system  has "caseless" file names.  If you look around, you might not
| > >notice this, because mixed-case names abound. But the case of letters
| > >isn't significant when opening files.
| >
| > You have the same problem on Windows. Windows supports both upper and
| > lower case letters in filenames, however filename matching is case
| > insensitive.
| >
| > Try creating textfile.txt and Textfile.txt in the same directory. Can't
| > do it.
| >
| I have an Atari Falcon030 computer running the FreeMiNT operating system.
| (Never heard of ? Never mind :-) It is a sort of hybrid OS a little bit
| like OSX. It is a mix of the TOS operating system that is in the ROM of
| classic Atari computers, combined with a Unix-like multitasking OS. I have
| several partitions on my harddisk with different filesystems. On one
| partition I have a ext2 system that is really case sensitive. On drive
| C:\ I need a FAT filesystem with the old fashion 8+3 case-insensitive DOS
| file names. On another drive I have VFAT: long filenames, with upper- and
| lowercase, but not really case-sensitive.


Hey; you seem to have the worst situation of all. ;-)

One of the lessons about software engineering that I  remember  as  a
strong  point  in  several  classes  was  the  general idea that such
"policy" decisions don't properly belong in the lower levels  of  the
OS; they belong up in the "application" or "library" level.

The unix kernel's approach was often used as an example of the  right
way  to  do  it:  The  kernel  itself  treats  a  file name as just a
character string, and the only special characters are the '/' and the
terminating NUL char.  The rest are "just chars" with no meaning.  The
kernel just implements file-access mechanisms; "policy" decisions are
the responsibility of the application level.

The advantage of this is that it's easy to implement a  name-matching
policy in a library file-open routine.  Suppose you want to implement
caseless matching.  First decide on your alphabet (7-bit  ASCII  that
ignores  the 8th bit; Latin-1; ISO-8859-7, whatever) so you know what
are upper- and lower-case letters. Then your open routine first calls
the system open() routine.  If that succeeds, fine.  If not, you pass
the name to your filenamematch() routine.  It splits the name into  a
directory  part  and  a  filename  part,  does  a  readdir()  on  the
directory, runs through the list of filenames, and  applies  whatever
test  you  want  on  each  one.  When it gets a match, it returns the
matched filename to the caller, which opens that file.
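
Here's a rough sketch of that open-then-fall-back scheme, in Python
rather than C just to keep it short. The names open_caseless() and
filename_match() are made up for illustration, and the default test
(comparing lower-cased names) is only one possible policy; it assumes
you're opening an existing file for reading.

```python
import os

def filename_match(path, same=lambda a, b: a.lower() == b.lower()):
    """Return the path of a directory entry that matches under `same`, or None."""
    # Split the name into a directory part and a filename part.
    directory, name = os.path.split(path)
    directory = directory or "."
    try:
        entries = os.listdir(directory)   # the readdir() step
    except OSError:
        return None
    # Run through the list and apply whatever test the caller wants.
    for entry in entries:
        if same(entry, name):
            return os.path.join(directory, entry)
    return None

def open_caseless(path, mode="r"):
    """Try the exact name first; fall back to the matching policy."""
    try:
        return open(path, mode)
    except FileNotFoundError:
        match = filename_match(path)
        if match is None:
            raise                         # no match: report the original error
        return open(match, mode)
```

Swapping in a different `same` function changes the policy without
touching anything below the library level, which is the whole point.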

I've done this on a number of projects, and it really is  that  easy.
Well,  sometimes  you  want  to  apply  the matching to the directory
portion, too, but that's a simple recursive call.

The best example of why this is the right approach is in the  growing
problem  of  "internationalization".  We have any number of competing
character sets these days.  What's an upper- or lower-case letter  is
different in different character sets. Some alphabets don't even have
a case distinction. Some (such as German) even have letters that only
come in one case.  Others (Hebrew, Arabic) don't have case but
have letters that have several forms, and you  might  want  to  treat
variants on a letter as equal.

If your OS does the matching itself, it *will* get it wrong for most of the
possible alphabets, and there's nothing you can do to fix it.  If the
OS just says "a character is a chunk of bits  without  meaning",  and
the  meaning  is up in the runtime libraries, then it's easy to fix a
problem.  You just change the library that you're using.

Lest you think this is way off topic, I might mention that I've  been
involved  in  attempts to use non-ASCII char sets in my ABC tunes.  I
have a lot of "international folk  dance"  tunes,  and  it  would  be
really nice to be able to spell the titles right. Also, I like to use
single-tune files as my  primary  data  (with  little  programs  that
combine them for pages of tunes). It's really handy if the tune title
can be used in the file name.  I've done this on my linux box, and at
least Latin-1 names work there.  But when I rsync a directory over to
my Mac Powerbook, it goes berserk on the files with non-ASCII letters
in the names.

This tells me that OSX "isn't ready for prime  time"  in  the  coming
international world. If it can't even handle a simple 'ä' or 'ö' in a
file name, how is it ever going to handle Chinese  or  Japanese  file
names?  It can't even handle a Finnish or Arabic file name. You can't
expect those people to use English file names.  (Well, the  Finns  do
all speak English these days, but still ...  ;-)

Actually, my linux box can't handle Chinese file names  yet,  either.
But  there's  a Chinese version of linux being developed in China, as
an official computer platform for the government  and  industry.   It
will  be  able  to  do the job right.  And I'll bet it will sell well
outside of Asia.  People making "world music" collections will want a
system  like  that.   And  programmers  will appreciate a system that
doesn't force you to fit your names into an English character set.

One of their reasons for standardizing on linux was that it's an OS
that  has  no builtin rules for what's a valid file name.  So there's
very little in the kernel to undo.  Really all that's necessary is  a
safe way to handle multi-byte chars so that a 16-bit char with '/' in
one of its 8-bit halves isn't treated as a directory separator.
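
To make the '/' hazard concrete, here's a small Python check. The
helper name contains_slash_byte() is invented for this example, and
U+262F is just a convenient character whose 16-bit code happens to
have 0x2F as its low byte. It compares a 16-bit encoding with UTF-8,
where every byte of a multi-byte char has the high bit set; I'm not
claiming that's the encoding any particular system has chosen.

```python
SLASH = 0x2F  # the byte the kernel treats as a directory separator

def contains_slash_byte(text, encoding):
    """True if the encoded form of `text` contains a raw '/' byte."""
    return SLASH in text.encode(encoding)

name = "\u262f"  # one character, and it is not a '/'

# As a big-endian 16-bit char this is the bytes 0x26 0x2F, so a
# byte-oriented kernel would see a directory separator inside it.
print(contains_slash_byte(name, "utf-16-be"))

# In UTF-8 the same character is three bytes, all 0x80 or above,
# so no stray '/' can appear.
print(contains_slash_byte(name, "utf-8"))
```

The first call reports a '/' byte and the second doesn't, which is why
a byte-oriented kernel can pass UTF-8 names through unchanged.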

One question for our Scandinavian friends: Do any of  you  use  Macs?
Can  you  get  filenames  that  contain the non-ASCII letters in your
alphabet? If so, how do you make it work right? I've tried setting my
charsets  to  8859-1  and  UTF-8 and others, and none of them seem to
make the files in my .../Scand/  directory  copy  correctly  from  my
linux box.  Copying from linux to this FreeBSD system works fine,
because those systems treat a character as unanalyzed bits.  But when
copying to OSX, those files end up with gibberish names.

To subscribe/unsubscribe, point your browser to: http://www.tullochgorm.com/lists.html
