On 14/12/2008, at 4:48 PM, Paul Davis wrote:
I have to say I always kind of assumed that most filesystems only
allowed Latin based characters in the name. I got interested so I
asked the guys in the IRC channel about non-Latin characters in
filenames and someone actually just created a file on ext3 with
Japanese characters and everything worked fine.
Someone pasted this link:
http://en.wikipedia.org/wiki/Comparison_of_file_systems
Reading the table it appears that the biggest concern about filenames
is including a NULL byte.
Perhaps we're overthinking this whole thing? Maybe we can just write
filenames with weird characters and the sysadmins have to muck around
with what happens when they have a design doc with weird characters?
Filesystems may allow UTF-8, but they still assign meaning to some
characters and/or sequences e.g. '/', '\', '..', sometimes ':', so you
have to worry about it. Using a folding solution introduces collision
possibilities, so you then have to deal with that - case insensitivity
being the most egregious example of folding (and a personal hatred).
Some characters are universally annoying to deal with at the command
line. Leading '-' and spaces being obvious examples.
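To make the collision point concrete, here is a minimal sketch. The
fold() function is purely hypothetical (it is not what CouchDB does): it
replaces '/', '\' and ':' with '_' and then case-folds, which is roughly
what a case-insensitive filesystem plus a naive escaping scheme would
amount to.

```python
from collections import defaultdict

def fold(name: str) -> str:
    """Hypothetical folding: map '/', '\\' and ':' to '_', then case-fold."""
    replaced = "".join("_" if ch in "/\\:" else ch for ch in name)
    return replaced.casefold()

def collisions(names):
    """Group names by folded form and keep only groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[fold(name)].append(name)
    return {folded: members for folded, members in groups.items() if len(members) > 1}

if __name__ == "__main__":
    # 'Users'/'users' collide through case folding, 'a/b'/'a_b' through replacement.
    print(collisions(["Users", "users", "a/b", "a_b", "orders"]))
```

Any folding that maps two legal names to one filename needs some extra
disambiguation, which is where a hash suffix comes in.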
There is no solution that doesn't involve some decision and/or code.
I'd support any solution that isn't ASCII/English/Roman-centric.
However - not allowing '/' removes the principal hierarchy indicator,
which is annoying. And placing a constraint on the document id of a
design document seems wrong, because the '_design/' prefix is a
filter, rather than a constraint, and IMO the rules covering document
ids should be uniform so that one can treat all documents identically
under transformation.
I like one directory per db, although an argument could be made that
in the current scheme, one directory is canonical data, and the other
is derived (effectively a cache).
Anyway, I could use my current solution and use a slug in which { non-
printing, space, /, \, ., leading -, : } are removed or folded to e.g.
_, suffixed with the hex MD5. How would this be? I could eliminate the
MD5 if the slug is the same as the name under case folding, which
would result in many filenames being identical to the name. It would
still have a directory structure as per my previous email, and in
particular the 'name' file would remain, because it's needed to
implement all_databases with transformed names, and it allows
completely general scripting, albeit not as simple as filename globbing.
Given I've done the work to allow a full solution, and adding the slug
isn't hard ... ?
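A rough sketch of the slug idea, in Python rather than the real code. The
names slug() and filename_for(), the '-' separator before the hash, and
the reading of "same as the name under case folding" as "the case-folded
slug equals the original name" are all my assumptions for illustration.

```python
import hashlib

# Characters folded to '_' wherever they appear (assumed set, taken from the list above).
UNSAFE = set("/\\.:")

def slug(name: str) -> str:
    """Fold non-printing characters, whitespace, '/', '\\', '.', ':' and a leading '-' to '_'."""
    out = []
    for ch in name:
        if ch in UNSAFE or ch.isspace() or not ch.isprintable():
            out.append("_")
        else:
            out.append(ch)
    if out and out[0] == "-":
        out[0] = "_"
    return "".join(out).casefold()

def filename_for(name: str) -> str:
    """Return the name itself when slugging changes nothing, else slug plus hex MD5."""
    s = slug(name)
    if name and s == name:
        return name
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{s}-{digest}"

if __name__ == "__main__":
    for n in ["users", "Users", "-my db/2008", "日本語"]:
        print(repr(n), "->", filename_for(n))
```

With this reading, a name that is already lowercase and free of the
special characters maps to itself, and everything else keeps a readable
prefix with the hash of the original name as a disambiguator.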
Alternatively, a workable lexical constraint would be: printable
Unicode - { Unicode uppercase, /, \, : } and !empty and !'..', but
obviously I'm not keen on that.
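For comparison, that constraint is only a few lines as a validator. This
is a sketch with a hypothetical is_valid_name(); it leans on Python's
str.isprintable() and str.isupper() as stand-ins for "printable Unicode"
and "Unicode uppercase", which may not match the exact character classes
one would want.

```python
FORBIDDEN = set("/\\:")

def is_valid_name(name: str) -> bool:
    """Check: non-empty, not '..', all printable, no uppercase, none of '/', '\\', ':'."""
    if not name or name == "..":
        return False
    return all(
        ch.isprintable() and not ch.isupper() and ch not in FORBIDDEN
        for ch in name
    )

if __name__ == "__main__":
    for candidate in ["users", "Users", "a/b", "..", "", "日本語"]:
        print(repr(candidate), is_valid_name(candidate))
```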
BTW: the technical problem remaining with the solution I published is
that Futon's JavaScript collapses the design document id with the
view name within that document and treats it as a compound entity, so
encoding/splitting etc. is thus more complicated. I've fixed some of
that, but work remains.
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
The intuitive mind is a sacred gift and the rational mind is a
faithful servant. We have created a society that honours the servant
and has forgotten the gift.
-- Albert Einstein