On 6 Jan 2003, Colin Walters wrote:

> Since we will have to change programs anyways, we might as well fix
> them to decode filenames as well. The shell is kind of tempting as a
> "quick fix", but I don't think it will really help us.
Fixing programs that handle terminal input is a different matter IMHO;
it's something that should be decided on a more case-by-case basis, and
a lot of cases might be handled effortlessly just by extending
ncurses/slang. I think the philosophy should be that everything is
converted to UTF-8 after it is read from the terminal: programs that
interface with the terminal need to convert. Changing the programs that
handle terminal input is a far smaller job than changing every program
that touches argv. If this route is followed, a huge swath of programs
are half correct already; their only problem is that they will not be
converting to UTF-8 for display. That might be best handled through
glibc (again, changing *everything* just to get around the lack of
UTF-8 terminals is insane).

> > IMHO it can't work any other way. If for instance you have a
> > directory with some chinese utf-8 filenames and you do:
> >
> > ls <typed filename in latin-1> *
> >
> > The only way ls ever has a hope of working is if it expects all of
> > argv to be utf-8. Basically, I don't see any way that ls could ever
> > hope to do automatic conversion and have a program that works in
> > all cases.

> Well, let's be clear; nothing we can do will truly work in all cases.
> The vast majority of data is untagged, and charsets are not always
> reliably distinguishable. We are just trying to minimize what breaks.

Well, that's not true. At the shell level everything is tagged: the
shell knows that things returned from readdir are UTF-8 and that things
typed into the console are something else. By 'all cases' I mean the
cases that come up in a system with only UTF-8 names in the filesystem,
not one that already has mixed encodings in the filesystem; that's
hopeless.

> For the case you named above, I think what should happen is that 'ls'
> converts all the arguments to UTF-8 for internal processing.
> For the first argument, UTF-8 validation will fail, so ls will try
> converting from the locale's charset, which will work. The rest of
> the arguments will validate as UTF-8, so ls just goes on its way.

Eww, that's gross. It isn't definite that UTF-8 validation will always
fail for non-UTF-8 text; you could easily get unlucky and type in a
word that is valid UTF-8 but still needs conversion! That's a terribly
subtle UI bug.

> I don't think the shell does in all cases. Think about when arguments
> are computed dynamically.

Consider the shell to be a scripting language just like python/java and
look at how it's handled there: all internal strings are UTF-8,
functions that read/write to the terminal convert automatically, and
functions exist to convert arbitrary text/files. You have everything
needed to make the shell work uniformly in any environment. Some cases
might require an iconv, but the iconv is required for *all* users, not
just those with different locale settings. I think that's a good goal.

> Generally speaking, I think the shell should just be a conduit for
> bytes, and not modify them at all. Much like 'cat'.

The trouble is, the shell interfaces with the terminal, so it is the
only thing in a position to know how to convert characters coming from
the terminal to UTF-8; nothing else can do this.

Jason
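The false positive in the validate-then-fall-back heuristic is easy to demonstrate. Here is a minimal Python sketch of the idea; the function name `guess_decode` is hypothetical, not from ls or glibc:

```python
# Sketch of the proposed heuristic: try UTF-8 first, fall back to the
# locale charset. "guess_decode" is a made-up name for illustration.

def guess_decode(raw: bytes, locale_charset: str = "latin-1") -> str:
    try:
        return raw.decode("utf-8")          # step 1: UTF-8 validation
    except UnicodeDecodeError:
        return raw.decode(locale_charset)   # step 2: locale fallback

# Intended case: a lone Latin-1 byte 0xE9 is invalid UTF-8, so the
# fallback fires and it correctly becomes U+00E9 (e-acute).
assert guess_decode(b"\xe9") == "\u00e9"

# The subtle bug: some Latin-1 text is *also* valid UTF-8. A user who
# types the two Latin-1 characters 0xC3 0xA9 gets them silently
# reinterpreted as the single UTF-8 character U+00E9 -- no error,
# wrong answer.
assert guess_decode(b"\xc3\xa9") == "\u00e9"   # user meant two chars
```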

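The scripting-language model described above can be sketched the same way: Unicode everywhere inside (here, Python str standing in for internal UTF-8), with conversion happening exactly once at the terminal and filesystem boundaries. The helper names are made up for the sketch:

```python
# All internal strings are Unicode; only the boundary code converts.
# These helper names are hypothetical, for illustration only.

def read_from_terminal(raw: bytes, terminal_charset: str) -> str:
    # Terminal input arrives in the user's locale encoding and is
    # converted exactly once, on the way in.
    return raw.decode(terminal_charset)

def write_to_terminal(text: str, terminal_charset: str) -> bytes:
    # The only place output is converted back; on a non-UTF-8 terminal,
    # unrepresentable characters degrade here and only here.
    return text.encode(terminal_charset, errors="replace")

def to_filesystem(text: str) -> bytes:
    # Filenames on disk are always UTF-8 under this scheme.
    return text.encode("utf-8")

# A Latin-1 user types "caf\xe9" (cafe with e-acute); internally it is
# plain Unicode, and the filename written to disk is UTF-8 regardless
# of the user's locale.
name = read_from_terminal(b"caf\xe9", "latin-1")
assert to_filesystem(name) == b"caf\xc3\xa9"
```

An iconv(3) call in C plays the same role as the decode/encode pair here; the point is only that the conversion lives at the boundary, not in every program that touches argv.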
