On 8/14/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:

On 14.08.2006, at 22:43, Stephen Deasey wrote:

>
> * Your clients don't understand utf-8, and you serve content in
> multiple languages which don't share a common character set.  Sucks to
> be you.
>

I think the whole purpose of that encoding mess is this above.
With time this will be less and less important, so how much
should we really care?
From the technical perspective, it is nice to have a universal
and general solution, but from the practical side: it costs
time and money to keep it around...


I agree. I was wondering if we should junk the whole mess, but I think
we can minimise the impact without losing the ability to support
multiple encodings, and in fact improve the support.


>
> I've been working on some patches to fix the various bugs so don't
> worry about it too much.  But I'd appreciate feedback on how you
> actually use the encoding support.

I use it this way: leave everything as-is. I never had to
tweak any of the existing encoding knobs. And I never had
anybody complaining. And we do serve Japanese, Chinese and
European languages. All right, the client is always either IE
or Mozilla or Safari (prerequisite) so mine is perhaps not a
good example.


In the documents you serve, do you specify the encoding *within* the
document, at the top of the HTML file for example? Or are you serving
XML, in which case the default for that is utf-8 anyway (I think, off
the top of my head...).

Another possibility is that you happen to be using browsers which are
smart enough to reparse a document if it doesn't happen to be in the
encoding it first expected.  I think the big guys do this -- not sure
your mobile phone will be so forgiving.


Apropos chunked encoding: I still believe that the vectorized
IO is OK and the way you transform UTF8 on the fly is also OK.
So, if any content encoding has to take place, you can really
only do it with the chunked encoding OR by converting the whole
content in memory prior to sending it, and giving the correct content
length OR by just omitting the content length altogether.
I do not think there are other options.

I'm curious what you will come up with ;-)


I'll handle the IO stuff in a separate post. Here's something I wrote
up re encodings and such:





(This applies to case 3: supporting multiple encodings)


I agree with Zoran. ns_conn encoding should be the way to change the
encoding (input or output) at runtime.

The mime-type header sent back to the client does need to reflect the
encoding used, but ns_conn encoding should drive that, not the other
way around.

We can check the mime-type header for a charset declaration, and if
it's not there, add one for the current value of ns_conn encoding.
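
Roughly what I have in mind, as a Tcl sketch of the logic (the real
check would live in the C response code; note that [ns_conn encoding]
currently gives back the Tcl encoding name rather than the charset,
which is exactly the mismatch described next):

  proc add_charset {mimetype} {
      # only text types get a charset, and only if one isn't already there
      if {[string match "text/*" $mimetype]
              && ![string match "*charset=*" $mimetype]} {
          append mimetype "; charset=[ns_conn encoding]"
      }
      return $mimetype
  }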

One problem to be resolved here is that Tcl encoding names do not
match up with HTTP charset names. HTTP talks about iso-8859-1, while
Tcl talks about iso8859-1. There are lookup routines to convert HTTP
charset names to Tcl encoding names, but not the other way around.
Tcl_GetEncodingName() returns the Tcl name for an encoding, not the
charset alias we used to get the encoding.
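
A few examples of the mismatch, just to make it concrete (the real
alias table lives in nsd/encodings.c and the config):

  # HTTP charset name    Tcl encoding name
  # -----------------    -----------------
  # iso-8859-1           iso8859-1
  # shift_jis            shiftjis
  # us-ascii             ascii
  # utf-8                utf-8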

We could store the charset, as well as the encoding, for the conn. But
I was wondering: could we junk all the alias stuff and, in the
Naviserver install process, create a directory for encoding files and
fill it with symlinks to the real Tcl encoding files, using the
charset name?
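
Something like this at install time (just a sketch; the directory
variables are made up, and it assumes Tcl is happy to load an encoding
file through a symlink):

  # $ns_encdir and $tcl_encdir are hypothetical install-time paths
  file mkdir $ns_encdir
  foreach {charset tclname} {
      iso-8859-1  iso8859-1
      iso-8859-2  iso8859-2
      shift_jis   shiftjis
  } {
      file link -symbolic \
          [file join $ns_encdir $charset.enc] \
          [file join $tcl_encdir $tclname.enc]
  }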

You call ns_conn encoding with a charset. Naviserver converts the
charset name to a Tcl encoding name. The return value is the name of
the encoding, which is *not* the name of the charset you passed in! I
don't know if that's intended, but it's really confusing.
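
For example, as it behaves today, you pass a charset in and get the
Tcl encoding name back:

  % ns_conn encoding iso-8859-1
  iso8859-1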

Another place this trips up: In the config for the tests Michael added:

 ns_section "ns/mimetypes"
 ns_param   .utf2utf_adp  "text/plain; charset=utf-8"
 ns_param   .iso2iso_adp  "text/plain; charset=iso-8859-1"

 ns_section "ns/encodings"
 ns_param   .utf2utf_adp  "utf-8"
 ns_param   .iso2iso_adp  "iso-8859-1"

The ns/encodings are the encoding to use to read an ADP file from
disk, according to extension. It solves the problem where the web
designer's editor doesn't support utf-8.  (I wonder if this is still
valid anymore?)

But, the code is actually expecting Tcl encoding names here, not a
charset, so this config is busted. It doesn't show up in the tests
because the only alternative encoding we're using is iso-8859-1, which
also happens to be the default.

This is probably just a bug. The code uses Ns_GetEncoding() when it
should use Ns_GetCharsetEncoding(). But that highlights another bug:
when would you ever want to call Ns_GetEncoding()? You always want to
take into account the charset aliases we carefully set up. This
probably shouldn't be a public function.


The strategy of driving the encoding from the mime-type has some other
problems.  You have to create a whole bunch of fake mime-types /
extension mappings just to support multiple encodings (the
ns/mimetypes above).

What if there is no extension? Or you want to keep the .adp (or
whatever) extension, but serve content in different encodings from
different parts of the URL tree? Currently you have to put code in
each ADP to set the mime-type (which is always the same) explicitly,
to set the charset as a side effect.
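
In other words, at the top of every page you end up with something
like this (ns_adp_mimetype being the usual way to poke it from an ADP,
if I remember right):

  <%
    # the mime-type never changes; we only set it to force the charset
    ns_adp_mimetype "text/html; charset=iso-8859-2"
  %>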

AOLserver 4.5 has an ns_register_encoding command, which is perhaps an
improvement on plain file extensions.

Both AOLserver and our current code base have the bug where the
character set is only set for mime-types of type text/*.  This makes a
certain amount of sense -- you don't want to be text encoding a
dynamically generated gif, for example.

However, the correct mime-type for XHTML is application/xhtml+xml. So
in this case, an ADP which generates XHTML will not have the correct
encoding applied if you're relying on the mime-type.


Here's another oddity: Ns_ConnSetWriteEncodedFlag() / ns_conn
write_encoded. This is yet another way to sort-of set the encoding,
which is essentially a single property.

The only code which uses this is the ns_write command, and I think it
has it backwards. By default, if Ns_ConnGetWriteEncodedFlag() returns
false, ns_write assumes it's writing binary data. But nothing actually
sets this flag, so ns_write doesn't encode text at all.

We should remove the WRITE_ENCODED stuff.

How do we handle binary data from Tcl anyway? There's a -binary switch
to ns_return, and the write_encoded flag for ns_write.  I was
wondering if we could just check the type of the Tcl object passed in
to any of the ns_return-like functions to see if it's type
"bytearray".  A byte array *could* get shimmered to a string, and then
back again without data losss, but that's probably unlikely in
practice.
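
The idea in Tcl terms (a sketch; the actual check would happen in C on
the object's type pointer):

  set png [binary format H* 89504e470d0a1a0a]  ;# starts life as a bytearray
  ns_return 200 image/png $png                 ;# under the proposal: sent raw

  # but other operations can quietly shimmer the internal rep away:
  llength $png   ;# now a list internally, so a bytearray check misses it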


There's also the problem of input encodings.  If you're supporting
multiple encodings, how do you know what encoding the query data is
in?  A couple of solutions suggested in Rob Mayoff's guide are to put
this in a hidden form field, or to put it in a cookie.

Here's an interesting bug: You need to get the character set a form
was encoded in, so you call ns_queryget ...  This first invokes the
legacy ns_getform call which, among other things, pulls any file upload
data out of the content stream and puts it into temp files.

Now, you have to assume *some* encoding to get at the query data in
the first place. So let's guess and say utf-8.  Uh oh, our hidden form
field says iso-8859-2. OK, so we call ns_conn encoding iso-8859-2 to
reset the encoding, and this call flushes the query data which was
previously decoded using utf-8.
It also flushes our uploaded files. The kicker here is that uploaded
files aren't even decoded using a text encoding, so when the query data
is decoded again, this time using iso-8859-2, the uploaded files will
be exactly the same as they were before.
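
In code the sequence looks roughly like this (ns_queryget and ns_conn
encoding are the real calls; the flushing behaviour is what happens
today):

  set charset [ns_queryget charset]  ;# parses the query with the default
                                     ;# encoding, spools uploads to temp files
  ns_conn encoding $charset          ;# flushes the cached query data *and*
                                     ;# the uploaded files, which never needed
                                     ;# re-decoding in the first place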



I'm sure there's some more stuff I'm forgetting. Anyway, here's how I
think it should be:

* utf-8 by default
* mime-types are just mime-types
* always hack the mime-type for text data to add the charset
* text is anything sent via Ns_ConnReturnCharData()
* binary is a Tcl bytearray object
* static files are served as-is, text or binary
* multiple encodings are handled via calling ns_conn encoding
* folks need to do this manually. no more file extension magic


I think a nice way for folks to handle multiple encodings is to
register a filter, which you can of course use to simulate the file
extension scheme in place now, the AOLserver 4.5 ns_register_encoding
stuff, and more, because it's a filter. You can also do things like
check query data or cookies for the charset to use.
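
For example (the proc name and URL are made up; preauth so it runs
before anything parses the query data):

  proc set_jp_encoding {why} {
      # everything under /jp/ is served and parsed as shift_jis
      ns_conn encoding shift_jis
      return filter_ok
  }
  ns_register_filter preauth GET /jp/* set_jp_encoding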


Questions that need answering:

* can we junk charset aliases in nsd/encodings.c and use a dir of symlinks?
* can we junk ns/encodings in 2006?
* is checking for bytearray type a good way to handle binary Tcl objects?
* does the above scheme handle all the requirements?


Bugs to fix:

* query data flushing is too extreme. don't flush files
* junk Ns_Conn*WriteEncodedFlag() / ns_conn write_encoded



There's also the content-length bug, but I think that's a separate
problem. I'm going to look more into that next, as I wrote it, so if
anyone else wants to tackle any of the above because they need it soon,
go ahead. If not, I'll do that next.
