On 24 Jul 2008, at 09:06, Grzegorz Kossakowski wrote:
Jeremy Quinn writes:
Hi All
Hi Jeremy! :-)
Hi Grzegorz, nice to hear from you :)
I am trying to solve a nasty request transcoding bug, that I found
while working on CForms.
Join the club! Discovered character encoding problems two days ago
in a project based on Cocoon 2.1.x. Tried to fight it yesterday and
gave up.
You work with 2.1 ?? I am shocked :)
AFAICS this bug affects older versions as well ..... accented
characters do not round-trip, due to bad transcoding in Cocoon
under certain circumstances.
CForms works in one of two modes: ajax-on and ajax-off.
When ajax is on, CForms submits the form via an XMLHttpRequest
(XHR); when it is off, it submits the form normally.
Servlet Requests are expected by default to be encoded using
ISO-8859-1 (appalling choice!!!), but of course to get any real
work done on the international web, you should use UTF-8 (now
Cocoon's default, thanks to Vadim).
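For reference, here is a minimal sketch of the standard servlet-level
workaround (illustrative only, not Cocoon's actual code). Note the
encoding must be declared before the first parameter is read, otherwise
the container falls back to ISO-8859-1:

    import java.io.UnsupportedEncodingException;
    import javax.servlet.http.HttpServletRequest;

    public class EncodingExample {
        // Hypothetical helper: force UTF-8 unless the request already
        // declared a charset in its Content-Type header.
        public static String readParameter(HttpServletRequest request, String name)
                throws UnsupportedEncodingException {
            if (request.getCharacterEncoding() == null) {
                // must happen before the first getParameter() call
                request.setCharacterEncoding("UTF-8");
            }
            return request.getParameter(name);
        }
    }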
When I was looking at our code in HttpEnvironment, HttpRequest,
and MultipartParser, I started to wonder if it would be an option to
forget about all encodings apart from UTF-8. To my knowledge,
there is no serious software that does not support Unicode.
This would help us clean up and greatly simplify the code in trunk,
so it would go into the 2.3 release (don't be afraid, you won't
need to wait years for it, I promise).
The only problem is that I don't have any significant experience
with such issues so I would like to hear if my proposal makes sense.
Would it be possible to support Unicode only?
A change like this, while simplifying our codebase, could cause utter
havoc for users ..... I don't know if Unicode really is a practical
superset of every other possible encoding.
Sorry, I do not think I know enough about this either.
Browsers should post data in the encoding of the page containing
the form.
Dojo always posts forms as UTF-8 when it does XHR, seemingly
regardless of the page encoding. Furthermore, the POST has a
Content-Type header: "application/x-www-form-urlencoded;
charset=UTF-8". (This is the default in FireFox3; it can be set in
Safari; MSIE is unknown.)
Jetty responds properly to the Content-Type header, by
automatically using that charset for decoding Request Parameters
instead of the default ISO-8859-1. (The behaviour of other
ServletEngines is unknown.) This leads to a transcoding bug,
because Cocoon assumes ISO-8859-1.
I think that Jetty's behaviour is correct. Right?
It /seems/ right ....
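For clarity, this is roughly what a compliant container does, sketched
in servlet terms (illustrative, not Jetty's actual code):

    // Sketch of how a compliant container picks the charset used to
    // decode request parameters.
    private String parameterEncoding(javax.servlet.http.HttpServletRequest request) {
        // non-null only if the Content-Type header carried a charset, e.g.
        // "application/x-www-form-urlencoded; charset=UTF-8"
        String encoding = request.getCharacterEncoding();
        return encoding != null ? encoding : "ISO-8859-1"; // the spec default
    }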
When forms are submitted normally (i.e. non-XHR), the Content-Type
header usually carries no charset (tested with FireFox3 & Safari),
and it does not seem possible to set one from JavaScript (XHR has
the API to do it).
So unless the user has set a different encoding for the
serialisation of their forms, CForms Requests will always be in
UTF-8, but the Content-Type header will not always specify this.
If the Content-Type header contains a charset, (at least in Jetty)
no further transcoding should happen. If it does not contain a
charset, the container's default encoding is used and the parameters
must be transcoded.
So, if the header is correctly set, Cocoon's transcoding hack
(o.a.c.environment.http.HttpRequest.decode) breaks, because it
assumes standard ISO-8859-1.
Therefore we face a situation where it is impossible to find
settings for "container-encoding" and "form-encoding" in web.xml
that give correct decoding for both ajax-on and ajax-off forms from
the same instance of Cocoon.
But I have a solution, I think :)
I propose that the default settings in Cocoon's web.xml for
"container-encoding" and "form-encoding" should be:
container-encoding : ISO-8859-1
- meaning: my servlet container uses this as its default encoding
(unless some modern browser tells it otherwise)
form-encoding : UTF-8
- meaning: this is Cocoon's default encoding for forms
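In web.xml that would look like this (a sketch; the param names are
the ones Cocoon already uses):

    <init-param>
      <param-name>container-encoding</param-name>
      <param-value>ISO-8859-1</param-value>
    </init-param>
    <init-param>
      <param-name>form-encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>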
Make this change to o.a.c.environment.http.HttpEnvironment's
constructor:

change:

    this.request.setCharacterEncoding(defaultFormEncoding);
    this.request.setContainerEncoding(containerEncoding);

to:

    if (req.getCharacterEncoding() == null) {
        // use the value from web.xml
        this.request.setContainerEncoding(containerEncoding != null
                ? containerEncoding : "ISO-8859-1");
    } else {
        // use what we have been given
        this.request.setContainerEncoding(req.getCharacterEncoding());
    }
    this.request.setCharacterEncoding(defaultFormEncoding != null
            ? defaultFormEncoding : "UTF-8");
Then clean up the o.a.c.environment.http.HttpRequest methods:
    public String getParameter(String name) {
        String value = this.req.getParameter(name);
        if (!this.container_encoding.equals(this.form_encoding)) {
            value = decode(value);
        }
        return value;
    }

    private String decode(String str) {
        if (str == null) return null;
        try {
            byte[] bytes = str.getBytes(this.container_encoding);
            return new String(bytes, this.form_encoding);
        } catch (UnsupportedEncodingException uee) {
            throw new CascadingRuntimeException("Unsupported Encoding Exception", uee);
        }
    }

    public String[] getParameterValues(String name) {
        String[] values = this.req.getParameterValues(name);
        if (values == null) return null;
        if (this.container_encoding.equals(this.form_encoding)) {
            return values;
        }
        String[] decoded_values = new String[values.length];
        for (int i = 0; i < values.length; ++i) {
            decoded_values[i] = decode(values[i]);
        }
        return decoded_values;
    }
So we only guess at the encoding if we really don't know what it is.
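To illustrate what decode() recovers, here is a tiny round-trip example
(illustrative only, not part of the patch):

    // Illustration only: the mojibake that decode() undoes.
    static void roundTripExample() throws java.io.UnsupportedEncodingException {
        // UTF-8 bytes mis-decoded as ISO-8859-1 produce mojibake ...
        String garbled = new String("café".getBytes("UTF-8"), "ISO-8859-1"); // "cafÃ©"
        // ... and re-encoding as ISO-8859-1, then decoding as UTF-8, reverses it.
        String fixed = new String(garbled.getBytes("ISO-8859-1"), "UTF-8"); // "café"
    }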
My understanding is that Tomcat also returns null from
getCharacterEncoding() if the default encoding is being used, but I
do not know if it responds properly to a Content-Type header with a
charset in it.
My guess is that browsers sending a proper Content-Type (with
a charset), and/or ServletEngines responding properly to it, must be
a relatively recent development.
This is not tested outside of:
MacOSX, FireFox3, Safari, Jetty
If you have got this far, and would be willing to test this in
other environments, it would be most helpful.
The code responsible for all these conversions is really old, so I
guess I will need to check it again.
Before I start to test your proposal, I'll add a little bit of
complexity to your picture. You seem to have forgotten about other
data encodings like multipart/form-data. If you enable it by setting:
<form enctype="multipart/form-data" ...>
Then the browser will encode the form data using a completely
different method. As you probably guess, problems occur there as well.
Yes, I was expecting that.
Upgrading the CForms upload widget is on my long list ..... I guess
you just bumped it forward a few places :)
There is also maybe work to do in the portal .... Carsten? ;)
Our own problem with multipart/form-data is that the file names of
uploaded files are not correctly decoded. You can easily check it
using the following sample in Cocoon:
http://cocoon.zones.apache.org/demos/trunk/samples/forms/upload
(try this sample with ajax mode on and off, and with non-latin
characters both in the file name and in a text field)
There is even a bug report about this issue:
https://issues.apache.org/jira/browse/COCOON-1917
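The same re-decoding trick from HttpRequest.decode could presumably be
applied to the file name; a hypothetical sketch, assuming the parser
decoded the multipart headers as ISO-8859-1 while the browser actually
sent UTF-8:

    // Hypothetical helper: recover a file name that the multipart parser
    // decoded with the wrong charset (ISO-8859-1 in, UTF-8 intended).
    static String fixFileName(String rawName) throws java.io.UnsupportedEncodingException {
        if (rawName == null) return null;
        return new String(rawName.getBytes("ISO-8859-1"), "UTF-8");
    }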
Another interesting option would be to replace our own handling of
multipart requests with commons-upload code, see:
https://issues.apache.org/jira/browse/COCOON-1325
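For reference, a minimal sketch of what using Commons FileUpload could
look like; the setHeaderEncoding() call is the part that would address
the file-name decoding. Treat the details as assumptions, this is not a
drop-in replacement for Cocoon's MultipartParser:

    import java.util.List;
    import javax.servlet.http.HttpServletRequest;
    import org.apache.commons.fileupload.FileUploadException;
    import org.apache.commons.fileupload.disk.DiskFileItemFactory;
    import org.apache.commons.fileupload.servlet.ServletFileUpload;

    public class MultipartSketch {
        // Parse a multipart request, decoding part headers (and thus
        // file names) as UTF-8 when the request declares no charset.
        static List parseMultipart(HttpServletRequest request)
                throws FileUploadException {
            ServletFileUpload upload =
                    new ServletFileUpload(new DiskFileItemFactory());
            upload.setHeaderEncoding("UTF-8");
            return upload.parseRequest(request); // a List of FileItem objects
        }
    }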
What do you think about the last proposal?
I need a bit of time to dig into this .....
Now I'm going to test the fix you proposed...
Many thanks!
regards Jeremy