Hi All

I am trying to solve a nasty request transcoding bug that I found while working on CForms.

AFAICS this bug affects older versions as well: under certain circumstances, accented characters do not round-trip because of bad transcoding in Cocoon.

CForms works in one of two modes: ajax-on and ajax-off.
When ajax is on, CForms submits the form via an XMLHttpRequest (XHR); when it is off, it submits the form normally.

Servlet Requests are expected by default to be encoded using ISO-8859-1 (appalling choice!!!), but of course to get any real work done on the international web, you should use UTF-8 (now Cocoon's default, thanks to Vadim).

Browsers should post data in the encoding of the page containing the form.

Dojo always posts forms as UTF-8 when it does XHR, seemingly regardless of the page encoding. Furthermore, the post has a Content-Type header: "application/x-www-form-urlencoded; charset=UTF-8". (Default in FireFox3, can be set in Safari, unknown in MSIE).

Jetty responds properly to the Content-Type header, by automatically using that charset for decoding Request Parameters instead of the default ISO-8859-1. (behaviour of other ServletEngines unknown). This leads to a transcoding bug because Cocoon assumes ISO-8859-1.

When forms are submitted normally (i.e. non-XHR) they usually do not contain the Content-Type header (tested with FireFox3 & Safari) and it does not seem possible to set one from JavaScript (XHR has the api to do it).

So unless the user has set a different encoding for the serialisation of their forms, CForms Requests will always be in UTF-8, but the Content-Type header will not always specify this.

If the Content-Type header contains a charset, (at least in Jetty) no further transcoding should happen. If it does not contain a charset, the encoding will be default and parameters must be transcoded.
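To make that rule concrete, here is a minimal sketch of the decision, using only the header value. The class ContentTypeCharset and its methods are hypothetical names for illustration; inside a servlet you would normally just ask request.getCharacterEncoding(), which returns null when no charset was sent.

```java
final class ContentTypeCharset {

    // Returns the charset parameter of a Content-Type header value,
    // or null if the header is missing or carries no charset.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Transcoding is only needed when the container had to fall back to its default.
    static boolean needsTranscoding(String contentType) {
        return charsetOf(contentType) == null;
    }
}
```

With the XHR header quoted above, needsTranscoding returns false; with a bare "application/x-www-form-urlencoded" (or no header at all) it returns true.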

So, if the header is correctly set, Cocoon's transcoding hack (o.a.c.environment.http.HttpRequest.decode) breaks, because it assumes standard ISO-8859-1.
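The breakage can be reproduced outside Cocoon with plain JDK string handling. This is only a sketch: TranscodingDemo, containerDecode and transcode are made-up names that mimic what the container and the decode hack do, not the actual Cocoon code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodingDemo {

    // Stands in for the servlet container turning raw request bytes into a String.
    static String containerDecode(byte[] raw, Charset usedByContainer) {
        return new String(raw, usedByContainer);
    }

    // Stands in for the transcoding hack: undo the container's decoding,
    // then decode again with the form encoding.
    static String transcode(String value, Charset container, Charset form) {
        return new String(value.getBytes(container), form);
    }

    public static void main(String[] args) {
        Charset iso = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;
        byte[] posted = "caf\u00e9".getBytes(utf8); // what the browser sends

        // No charset in the header: the container falls back to ISO-8859-1,
        // so the hack is needed and restores the original text.
        String fallback = containerDecode(posted, iso);
        System.out.println(transcode(fallback, iso, utf8).equals("caf\u00e9")); // true

        // charset=UTF-8 in the header: the container already decoded correctly,
        // so applying the hack again mangles the accented character.
        String honoured = containerDecode(posted, utf8);
        System.out.println(transcode(honoured, iso, utf8).equals("caf\u00e9")); // false
    }
}
```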

Therefore it is impossible to find settings for "container-encoding" and "form-encoding" in web.xml that give correct decoding for both ajax-on and ajax-off forms from the same instance of Cocoon.

But I think I have a solution :)

I propose that the default settings in Cocoon's web.xml for "container-encoding" and "form-encoding" should be :
container-encoding : ISO-8859-1
    - meaning: my servlet container uses this as its default encoding
      (unless some modern browser tells it differently)
form-encoding : UTF-8
    - meaning: this is Cocoon's default encoding for forms

Make this change to o.a.c.environment.http.HttpEnvironment's constructor :
change :
this.request.setCharacterEncoding(defaultFormEncoding);
this.request.setContainerEncoding(containerEncoding);

to:
if (req.getCharacterEncoding() == null) { // use the value from web.xml
    this.request.setContainerEncoding(containerEncoding != null ? containerEncoding : "ISO-8859-1");
} else { // use what we have been given
    this.request.setContainerEncoding(req.getCharacterEncoding());
}
this.request.setCharacterEncoding(defaultFormEncoding != null ? defaultFormEncoding : "UTF-8");

Then cleanup o.a.c.environment.http.HttpRequest methods :

public String getParameter(String name) {
    String value = this.req.getParameter(name);
    if (!this.container_encoding.equals(this.form_encoding)) {
        value = decode(value);
    }
    return value;
}

private String decode(String str) {
    if (str == null) return null;
    try {
        byte[] bytes = str.getBytes(this.container_encoding);
        return new String(bytes, this.form_encoding);
    } catch (UnsupportedEncodingException uee) {
        throw new CascadingRuntimeException("Unsupported Encoding Exception", uee);
    }
}

public String[] getParameterValues(String name) {
    String[] values = this.req.getParameterValues(name);
    if (values == null) return null;
    if (this.container_encoding.equals(this.form_encoding)) {
        return values;
    }
    String[] decoded_values = new String[values.length];
    for (int i = 0; i < values.length; ++i) {
        decoded_values[i] = decode(values[i]);
    }
    return decoded_values;
}

So we only guess at the encoding if we really don't know what it is.
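For anyone who wants to try the idea without building Cocoon, here is a standalone sketch of the same logic using only the JDK. ProposedDecoding is a hypothetical class, not part of Cocoon; its first constructor argument stands in for whatever req.getCharacterEncoding() returns. It compares Charset objects rather than encoding names, so aliases and case differences (e.g. "utf-8" vs "UTF-8") do not trigger needless transcoding, which differs slightly from the String.equals test in the methods above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ProposedDecoding {
    private final Charset containerEncoding;
    private final Charset formEncoding;

    // reportedByContainer: value of getCharacterEncoding(), null if no charset was sent.
    // configuredContainer / configuredForm: the web.xml settings, either may be null.
    ProposedDecoding(String reportedByContainer, String configuredContainer, String configuredForm) {
        if (reportedByContainer == null) {
            // We do not know what the container used, so fall back to the configured guess.
            this.containerEncoding = configuredContainer != null
                    ? Charset.forName(configuredContainer) : StandardCharsets.ISO_8859_1;
        } else {
            // The container told us what it used, so trust it.
            this.containerEncoding = Charset.forName(reportedByContainer);
        }
        this.formEncoding = configuredForm != null
                ? Charset.forName(configuredForm) : StandardCharsets.UTF_8;
    }

    // Transcodes only when container and form encodings really differ.
    String decode(String value) {
        if (value == null || containerEncoding.equals(formEncoding)) {
            return value;
        }
        return new String(value.getBytes(containerEncoding), formEncoding);
    }
}
```

With web.xml settings of ISO-8859-1 and UTF-8, an ajax-on request (reported encoding "UTF-8") passes through untouched, while an ajax-off request (reported encoding null) is transcoded from ISO-8859-1 to UTF-8.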

My understanding is that Tomcat also returns null for getCharacterEncoding() if the default encoding is being used, but I do not know if it responds properly to a Content-Type header with a charset in it.

My guess is that browsers sending a proper Content-Type (with a charset) and/or ServletEngines responding properly to it must be a relatively recent development.

This is not tested outside of :
        MacOSX, FireFox3, Safari, Jetty

If you have got this far, and would be willing to test this in other environments, it would be most helpful.


best regards

Jeremy

