All,

Yesterday, I struggled to determine why my application was behaving differently than it had been in the past, and the problem turned out to be that I had inserted a <filter> in the filter-chain before my CharacterEncodingFilter. My new <filter> was reading a request-parameter. The CharacterEncodingFilter looks like this:

    public void doFilter(ServletRequest request,
                         ServletResponse response,
                         FilterChain chain)
        throws IOException, ServletException
    {
        request.setCharacterEncoding(getCharacterEncoding(request));

        chain.doFilter(request, response);
    }

    protected String getCharacterEncoding(ServletRequest request)
    {
        String charset=request.getCharacterEncoding();

        if(null == charset)
            return this.getDefaultEncoding();
        else
            return charset;
    }

The "default encoding" is essentially always UTF-8.

When looking at the request in my servlet, the character encoding reported by the request was "UTF-8", but the actual encoding used appeared to be ISO-8859-1 (the protocol default).

It looks like Tomcat is defaulting to ISO-8859-1 but continuing to return null for request.getCharacterEncoding().
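To illustrate the symptom outside of Tomcat, here is a minimal standalone sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. The class and method names are made up for this example and are not part of Tomcat or my application.

```java
import java.nio.charset.StandardCharsets;

// Standalone illustration only; none of this is Tomcat code.
class CharsetMismatchDemo {

    // Encode the text as UTF-8 (what the client sent), then decode those
    // bytes as ISO-8859-1 (what the container used to parse the parameter).
    static String decodeUtf8AsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String sent = "caf\u00e9";               // four characters, five UTF-8 bytes
        String seen = decodeUtf8AsLatin1(sent);  // each byte becomes one character

        System.out.println(sent.equals(seen));   // false
        System.out.println(sent.length());       // 4
        System.out.println(seen.length());       // 5
    }
}
```

The two-byte UTF-8 sequence for the accented character comes back as two separate ISO-8859-1 characters, which is the kind of mangling I was seeing in the application.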

My proposal is to have Tomcat set the request encoding field to "ISO-8859-1" in the following situation:

1. The character encoding is null
2. A method is called which requires that the character encoding be "committed"

Once that charset is determined, changing the request's charset has no effect whatsoever other than to confuse the application developer.

If Tomcat were to explicitly set that encoding, my CharacterEncodingFilter wouldn't detect null and override that request charset with "UTF-8", thereby lying to the rest of the application.

I might even go so far as to propose that calling request.setCharacterEncoding() after the encoding has been committed should throw IllegalStateException. (This may be a violation of the spec, as the javadoc only declares UnsupportedEncodingException).
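A rough sketch of what that stricter behavior might look like, using a toy class rather than Tomcat's actual Request. The class name, the "committed" flag, and the commit() method are all hypothetical and exist only for this illustration.

```java
// Hypothetical model of the stricter behavior; not Tomcat's Request class.
class StrictEncodingRequest {
    private String encoding;     // null until set explicitly or committed
    private boolean committed;   // true once parameters or the reader were used

    // Stands in for any method that forces the encoding to be decided.
    void commit() {
        if (encoding == null) {
            encoding = "ISO-8859-1";   // the protocol default
        }
        committed = true;
    }

    // Rejects late calls instead of silently ignoring them.
    void setCharacterEncoding(String enc) {
        if (committed) {
            throw new IllegalStateException(
                "Character encoding already committed as " + encoding);
        }
        encoding = enc;
    }

    String getCharacterEncoding() {
        return encoding;
    }

    public static void main(String[] args) {
        StrictEncodingRequest request = new StrictEncodingRequest();
        request.commit();
        try {
            request.setCharacterEncoding("UTF-8");
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With something like this, a filter placed too late in the chain would fail loudly instead of appearing to work.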

The javadoc states:

"
Overrides the name of the character encoding used in the body of this request. This method must be called prior to reading request parameters or reading input using getReader(). *Otherwise, it has no effect.*
"

(emphasis mine)

In Tomcat, if you call setCharacterEncoding("UTF-8") after request parameters have been read, there *is* an effect: you change the value of the charset field in the request.

This is the current implementation of setCharacterEncoding:

    /**
     * Overrides the name of the character encoding used in the body of
     * this request.  This method must be called prior to reading request
     * parameters or reading input using <code>getReader()</code>.
     *
     * @param enc The character encoding to be used
     *
     * @exception UnsupportedEncodingException if the specified encoding
     *  is not supported
     *
     * @since Servlet 2.3
     */
    public void setCharacterEncoding(String enc) throws UnsupportedEncodingException {

        if (usingReader) {
            return;
        }

        // Confirm that the encoding name is valid
        Charset charset = B2CConverter.getCharset(enc);

        // Save the validated encoding
        coyoteRequest.setCharset(charset);
    }

The javadoc says that it must be called before reading any request parameters OR calling getReader() but there is only a check for the reader.

Maybe we should change the check to:

        if (usingReader || parametersParsed) {
            return;
        }

And also, change Request.parseParameters to add:

            // getCharacterEncoding() may have been overridden to search for
            // hidden form field containing request encoding
            Charset charset = getCharset();

            // Add this line, here:
            coyoteRequest.setCharset(charset);

It is at this point that the character set is truly committed, at least when parsing parameters.
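Putting the two changes together, here is a self-contained toy model of the behavior I have in mind. It is only a sketch: RequestSketch is a stand-in I made up for this email, it uses java.nio.charset.Charset.forName() in place of B2CConverter.getCharset(), and it assumes ISO-8859-1 is the fallback when nothing else has been set.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Toy stand-in for the connector Request; every name here is illustrative.
class RequestSketch {
    private Charset charset;             // null means nothing was set explicitly
    private boolean usingReader;         // never set in this sketch; kept for parity
    private boolean parametersParsed;

    public void setCharacterEncoding(String enc) throws UnsupportedEncodingException {
        // Proposed guard: ignore the call once either commit point has passed.
        if (usingReader || parametersParsed) {
            return;
        }
        try {
            charset = Charset.forName(enc);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            throw new UnsupportedEncodingException(enc);
        }
    }

    public String getCharacterEncoding() {
        return charset == null ? null : charset.name();
    }

    public void parseParameters() {
        if (parametersParsed) {
            return;
        }
        // Proposed change: record the charset actually used for parsing.
        if (charset == null) {
            charset = StandardCharsets.ISO_8859_1;
        }
        parametersParsed = true;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        RequestSketch request = new RequestSketch();
        request.parseParameters();               // an early filter reads a parameter
        request.setCharacterEncoding("UTF-8");   // the late filter call is ignored
        System.out.println(request.getCharacterEncoding());   // ISO-8859-1
    }
}
```

In this model, a filter that checks for null before applying its default would see "ISO-8859-1" and leave the request alone, so the reported encoding matches the one that was really used.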

In getReader, we have similar logic, where we set the character set for the request after determining what it actually is:

        if (coyoteRequest.getCharacterEncoding() == null) {
            // Nothing currently set explicitly.
            // Check the content
            Context context = getContext();
            if (context != null) {
                String enc = context.getRequestCharacterEncoding();
                if (enc != null) {
                    // Explicitly set the context default so it is visible to
                    // InputBuffer when creating the Reader.
                    setCharacterEncoding(enc);
                }
            }
        }

I'm open to any or all of the above, but I think something should be done. It was surprising to see that the request's charset was "UTF-8" and yet the UTF-8 bytes sent to the container were coming out mangled in the application.

If the application had seen null coming back from request.getCharacterEncoding(), that would at least have given a clue as to what was happening. Since the effective charset being used was ISO-8859-1, it would have been better to return "ISO-8859-1" instead of null, knowing that the parameters had *already* been interpreted using that character set.

Thanks,
-chris
