[PROPOSAL] Explicitly-set the request character encoding when it has been committed

Christopher Schultz Thu, 01 Apr 2021 09:09:11 -0700

All,

Yesterday, I struggled to determine why my application was behaving differently than it had been in the past, and the problem turned out to be that I had inserted a <filter> in the filter-chain before my CharacterEncodingFilter. My new <filter> was reading a request-parameter. The CharacterEncodingFilter looks like this:


    public void doFilter(ServletRequest request,
                         ServletResponse response,
                         FilterChain chain)
        throws IOException, ServletException
    {
        request.setCharacterEncoding(getCharacterEncoding(request));

        chain.doFilter(request, response);
    }

    protected String getCharacterEncoding(ServletRequest request)
    {
        String charset=request.getCharacterEncoding();

        if(null == charset)
            return this.getDefaultEncoding();
        else
            return charset;
    }

The "default encoding" is essentially always UTF-8.

When looking at the request in my servlet, the character encoding reported by the request was "UTF-8", but the actual encoding used appeared to be ISO-8859-1 (the protocol default).

It looks like Tomcat is defaulting to ISO-8859-1 but continuing to return null for request.getCharacterEncoding().
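To make the symptom concrete, here is a minimal sketch of what the mismatch does to the data. It uses only the JDK; the class and method names are made up for illustration and have nothing to do with Tomcat's code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the Servlet API.
final class CharsetMismatchDemo {

    // Encode the text as UTF-8 (what the client sent), then decode the
    // bytes as ISO-8859-1 (what the container used by default).
    static String decodeAsLatin1(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String sent = "caf\u00e9"; // "cafe" with an acute accent on the e
        String seen = decodeAsLatin1(sent);
        // Each byte of a multi-byte UTF-8 sequence becomes its own character.
        System.out.println(sent.length() + " chars sent, "
                + seen.length() + " chars seen");
    }
}
```

The application sees the mangled string while request.getCharacterEncoding() still claims "UTF-8", which is what made this hard to track down.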

My proposal is to have Tomcat set the request encoding field to "ISO-8859-1" in the following situation:


1. The character encoding is null

2. A method is called which requires that the character encoding be "committed"

Once that charset is determined, changing the request's charset has no effect whatsoever other than to confuse the application developer.

If Tomcat were to explicitly-set that encoding, my CharacterEncodingFilter wouldn't detect null and override that request charset with "UTF-8", thereby lying to the rest of the application.
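A toy model of the proposed behavior might look like the following. This is only a sketch under my own assumptions: ToyRequest and commitEncoding() are hypothetical stand-ins, not Tomcat's Request class or any existing method.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Toy stand-in for a servlet request; hypothetical, not Tomcat's Request.
final class ToyRequest {
    private Charset charset;    // null until set explicitly or committed
    private boolean committed;  // true once the encoding has been used

    String getCharacterEncoding() {
        return charset == null ? null : charset.name();
    }

    // Changes are ignored once the encoding has been committed.
    void setCharacterEncoding(String enc) {
        if (committed) {
            return;
        }
        charset = Charset.forName(enc);
    }

    // Stands in for any call that forces the encoding to be chosen,
    // such as reading a request parameter.
    void commitEncoding() {
        if (charset == null) {
            // Record the default that is actually being used.
            charset = StandardCharsets.ISO_8859_1;
        }
        committed = true;
    }
}
```

With this model, a filter that runs after commitEncoding() sees "ISO-8859-1" rather than null, so its "null means use the default" branch is never taken.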

I might even go so far as to propose that calling request.setCharacterEncoding() after the encoding has been committed should throw IllegalStateException. (This may be a violation of the spec, as the javadoc only declares UnsupportedEncodingException).
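For comparison, the stricter variant could be sketched like this. Again, StrictEncodingHolder and commit() are hypothetical names used only for illustration, and whether the spec would permit this is an open question.

```java
import java.nio.charset.Charset;

// Hypothetical holder for a request's charset; not part of Tomcat
// or the Servlet API.
final class StrictEncodingHolder {
    private Charset charset;
    private boolean committed;

    // Called when parameters are parsed or a reader is obtained.
    void commit() {
        committed = true;
    }

    // Reject late changes loudly instead of silently ignoring them.
    void setCharacterEncoding(String enc) {
        if (committed) {
            throw new IllegalStateException(
                    "Character encoding already committed; cannot change to " + enc);
        }
        charset = Charset.forName(enc);
    }

    String getCharacterEncoding() {
        return charset == null ? null : charset.name();
    }
}
```

Failing fast would have pointed straight at the misordered filter, at the cost of breaking applications that currently make the late call without noticing.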


The javadoc states:

"

Overrides the name of the character encoding used in the body of this request. This method must be called prior to reading request parameters or reading input using getReader(). *Otherwise, it has no effect.*

"

(emphasis mine)

In Tomcat, if you call setCharacterEncoding("UTF-8") after request parameters have been read, there *is* an effect: you change the value of the charset field in the request.


This is the current implementation of setCharacterEncoding:

    /**
     * Overrides the name of the character encoding used in the body of
     * this request.  This method must be called prior to reading request
     * parameters or reading input using <code>getReader()</code>.
     *
     * @param enc The character encoding to be used
     *
     * @exception UnsupportedEncodingException if the specified encoding
     *  is not supported
     *
     * @since Servlet 2.3
     */
    public void setCharacterEncoding(String enc)
            throws UnsupportedEncodingException {
        if (usingReader) {
            return;
        }

        // Confirm that the encoding name is valid
        Charset charset = B2CConverter.getCharset(enc);

        // Save the validated encoding
        coyoteRequest.setCharset(charset);
    }

The javadoc says that it must be called before reading any request parameters OR calling getReader(), but there is only a check for the reader.


Maybe we should change the check to:

        if (usingReader || parametersParsed) {
            return;
        }

And also, change Request.parseParameters to add:

            // getCharacterEncoding() may have been overridden to search for
            // hidden form field containing request encoding
            Charset charset = getCharset();

            // Add this line, here:
            coyoteRequest.setCharset(charset);

It is at this point that the character set is truly committed, at least when parsing parameters.

In getReader, we have similar logic, where we set the character set for the request after determining what it actually is:


        if (coyoteRequest.getCharacterEncoding() == null) {
            // Nothing currently set explicitly.
            // Check the content
            Context context = getContext();
            if (context != null) {
                String enc = context.getRequestCharacterEncoding();
                if (enc != null) {
                    // Explicitly set the context default so it is visible to
                    // InputBuffer when creating the Reader.
                    setCharacterEncoding(enc);
                }
            }
        }

I'm open to any or all of the above, but I think something should be done. It was surprising to see that the request's charset was "UTF-8" and yet the UTF-8 bytes sent to the container were coming out mangled in the application.

If the application had seen null coming back from request.getCharacterEncoding(), that would at least have given a clue as to what was happening. Since the effective charset being used was ISO-8859-1, it would have been better to return "ISO-8859-1" instead of null, knowing that the parameters had *already* been interpreted using that character set.


Thanks,
-chris
