All,
Yesterday, I struggled to determine why my application was behaving
differently than it had been in the past, and the problem turned out to
be that I had inserted a <filter> in the filter-chain before my
CharacterEncodingFilter. My new <filter> was reading a
request-parameter. The CharacterEncodingFilter looks like this:
public void doFilter(ServletRequest request,
ServletResponse response,
FilterChain chain)
throws IOException, ServletException
{
request.setCharacterEncoding(getCharacterEncoding(request));
chain.doFilter(request, response);
}
protected String getCharacterEncoding(ServletRequest request)
{
String charset=request.getCharacterEncoding();
if(null == charset)
return this.getDefaultEncoding();
else
return charset;
}
The "default encoding" is essentially always UTF-8.
When looking at the request in my servlet, the character encoding
reported by the request was "UTF-8", but the actual encoding used
appeared to be ISO-8859-1 (the protocol default).
It looks like Tomcat is defaulting to ISO-8859-1 but continuing to
return null for request.getCharacterEncoding().
My proposal is to have Tomcat set the request encoding field to
"ISO-8859-1" in the following situation:
1. The character encoding is null
2. A method is called which requires that the character encoding be
"committed"
Once that charset is determined, changing the request's charset has no
effect whatsoever other than to confuse the application developer.
If Tomcat were to explicitly-set that encoding, my
CharacterEncodingFilter wouldn't detect null and override that request
charset with "UTF-8", thereby lying to the rest of the application.
I might even go so far as to propose that calling
request.setCharacterEncoding() after the encoding has been committed
should throw IllegalStateException. (This may be a violation of the
spec, as the javadoc only declares UnsupportedEncodingException).
The javadoc state:
"
Overrides the name of the character encoding used in the body of this
request. This method must be called prior to reading request parameters
or reading input using getReader(). *Otherwise, it has no effect.*
"
(emphasis mine)
In Tomcat, if you call setCharacterEncoding("UTF-8") after request
parameters have been read, there *is* an effect: you change the value of
the charset field in the response.
This is the current implementation of setCharacterEncoding:
/**
* Overrides the name of the character encoding used in the body of
* this request. This method must be called prior to reading request
* parameters or reading input using <code>getReader()</code>.
*
* @param enc The character encoding to be used
*
* @exception UnsupportedEncodingException if the specified encoding
* is not supported
*
* @since Servlet 2.3
*/
public void setCharacterEncoding(String enc) throws
UnsupportedEncodingException {
if (usingReader) {
return;
}
// Confirm that the encoding name is valid
Charset charset = B2CConverter.getCharset(enc);
// Save the validated encoding
coyoteRequest.setCharset(charset);
}
The javadoc says that it must be called before reading any request
parameters OR calling getReader() but there is only a check for the reader.
Maybe we should change the check to:
if (usingReader || parametersParsed) {
return;
}
And also, change Request.parseParameters to add:
// getCharacterEncoding() may have been overridden to
search for
// hidden form field containing request encoding
Charset charset = getCharset();
// Add this line, here:
coyoteRequest.setCharset(charset);
It is at this point that the character set is truly committed, at least
when parsing parameters.
In getReader, we have similar logic, where we set the character set for
the request after determining what it actually is:
if (coyoteRequest.getCharacterEncoding() == null) {
// Nothing currently set explicitly.
// Check the content
Context context = getContext();
if (context != null) {
String enc = context.getRequestCharacterEncoding();
if (enc != null) {
// Explicitly set the context default so it is
visible to
// InputBuffer when creating the Reader.
setCharacterEncoding(enc);
}
}
}
I'm open to any or all of the above, but I think something should be
done. It was surprising to see that the request's charset was "UTF-8"
and yet the UTF-8 bytes sent to the container were coming out mangled in
the application.
If the application were to see null coming back from
request.getCharacterEncoding() would have at least giving a clue as to
what was happening. Since the effective charset being used was
ISO-8859-1, it would have been better to return "ISO-8859-1" instead of
"null" knowing that the parameters had *already* been interpreted using
that character set.
Thanks,
-chris
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org