Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tomcat Wiki" for change 
notification.

The "FAQ/CharacterEncoding" page has been changed by GarretWilson:
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding?action=diff&rev1=26&rev2=27

Comment:
Updated sections related to percent encoding charset of HTML form posts.

  
  <<Anchor(Q9)>>'''Why does everything have to be this way?'''
  
- Everything covered in this page comes down to practical interpretation of a 
number of specifications. When working with Java servlets, the Java Servlet 
Specification is the primary reference, but the servlet spec itself relies on 
older specifications such as HTTP for its foundation. Here are a couple of 
references before we cover exactly where these items are located in them.
+ Everything covered in this page comes down to practical interpretation of a 
number of specifications. When working with Java servlets, the Java Servlet 
Specification is the primary reference, but the servlet spec itself relies on 
older specifications such as HTTP for its foundation. Here are a couple of 
references before we cover exactly where these items are located in them. A 
more detailed list can be found on the 
[[https://wiki.apache.org/tomcat/Specifications|Specifications]] page.
  
+  1. [[https://www.jcp.org/en/jsr/detail?id=369|Java Servlet Specification 
4.0]]
+  1. [[https://tools.ietf.org/html/rfc7230|HTTP 1.1 Protocol: Message Syntax 
and Routing]], [[https://tools.ietf.org/html/rfc7231|HTTP 1.1 Protocol: 
Semantics and Content]] …
-  1. [[http://jcp.org/aboutJava/communityprocess/mrel/jsr154/index2.html|Java 
Servlet Specification 2.5]]
-  1. [[http://jcp.org/aboutJava/communityprocess/final/jsr154/index.html|Java 
Servlet Specification 2.4]]
-  1. [[http://www.w3.org/Protocols/rfc2616/rfc2616.txt|HTTP 1.1 Protocol]] 
([[http://www.w3.org/Protocols/rfc2616/rfc2616.html|hyperlinked version]])
-  1. [[http://www.ietf.org/rfc/rfc2396.txt|URI Syntax]]
+  1. [[https://tools.ietf.org/html/rfc3986|URI Syntax]]
-  1. [[http://www.w3.org/Protocols/rfc822/|ARPA Internet Text Messages]]
+  1. [[https://tools.ietf.org/html/rfc822|ARPA Internet Text Messages]]
-  1. [[http://www.w3.org/TR/html4|HTML 4]]
+  1. [[https://www.w3.org/TR/html4/|HTML 4]], 
[[https://www.w3.org/TR/html/|HTML 5]]
  
  ''Default encoding for request and response bodies''
  
@@ -47, +46 @@

  
  ''Default encoding for GET''
  
- The character set for HTTP query strings (that's the technical term for 'GET 
parameters') can be found in sections 2 and 2.1 the "URI Syntax" specification. 
The character set is defined to be 
[[http://en.wikipedia.org/wiki/ASCII|US-ASCII]]. Any character that does not 
map to US-ASCII must be encoded in some way. Section 2.1 of the URI Syntax 
specification says that characters outside of US-ASCII must be encoded using 
`%` escape sequences: each character is encoded as a literal `%` followed by 
the two hexadecimal codes which indicate its character code. Thus, `a` 
(US-ASCII character code 97 = 0x61) is equivalent to `%61`. There ''is no 
default encoding for URIs'' specified anywhere, which is why there is a lot of 
confusion when it comes to decoding these values.
+ The character set for HTTP query strings (that's the technical term for 'GET 
parameters') can be found in sections 2 and 2.1 of the "URI Syntax" 
specification. The character set is defined to be 
[[http://en.wikipedia.org/wiki/ASCII|US-ASCII]]. Any character that does not 
map to US-ASCII must be encoded in some way. Section 2.1 of the URI Syntax 
specification says that characters outside of US-ASCII must be encoded using 
`%` escape sequences: each octet is encoded as a literal `%` followed by the 
two hexadecimal digits of its value. Thus, `a` (US-ASCII character code 97 = 
0x61) is equivalent to `%61`. Although the URI specification does not mandate 
a default encoding for percent-encoded octets, it recommends UTF-8, especially 
for new URI schemes, and most modern user agents have settled on UTF-8 for 
percent-encoding URI characters.
  
  Some notes about the character encoding of URIs:
-  1. ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so 
they are often used interchangeably. Most of the web uses ISO-8859-1 as the 
default for query strings.
+  1. ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so 
they are often used interchangeably.
-  1. Many browsers are starting to offer (default) options of encoding URIs 
using UTF-8 instead of ISO-8859-1. Some browsers appear to use the encoding of 
the current page to encode URIs for links (see the note above regarding browser 
behavior for POST encoding).
+  1. Modern browsers encode URIs using UTF-8. Some browsers appear to use 
the encoding of the current page to encode URIs for links.
-  1. [[http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars|HTML 
4.0]] recommends the use of UTF-8 to encode the query string.
+  1. [[https://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars|HTML 
4.0]] recommends the use of UTF-8 to encode the query string.
   1. When in doubt, use POST for any data you think might have problems 
surviving a trip through the query string.
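
To make the effect of the charset choice concrete, here is a minimal sketch using only the JDK's `java.net.URLEncoder`. The class name `PercentEncodingDemo` and the helper `encode` are illustrative names, not part of Tomcat or the servlet API. Note that `URLEncoder` produces `application/x-www-form-urlencoded` output (a space becomes `+`), which is close to, but not identical with, generic URI percent-encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative helper class (hypothetical name, not a Tomcat API).
class PercentEncodingDemo {

    // Percent-encodes a value using the bytes of the given charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "cafe" with an acute accent on the e (U+00E9)

        // The same character yields different octets, and therefore
        // different escape sequences, depending on the charset.
        System.out.println(encode(value, "UTF-8"));      // caf%C3%A9
        System.out.println(encode(value, "ISO-8859-1")); // caf%E9

        // Unreserved ASCII characters pass through unchanged in either charset.
        System.out.println(encode("a", "UTF-8"));        // a
    }
}
```

The receiver sees only the escape sequences, not the charset that produced them, which is why both sides have to agree on one.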
  
  ''Default Encoding for POST''
  
- [[http://en.wikipedia.org/wiki/Iso-8859-1|ISO-8859-1]] is defined as the 
default character set for HTTP request and response bodies in the servlet 
specification (request encoding: section 4.9 for spec version 2.4, section 3.9 
for spec version 2.5; response encoding: section 5.4 for both spec versions 2.4 
and 2.5). This default is historical: it comes from sections 3.4.1 and 3.7.1 of 
the HTTP/1.1 specification.
+ Older versions of the HTTP/1.1 specification (e.g. 
[[https://tools.ietf.org/html/rfc2616|RFC 2616]]) indicated that 
[[https://en.wikipedia.org/wiki/ISO/IEC_8859-1|ISO-8859-1]] is the default 
charset for text-based HTTP request and response bodies if no charset is 
indicated. Although [[https://tools.ietf.org/html/rfc7231|RFC 7231]] removed 
this default, the servlet specification continues to use it. Thus the 
servlet specification indicates that if a `POST` request does not indicate an 
encoding, it must be processed as `ISO-8859-1`, except for 
`application/x-www-form-urlencoded`, which by default should be interpreted as 
`US-ASCII` (as it by definition should contain only characters within the 
ASCII range to begin with).
  
  Some notes about the character encoding of a POST request:
-  1. Section 3.4.1 of HTTP/1.1 states that recipients of an HTTP message 
''must'' respect the character encoding specified by the sender in the 
`Content-Type` header if the encoding is supported. A missing character allows 
the recipient to "guess" what encoding is appropriate.
+  1. RFC 2616 Section 3.4.1 stated that recipients of an HTTP message ''must'' 
respect the character encoding specified by the sender in the `Content-Type` 
header if the encoding is supported. A missing charset allows the recipient 
to "guess" what encoding is appropriate.
   1. Most web browsers today ''do not'' specify the character set of a 
request, even when it is something other than ISO-8859-1. This seems to be in 
violation of the HTTP specification. Most web browsers appear to send a request 
body using the encoding of the page used to generate the POST (for instance, 
the <form> element came from a page with a specific encoding... it is ''that'' 
encoding which is used to submit the POST data for that form).
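
The default rules described in this section can be sketched as a small lookup. This is only an illustration of the rules as stated on this page, not the code any servlet container actually uses; the class `RequestCharsetDefaults` and the method `effectiveCharset` are hypothetical names, and the header parsing is deliberately simplistic.

```java
import java.util.Locale;

// Hypothetical helper illustrating the defaults described above.
class RequestCharsetDefaults {

    // Returns the charset named in a Content-Type header value, or the
    // default described on this page when the header names none.
    static String effectiveCharset(String contentType) {
        if (contentType == null) {
            return "ISO-8859-1";
        }
        String[] parts = contentType.split(";");

        // Parameters follow the media type, separated by ';'.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }

        // No explicit charset: fall back according to the media type.
        String mediaType = parts.length > 0 ? parts[0].trim().toLowerCase(Locale.ROOT) : "";
        if (mediaType.equals("application/x-www-form-urlencoded")) {
            return "US-ASCII";
        }
        return "ISO-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(effectiveCharset("text/plain; charset=UTF-8"));         // UTF-8
        System.out.println(effectiveCharset("application/x-www-form-urlencoded")); // US-ASCII
        System.out.println(effectiveCharset("text/plain"));                        // ISO-8859-1
    }
}
```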
+ 
+ ''Percent Encoding for `application/x-www-form-urlencoded`''
+ 
+ The [[https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1|HTML 
4.01]] specification indicated that percent-encoding of non-ASCII characters 
of `application/x-www-form-urlencoded` (the default content type for HTML form 
submissions) should be performed using `US-ASCII` byte sequences. However, 
[[https://url.spec.whatwg.org/#concept-urlencoded-serializer|HTML 5]] changed 
this to use UTF-8 byte sequences, matching the modern percent-encoding for 
URLs. Modern browsers therefore percent-encode UTF-8 sequences when submitting 
forms using `application/x-www-form-urlencoded`.
+ 
+ The servlet specification, however, requires servlet containers to interpret 
percent-encoded sequences in `application/x-www-form-urlencoded` as 
`ISO-8859-1`, which in a default configuration will result in corrupted content 
because of the charset mismatch. See below for how this can be reconfigured in 
Tomcat.
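
The corruption caused by this mismatch is easy to reproduce outside a container. The following is a minimal sketch using the JDK's `java.net.URLDecoder`; the class `FormDecodingMismatchDemo` and its `decode` helper are illustrative names only, and the sample value is an assumption chosen for the demonstration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative demo class (hypothetical name, not a Tomcat API).
class FormDecodingMismatchDemo {

    // Decodes percent-encoded form data, interpreting the octets in the given charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // What a browser that percent-encodes UTF-8 sends for "caf\u00e9".
        String sent = "caf%C3%A9";

        // Interpreting the two octets 0xC3 0xA9 as ISO-8859-1 yields two
        // characters (U+00C3, U+00A9) instead of the single U+00E9.
        String asLatin1 = decode(sent, "ISO-8859-1");
        String asUtf8 = decode(sent, "UTF-8");

        System.out.println(asLatin1.length());              // 5
        System.out.println(asUtf8.length());                // 4
        System.out.println(asUtf8.equals("caf\u00e9"));     // true
    }
}
```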
+ 
  
  ''HTTP Headers''
  
@@ -89, +95 @@

  
  <<Anchor(Q3)>>'''How do I change how POST parameters are interpreted?'''
  
- POST requests should specify the encoding of the parameters and values they 
send. Since many clients fail to set an explicit encoding, the default is used 
(ISO-8859-1). In many cases this is not the preferred interpretation so one can 
employ a javax.servlet.Filter to set request encodings. Writing such a filter 
is trivial.
+ `POST` requests should specify the encoding of the parameters and values they 
send. Since many clients fail to set an explicit encoding, the default is 
used: `US-ASCII` for `application/x-www-form-urlencoded` and `ISO-8859-1` for 
all other content types.
+ 
+ In addition, the servlet specification requires that percent-encoded 
sequences of `application/x-www-form-urlencoded` be interpreted as `ISO-8859-1` 
by default, which, as explained above, does not match the HTML 5 specification 
and modern user agent practice of using UTF-8 to percent-encode characters. 
Nevertheless, the servlet specification requires the servlet container's 
interpretation of percent-encoded sequences of 
`application/x-www-form-urlencoded` to follow any configured character 
encoding. Thus appropriate interpretation of 
`application/x-www-form-urlencoded` byte sequences can be achieved by setting 
the request character encoding to `UTF-8`.
+ 
+ The container-agnostic approach for specifying the request character encoding 
is to set the `<request-character-encoding>` element in the web application 
`web.xml` file:
+ 
+ {{{<request-character-encoding>UTF-8</request-character-encoding>}}}
+ 
+ '''''Note''''': If you are using the Eclipse integrated development 
environment, as of Eclipse Enterprise Java Developers 2019-03 M1 (4.11.0 M1) 
the IDE does not recognize the `<request-character-encoding>` setting; it will 
temporarily freeze and generate errors on any edit of web application files. 
You can track the latest status of this problem at 
[[https://bugs.eclipse.org/bugs/show_bug.cgi?id=543377|Eclipse Bug 543377]].
+ 
+ Otherwise one can employ a `javax.servlet.Filter`. Writing such a filter is 
trivial.
   6.x, 7.x::
+ Tomcat already comes with such an example filter. Please take a look at 
`webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java`.
- Tomcat already comes with such an example filter. Please take a look at:
- {{{
- webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
- }}}
   5.5.36+, 6.0.36+, 7.0.20+, 8.x::
- Since Tomcat 7.0.20, 6.0.36 and 5.5.36 the filter became first-class citizen 
and was moved from the examples into core Tomcat and is available to any web 
application without the need to compile and bundle it separately.
+ Since Tomcat 7.0.20, 6.0.36 and 5.5.36 the filter has been a first-class 
citizen: it was moved from the examples into core Tomcat and is available to 
any web application without the need to compile and bundle it separately. Note, 
however, that a web application that declares this filter in its own `web.xml` 
file cannot be deployed in non-Tomcat servlet containers that do not have the 
filter available.
- See documentation for the list of 
[[http://tomcat.apache.org/tomcat-8.0-doc/config/filter.html|filters]] provided 
by Tomcat. The class name is:
+ See the documentation for the list of 
[[http://tomcat.apache.org/tomcat-8.0-doc/config/filter.html|filters]] provided 
by Tomcat. The class name is 
`org.apache.catalina.filters.SetCharacterEncodingFilter`.
+ 
+ It is also possible to define such a filter in the Tomcat installation 
configuration file `conf/web.xml`, which would set the request character 
encoding across all web applications without the need for any `web.xml` 
modifications. In fact the latest Tomcat versions come with sections in 
`web.xml` that already configure a filter to set the request character encoding 
to `UTF-8`. Simply edit `conf/web.xml` and uncomment both the definition and 
the mapping of the filter named `setCharacterEncodingFilter`.
- {{{
- org.apache.catalina.filters.SetCharacterEncodingFilter
- }}}
  
  '''''Note''''': The request encoding setting is effective only if it is set 
before parameters are parsed. Once parsing happens, there is no way back. 
Parameter parsing is triggered by the first method that asks for a parameter 
name or value. Make sure that the filter is positioned before any other filters 
that ask for request parameters. The positioning depends on the order of 
`filter-mapping` declarations in the WEB-INF/web.xml file, though since the 
Servlet 3.0 specification there are additional options to control the order. To 
check the actual order you can throw an Exception from your page and check its 
stack trace for filter names.
  

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org
