Re: RFC6265, cookie parsing and UTF-8

Mark Thomas Wed, 27 Aug 2014 02:30:07 -0700

On 26/08/2014 22:52, Filip Hanik wrote:
> On Tue, Aug 26, 2014 at 12:53 PM, Mark Thomas <ma...@apache.org> wrote:
> 
>> One of the aims of the proposed cookie changes [1] was to deal with the
>> HTML 5 changes that mean UTF-8 can appear in cookie headers.
>>
>> This has some potentially large implications for Tomcat.
>>
> 
> Since we already are in the 8.0.x release cycle, I, as an end user/system
> administrator, would expect parsing would remain 100% backwards compatible
> for version 8.0.x+n (n=1...)


+1

>> Currently, Tomcat handles cookies as MessageBytes, processing everything
>> in bytes and only converting to String when necessary. This is largely
>> possible because of the assumption that everything is ASCII.
>>
>> Introduce UTF-8 and processing everything in bytes gets a whole lot
>> harder. You essentially have to decode to UTF-8 to ensure that you have
>> valid data - at which point why not just use Strings anyway?
>>
>> I am currently leaning towards removing a lot of the current cookie
>> header caching/recycling and doing something along the following lines:
>>
> 
> all that caching/recycling is to avoid GC cycles and was in the past a
> crucial performance optimization.
> back in those days, with the hardware that was available in 06-07, we were
> pushing a single Tomcat instance to 60k requests per second.
> creating new objects was painfully expensive at that rate.

Fairly recently I did some work on reducing GC when Tomcat was being
hammered with large numbers of requests, so I agree this is an issue we
still need to keep an eye on.

>> - Lazy parsing as currently (but unless cookie based session tracking is
>>   disabled this is going to run on every request)
>>
> 
> but our cookie, JSESSIONID, doesn't have to be UTF-8, does it?
> this goes hand in hand with the SessionIdGenerator that Rainer just did,
> can that return UTF-8 values?
> So the lazy part can apply to all other cookies, meaning, don't parse it
> until the app requests it, just store the bytes and move on.

Good news: I don't believe the session IDs are UTF-8.

Bad news: The issue is that if there is a chance of UTF-8 in the header
then you can't simply split the header into individual cookies based on
the separator byte since you can't tell (without decoding to characters)
if a byte represents the separator or is part of a sequence of several
bytes representing some other character.
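
A minimal sketch of the decode-first approach, assuming the whole Cookie
header is available as a byte array: decode it strictly as UTF-8 (so
malformed input is rejected rather than silently replaced) and only then
split on the separator. The class and method names are made up for
illustration and are not Tomcat code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tomcat): decode first, then split.
final class CookieHeaderDecodeSketch {

    // Strictly decodes the raw header bytes as UTF-8.
    // Malformed byte sequences cause an IllegalArgumentException.
    static String decodeStrict(byte[] rawHeader) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(rawHeader)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Cookie header is not valid UTF-8", e);
        }
    }

    // Splits an already-decoded header into trimmed name=value pairs on ';'.
    // Simplification: quoted values containing ';' are not handled.
    static List<String> splitPairs(String header) {
        List<String> pairs = new ArrayList<>();
        for (String part : header.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                pairs.add(trimmed);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        byte[] raw = "JSESSIONID=ABC123; city=Z\u00fcrich"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(splitPairs(decodeStrict(raw)));
    }
}
```

The cost is one String per header on every request that reads cookies,
which is the GC trade-off discussed above.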

Aside: I think putting UTF-8 into HTTP headers is a crazy idea but that
ship has sailed and we have to deal with it.

>> - Convert headers to UTF-8 strings
>> 
>> - Parse them with a new parser along the lines of o.a.t.u.http.parser
>> - Have that parser return an array of javax.servlet.http.Cookie objects
>> - Pass those to the app if/when requested
>>
>> In terms of handling RFC6265 and RFC2109 my plan is to have two parsers,
>> share as much code as possible and switch between them based on the
>> cookie header with the expectation that 99.9% of cookies will be parsed
>> by the RFC6265 parser. We could add some options to this switching to
>> enable other parsers (e.g. a Netscape parser) to be used.
>>
> 
> I like the idea of swappable parsers, with the default being the exact
> behavior you see now. I can see changing the default after some
> stabilization.

Same here.
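
For illustration only, a rough sketch of what switching between parsers
based on the cookie header could look like. It assumes the switch keys off
a leading $Version attribute (which RFC 2109 Cookie headers start with);
the thread does not say how the real switch would work. Every name below
is hypothetical and none of it is the o.a.t.u.http.parser API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser contract: decoded header in, name/value pairs out.
interface CookieParserSketch {
    Map<String, String> parse(String header);
}

final class CookieParserSelector {

    // Lenient RFC 6265-style parsing: name=value pairs separated by ';'.
    static final CookieParserSketch RFC6265 = header -> {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : header.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(),
                        pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    };

    // Simplified RFC 2109-style parsing: reuses the splitting above, then
    // drops $-prefixed attributes ($Version, $Path, ...) and strips quotes.
    // Simplification: ',' separators and ';' inside quotes are not handled.
    static final CookieParserSketch RFC2109 = header -> {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : RFC6265.parse(header).entrySet()) {
            if (e.getKey().startsWith("$")) {
                continue;
            }
            String value = e.getValue();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            cookies.put(e.getKey(), value);
        }
        return cookies;
    };

    // Assumed switching rule: a leading $Version selects the legacy parser.
    static CookieParserSketch select(String header) {
        return header.trim().startsWith("$Version") ? RFC2109 : RFC6265;
    }

    public static void main(String[] args) {
        String header = "$Version=\"1\"; user=\"bob\"; $Path=\"/app\"";
        System.out.println(select(header).parse(header));
    }
}
```

Keeping the selection in one place would also make it easy to hang a
configuration option off it for the current parser or a Netscape one.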


>> I'd also like to keep the current cookie parsing implementation for now.
>> Until we are happy with the new parsing, the current implementation will
>> be the default. Once we are happy with the new parsing we can change the
>> default. We can add an option to switch between the current and the new
>> parsing.
>>
>> Thoughts?
>>
> 
> knock it out.

That is the plan :)

Mark


