[ 
https://issues.apache.org/jira/browse/SOLR-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13544644#comment-13544644
 ] 

Dawid Weiss commented on SOLR-4265:
-----------------------------------

bq. Yeah, it's actually "double encoded"... 

I know. I wasn't talking about the HTTP layer; I was referring to the servlet 
spec, where the encoding for query string parsing in URIs is not defined and 
form-encoded POSTs are to use ISO-8859-1 by default. So neither Jetty nor 
Tomcat is "right" in how it handles parameter parsing from URIs; this was just 
underspecified from the beginning.
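
To illustrate the mismatch, here is a minimal Java sketch. It assumes Java 10+ 
(for the Charset overload of URLDecoder.decode); the class and method names are 
made up for this example and are not Solr, Jetty or Tomcat code. It decodes the 
same escaped query value once as UTF-8 and once as ISO-8859-1:

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Solr or any servlet container.
public class DecodeMismatchSketch {

    // Unescape %XX sequences and interpret the resulting bytes in the given charset.
    static String decode(String escaped, Charset charset) {
        return URLDecoder.decode(escaped, charset);
    }

    public static void main(String[] args) {
        // "m\u00eame" (the word from the issue) as UTF-8 bytes, URL-escaped.
        String escaped = "m%C3%AAme";
        System.out.println(decode(escaped, StandardCharsets.UTF_8));
        System.out.println(decode(escaped, StandardCharsets.ISO_8859_1));
    }
}
```

With ISO-8859-1 the two UTF-8 bytes of the accented letter come out as two 
separate Latin-1 characters, which is the kind of garbling the issue describes.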

When you add the HTTP layer on top, things get even more confusing because, 
according to the spec, HTTP headers must be in US-ASCII. This includes the 
request line carrying the URI, so any character outside of US-ASCII (including 
your example) is a violation of the HTTP protocol. In practice (perhaps sadly) 
the rules and implementations are much more relaxed and just accept any bytes 
up to the end of line (as per your curl example).
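
As a rough sketch of what "must be in US-ASCII" means for a request target, the 
following Java snippet checks whether a string can be encoded as US-ASCII at 
all. The class name, method name and sample paths are assumptions made for 
illustration only:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the sample request targets are made up.
public class AsciiRequestLineSketch {

    // True if every character of the string is representable in US-ASCII.
    static boolean isUsAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // Escaped form: only ASCII characters on the wire.
        System.out.println(isUsAscii("/select?q=m%C3%AAme"));
        // Raw accented character: not representable in US-ASCII.
        System.out.println(isUsAscii("/select?q=m\u00eame"));
    }
}
```

A strict server could reject the second form; as noted above, most 
implementations simply pass the bytes through.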

In theory it should be quite simple: if the encoding cannot be determined 
otherwise (from the origin page for forms, etc.) then it should be UTF-8 
encoded, then URL-escaped. In practice you get all the combinations of 
behaviors both on the browser/http client side and on the server side.
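
The "UTF-8 encode, then URL-escape" rule can be sketched in Java as follows. 
This assumes Java 10+ for the Charset overload of URLEncoder.encode, and the 
class and method names are hypothetical. Note that URLEncoder implements HTML 
form encoding, so it also turns spaces into "+", which is fine for query 
parameters but not for path segments:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for escaping a single query parameter value.
public class QueryEscapeSketch {

    // Encode the value as UTF-8 bytes, then percent-escape them.
    static String escape(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(escape("\u0142"));    // Polish l with stroke -> %C5%82
        System.out.println(escape("m\u00eame")); // word from the issue -> m%C3%AAme
    }
}
```

The result contains only ASCII characters, so it is safe to put on the HTTP 
request line regardless of how relaxed the server is.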

Speaking of which, I think I know what IE is trying to do. Since the URL 
you're typing in the browser is pretty much an indication of where to send a 
GET request plus the resource locator (URI) of the resource you're trying to 
access (which will end up in the HTTP header section), it's probably trying its 
best to convert non-ASCII characters to their ASCII equivalents. So "ł" becomes 
"l", etc. If my explanation is anywhere near right, then you have to admit it 
does make some sense... although it is absolutely useless and absurd in 
practical terms, since any resource accessible via "ł" in its path shouldn't be 
aliased to "l": (pl) "łał" -> (en) "whoa", "lał" -> ([he] was pissing)...

Talk about standards, eh? :)

                
> Fix decoding of GET/POST parameters for servlet containers with non-UTF-8 URL 
> parsing (Tomcat)
> ----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-4265
>                 URL: https://issues.apache.org/jira/browse/SOLR-4265
>             Project: Solr
>          Issue Type: Bug
>          Components: web gui
>    Affects Versions: 4.0
>         Environment: Windows, but environment independent
>            Reporter: Alex Rocher
>            Assignee: Uwe Schindler
>         Attachments: SOLR-4265.patch, SOLR-4265.patch, 
> SolrDispatchFilter.java.patch
>
>
> When you type an accented character (in French, for example) in the console 
> query tester, there's no charset conversion (servlet request charset conversion).
> E.g.: "même" is converted into its ISO-8859-1 representation ==> fail
> The reason: getCharacterEncoding from HTTPRequest is not tested. If it's 
> null, it will assume a UTF-8 charset for the conversion.

