[ 
https://issues.apache.org/jira/browse/SOLR-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13544774#comment-13544774
 ] 

Uwe Schindler commented on SOLR-4265:
-------------------------------------

bq. The URL is invalid in the HTML source but browsers will "fix" it for you 
(url-escape) and many people accept it as something ordinary.

That's the whole problem, I agree. But we don't know which encoding the browser 
used when it escaped the URL, so without a charset given somewhere we cannot 
decode the URL. You have this problem with both GET and POST requests. Because 
of this, it cannot be fixed in Solr.
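
To illustrate the ambiguity, here is a minimal Java sketch. It is not taken 
from the patch; the class and helper names are made up for this example. It 
escapes the word "même" from the issue description once as UTF-8 and once as 
ISO-8859-1, then decodes the UTF-8 form with the wrong charset. Nothing in 
either escaped string says which charset produced it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of Solr.
class UrlCharsetAmbiguity {

    // Percent-encodes s using the bytes of the given charset.
    static String encode(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Percent-decodes s and interprets the resulting bytes in the given charset.
    static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String word = "m\u00eame"; // "même"

        String asUtf8 = encode(word, "UTF-8");        // m%C3%AAme
        String asLatin1 = encode(word, "ISO-8859-1"); // m%EAme
        System.out.println(asUtf8 + " vs. " + asLatin1);

        // Decoding with the charset the sender used round-trips correctly.
        System.out.println(decode(asUtf8, "UTF-8").equals(word));

        // Decoding with the other charset silently yields a different string.
        System.out.println(decode(asUtf8, "ISO-8859-1").equals(word));
    }
}
```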

The patch does not change any of the existing charset-guessing behaviour; the 
*only* change here is the encoding used to decode URLs (which is now "UTF-8" 
because several web containers handle this in a different way). Jetty and 
Tomcat both handled POST content respecting the charset of the POST body, and 
that did not change.

Where is your problem with Solr? The whole discussion could be flame-wared on 
the Jetty or Tomcat lists as before; unfortunately the HTTP spec, the Servlet 
spec and the URL spec are not precise enough. For Solr it is not an issue: 
Solr is documented to only accept URL-encoded request parameters as UTF-8.

The only way to change this would be to do it like search engines do. They 
allow passing an extra "ie=" GET parameter that defines the "input encoding" 
of the URL parameters. In that case you could do a 2-step URL parsing approach 
(or use commons-codec): percent-decode the URL into a byte[], then interpret 
the "ie" parameter as US-ASCII and use it to decode the remaining parameters.
                
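The 2-step approach could look roughly like the following Java sketch. This is 
only an illustration under assumptions, not what the patch does: the class and 
method names are hypothetical, the fallback to UTF-8 when no "ie" parameter is 
present is my assumption, and the percent-decoding is done by hand instead of 
with commons-codec so the example stays self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr.
class TwoStepQueryDecoder {

    // Step 1: percent-decode into raw bytes; no charset is applied yet.
    // Assumes the raw query string itself is plain ASCII.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("truncated escape: " + s);
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("invalid escape: " + s);
                }
                out.write((hi << 4) | lo);
                i += 2;
            } else if (c == '+') {
                out.write(' '); // form encoding uses '+' for a space
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Step 2: read "ie" as US-ASCII, then decode every pair with that charset.
    // Simplification: if a key repeats, the last value wins.
    static Map<String, String> parse(String rawQuery) {
        List<byte[][]> pairs = new ArrayList<>();
        Charset charset = StandardCharsets.UTF_8; // assumed default without "ie"
        for (String part : rawQuery.split("&")) {
            if (part.isEmpty()) {
                continue;
            }
            int eq = part.indexOf('=');
            byte[] key = percentDecode(eq < 0 ? part : part.substring(0, eq));
            byte[] value = percentDecode(eq < 0 ? "" : part.substring(eq + 1));
            pairs.add(new byte[][] {key, value});
            if ("ie".equals(new String(key, StandardCharsets.US_ASCII))) {
                charset = Charset.forName(new String(value, StandardCharsets.US_ASCII));
            }
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (byte[][] pair : pairs) {
            params.put(new String(pair[0], charset), new String(pair[1], charset));
        }
        return params;
    }

    public static void main(String[] args) {
        // The same word, escaped by two clients using different charsets.
        System.out.println(parse("q=m%EAme&ie=ISO-8859-1").get("q"));
        System.out.println(parse("q=m%C3%AAme").get("q"));
    }
}
```

Because the charset is only applied after all pairs have been collected as 
bytes, the position of "ie" within the query string does not matter.
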
> Fix decoding of GET/POST parameters for servlet containers with non-UTF-8 URL 
> parsing (Tomcat)
> ----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-4265
>                 URL: https://issues.apache.org/jira/browse/SOLR-4265
>             Project: Solr
>          Issue Type: Bug
>          Components: web gui
>    Affects Versions: 4.0
>         Environment: Windows but, environment independent
>            Reporter: Alex Rocher
>            Assignee: Uwe Schindler
>         Attachments: CropperCapture[4].png, CropperCapture[5].png, 
> CropperCapture[6].png, SOLR-4265.patch, SOLR-4265.patch, SOLR-4265.patch, 
> SOLR-4265.patch, SOLR-4265.patch, SolrDispatchFilter.java.patch
>
>
> When you type an accent (in french language for example) in the console query 
> tester, there's no charset conversion (servlet request charset conversion)
> E.g.: "même" is converted into its ISO-8859-1 representation ==> fail
> The reason: getCharacterEncoding from HTTPRequest is not tested. If it's 
> null, it will assume to convert a UTF-8 encoding charset.

