[
https://issues.apache.org/jira/browse/SOLR-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546715#comment-13546715
]
Dawid Weiss commented on SOLR-4283:
-----------------------------------
bq. Jetty uses UTF-8 to parse HTTP requests.
I don't think it should (or does); if it did, it would be strange, because it
should treat HTTP headers as US-ASCII and not interpret anything.
I did the following experiment. I prepared a raw HTTP request with an "ł"
character inside (Unicode U+0141, UTF-8 sequence: C5 82), and wrote the
following simple JSP (servlet):
{code}
<%
  response.setContentType("text/plain; charset=UTF-8");
  String qs = request.getQueryString();
  if (qs == null) qs = "";
  // Hex-dump every char of the query string as the container provides it.
  for (char chr : qs.toCharArray()) {
    out.print(Integer.toHexString(chr));
    out.print(" ");
  }
  out.println();
%>
{code}
As you can see, it hexdumps all the char values from the query string as
provided by the container. It is unfortunate that the container HAS to provide
a String here, because that forces some kind of input bytes-to-chars
conversion. My opinion is that this should be an identity mapping of some sort;
copying bytes to chars (masking with 0xff) gives you a rough equivalent of
new String(inputQueryStringBytes, "ISO8859-1"). Such a string cannot be
converted back to bytes with qs.getBytes("UTF-8"), because those chars no
longer correspond to the original bytes (we don't actually know what they
were). Even if we assume they were proper UTF-8, they will no longer be
encoded as such.
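To make the identity-mapping point concrete, here is a small sketch (plain Java, not container code) showing that the ISO-8859-1 round trip is lossless, while re-encoding the same String as UTF-8 does not reproduce the original bytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class IdentityMapping {
    public static void main(String[] args) {
        // Raw query-string bytes: "aa" + UTF-8 for 'ł' (C5 82) + "bb".
        byte[] raw = {0x61, 0x61, (byte) 0xC5, (byte) 0x82, 0x62, 0x62};

        // ISO-8859-1 maps every byte to the char with the same value: lossless.
        String qs = new String(raw, StandardCharsets.ISO_8859_1);

        // Round-tripping through ISO-8859-1 recovers the original bytes...
        System.out.println(Arrays.equals(raw, qs.getBytes(StandardCharsets.ISO_8859_1))); // true

        // ...but re-encoding as UTF-8 does not: 0xC5 and 0x82 became the
        // characters U+00C5 and U+0082, each of which is two bytes in UTF-8.
        System.out.println(Arrays.equals(raw, qs.getBytes(StandardCharsets.UTF_8))); // false
    }
}
```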
This is Tomcat, for example:
{code}
$ nc6 localhost 8080 < request.http
nc6: using stream socket
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=EFBEE4830C7B91D979938E82C9CAB6CB; Path=/test/; HttpOnly
Content-Type: text/plain;charset=UTF-8
Content-Length: 30
Date: Tue, 08 Jan 2013 08:35:41 GMT
61 61 c5 82 62 62
{code}
This matches my expectation of an HTTP server -- it didn't corrupt anything,
and it didn't *interpret* anything, because it couldn't.
Now jetty9:
{code}
$ nc6 localhost 8080 < request.http
nc6: using stream socket
HTTP/1.1 200 OK
Date: Tue, 08 Jan 2013 08:42:49 GMT
Set-Cookie: JSESSIONID=1d4i8p9coo32y11wo09leai3iw;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/plain; charset=UTF-8
Content-Length: 24
Server: Jetty(9.0.0.M3)
61 61 3f 62 62
{code}
This is wrong; 3f is '?', so Jetty clearly tries to interpret the input
bytes from the HTTP request somehow and fails at it. What exactly it does,
I've no idea -- I'd have to check the code.
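I haven't checked what Jetty9 actually does, but one plausible way to end up with 0x3f is an intermediate re-encode through a charset that cannot represent the character: Java's String.getBytes() silently substitutes the replacement byte '?' (0x3F) for unmappable characters. A sketch:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementChar {
    public static void main(String[] args) {
        // 'ł' (U+0141) is not representable in US-ASCII; String.getBytes()
        // replaces unmappable characters with '?' (0x3F) instead of failing.
        byte[] ascii = "\u0141".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("%02x%n", ascii[0]); // 3f
    }
}
```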
With Jetty8, interestingly:
{code}
$ nc6 localhost 8080 < request.http
nc6: using stream socket
HTTP/1.1 200 OK
Date: Tue, 08 Jan 2013 08:57:56 GMT
Set-Cookie: JSESSIONID=12ivf935hblkk1452cmwpgntt9;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/plain;charset=UTF-8
Content-Length: 26
Server: Jetty(8.1.8.v20121106)
61 61 142 62 62
aałbb
{code}
Ha! So this is where Yonik's claim that Jetty parses these UTF-8 URIs
correctly came from...
I honestly think this is a BUG in Jetty8, though (with a regression to an even
worse bug in Jetty9...). So I stand by my claim that the conversion:
{code}
queryString.getBytes("UTF-8")
{code}
to recover the input byte stream is incorrect. A fix? There doesn't seem to be
one that works for all containers: for Tomcat you would go back char-to-byte
one by one and then parse the bytes as UTF-8; for Jetty8 the string is already
UTF-8-decoded; for Jetty9 it's corrupted from the start.
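For illustration, the Tomcat-side recovery could be sketched like this (the recover helper is hypothetical, not container API):

```java
import java.nio.charset.StandardCharsets;

public class RecoverTomcat {
    // For a Tomcat-style identity-mapped query string: go back char-to-byte
    // (ISO-8859-1 encode), then decode those raw bytes as UTF-8.
    static String recover(String qs) {
        return new String(qs.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate what Tomcat hands the servlet for raw bytes 61 61 c5 82 62 62.
        String fromTomcat = new String(
            new byte[] {0x61, 0x61, (byte) 0xC5, (byte) 0x82, 0x62, 0x62},
            StandardCharsets.ISO_8859_1);
        System.out.println(recover(fromTomcat)); // aałbb
    }
}
```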
> Improve URL decoding (followup of SOLR-4265)
> --------------------------------------------
>
> Key: SOLR-4283
> URL: https://issues.apache.org/jira/browse/SOLR-4283
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 4.1, 5.0
>
> Attachments: index.jsp, request.http, SOLR-4283.patch, SOLR-4283.patch
>
>
> Followup of SOLR-4265:
> SOLR-4265 has 2 problems:
> - it reads the whole InputStream into a String, which can be big. This
> wastes memory, especially when the query string from the POSTed form data is
> near the 2-megabyte limit. The String is then packed, in split form, into a
> big Map.
> - it does not report corrupt UTF-8
> The attached patch will do 2 things:
> - The decoding of the POSTed form data is done on the ServletInputStream,
> directly parsing the bytes (not chars). Key/value pairs are extracted and
> %-decoded to byte[] on the fly. URL parameters from getQueryString() are
> parsed with the same code, using a ByteArrayInputStream on the original
> String interpreted as UTF-8 (this is a hack, because the Servlet API does
> not give back the original bytes from the HTTP request). To be standards
> conformant, the query string should be interpreted as US-ASCII, but then
> UTF-8 from the HTTP request that is not fully escaped would not survive.
> - the byte[] key/value pairs are converted to Strings using a CharsetDecoder.
> This is memory-efficient and reports incorrectly escaped form data, so
> people will no longer complain about searches hitting no results and the like.
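The CharsetDecoder step described above can be sketched as follows (a minimal illustration of the idea, not the actual SOLR-4283 patch code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Decode bytes as UTF-8, throwing on corrupt input instead of silently
    // substituting replacement characters (the default String behavior).
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        System.out.println(decodeStrict(new byte[] {0x61, (byte) 0xC5, (byte) 0x82})); // ał
        try {
            decodeStrict(new byte[] {0x61, (byte) 0xC5}); // truncated UTF-8 sequence
        } catch (CharacterCodingException e) {
            System.out.println("corrupt UTF-8 reported");
        }
    }
}
```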