When I say "incomplete UTF-8 sequence" I mean that the bytes composing the
string are not valid according to the UTF-8 encoding rules. For instance, if
you have 0xCE (i.e. -50 as a signed Java byte) at the end of the sequence, it
is the lead byte of a two-byte character with no continuation byte after it,
so the sequence is incomplete and converting it to a String can generate
errors.
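To see the same thing without Jetty, here is a minimal JDK-only sketch (the
class and method names are mine, purely for illustration, and are not part of
Jetty). A strict CharsetDecoder rejects 'g' followed by a dangling 0xCE,
while new String(bytes, UTF_8) substitutes U+FFFD for it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class IncompleteUtf8Demo {
    // Strict decoding: malformed or truncated input is reported as an exception.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Lenient decoding: the String constructor replaces bad input with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'g' followed by a two-byte lead byte that has no continuation byte
        byte[] value = { 103, (byte) 0xCE };
        try {
            System.out.println("strict: " + decodeStrict(value));
        } catch (CharacterCodingException e) {
            System.out.println("strict: rejected (" + e + ")");
        }
        System.out.println("lenient ends with U+FFFD: "
                + decodeLenient(value).endsWith("\uFFFD"));
    }
}
```

This only illustrates the two decoding policies in the JDK; it makes no claim
about how Utf8StringBuilder is implemented internally.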
To reproduce the behavior in Jetty, run this test:

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.eclipse.jetty.util.MultiMap;
import org.eclipse.jetty.util.UrlEncoded;

public class TestUrlEncoded {

    public static void main(String[] args) throws Exception {
        // ab=cd&ef=g<0xCE>
        byte[] incompleteSequenceAtTheEnd = {97, 98, 61, 99, 100, 38, 101, 102,
                61, 103, -50};
        System.out.println("Incomplete sequence at the end:");
        fromByteArray(incompleteSequenceAtTheEnd);
        fromInputStream(new ByteArrayInputStream(incompleteSequenceAtTheEnd));

        // ef=g<0xCE>&ab=cd
        byte[] incompleteSequenceInTheMiddle = {101, 102, 61, 103, -50, 38, 97, 98,
                61, 99, 100};
        System.out.println("\n\nIncomplete sequence in the middle 1:");
        fromByteArray(incompleteSequenceInTheMiddle);
        fromInputStream(new ByteArrayInputStream(incompleteSequenceInTheMiddle));

        // e<0xCE>=fg&ab=cd
        byte[] incompleteSequenceInTheMiddle2 = {101, -50, 61, 102, 103, 38, 97,
                98, 61, 99, 100};
        System.out.println("\n\nIncomplete sequence in the middle 2:");
        fromByteArray(incompleteSequenceInTheMiddle2);
        fromInputStream(new ByteArrayInputStream(incompleteSequenceInTheMiddle2));

        // e=<0xCE>f&ab=cd
        byte[] incompleteSequenceInTheMiddle3 = {101, 61, -50, 102, 38, 97, 98, 61,
                99, 100};
        System.out.println("\n\nIncomplete sequence in the middle 3:");
        fromByteArray(incompleteSequenceInTheMiddle3);
        fromInputStream(new ByteArrayInputStream(incompleteSequenceInTheMiddle3));
    }

    // Byte-array overload: the one used for query string parameters.
    static void fromByteArray(byte[] b) {
        try {
            MultiMap<String> values = new MultiMap<>();
            UrlEncoded.decodeUtf8To(b, 0, b.length, values);
            System.out.println(values);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // InputStream overload: the one used for form-urlencoded request bodies.
    static void fromInputStream(InputStream is) {
        try {
            MultiMap<String> values = new MultiMap<>();
            UrlEncoded.decodeUtf8To(is, values, 1000000, 10000000);
            System.out.println(values);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I think the main issue is with the first test case, where the two
functions behave completely differently. But I'd also double-check the
other test cases, in particular the second one, because the valid pair
'ab=cd' is not parsed correctly due to errors in the preceding pair.
As I said, IMHO this scenario could be completely solved by replacing all
calls to Utf8StringBuilder.toString with
Utf8StringBuilder.toReplacedString (which has been partially done in the
UrlEncoded.decodeUtf8To overload that takes a byte array).
Note also that, since the UrlEncoded functions are all static, there's no
easy way to patch this behavior. Currently I have to create a Request wrapper
that is used for each incoming request and that calls a patched version of
decodeUtf8To when parsing parameters from the POST body and/or query string.
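
For illustration only, the kind of lenient parsing I have in mind could be
sketched with plain JDK classes as below. This is not my actual patch and not
Jetty's code: LenientFormDecoder, decode, drainLenient and add are
hypothetical names, and the sketch assumes that letting
new String(bytes, UTF_8) substitute U+FFFD is an acceptable stand-in for
toReplacedString. Every pair is kept, and invalid bytes are replaced instead
of the token being dropped:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LenientFormDecoder {
    // Decodes "k=v&k2=v2" bytes; malformed UTF-8 becomes U+FFFD, no pair is dropped.
    static Map<String, List<String>> decode(byte[] raw) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        ByteArrayOutputStream token = new ByteArrayOutputStream();
        String key = null;
        for (int i = 0; i <= raw.length; i++) {
            // Treat end of input as a final '&' so the last pair is flushed too.
            int b = (i == raw.length) ? '&' : (raw[i] & 0xFF);
            if (b == '&') {
                String text = drainLenient(token);
                if (key != null) {
                    add(out, key, text);
                } else if (!text.isEmpty()) {
                    add(out, text, ""); // key without '='
                }
                key = null;
            } else if (b == '=' && key == null) {
                key = drainLenient(token);
            } else if (b == '+') {
                token.write(' ');
            } else if (b == '%' && i + 2 < raw.length
                    && Character.digit(raw[i + 1], 16) >= 0
                    && Character.digit(raw[i + 2], 16) >= 0) {
                // Percent-escape: append the raw byte it encodes.
                token.write(Character.digit(raw[i + 1], 16) * 16
                        + Character.digit(raw[i + 2], 16));
                i += 2;
            } else {
                token.write(b);
            }
        }
        return out;
    }

    // Converts the buffered bytes to a String (bad bytes become U+FFFD) and resets the buffer.
    static String drainLenient(ByteArrayOutputStream token) {
        String s = new String(token.toByteArray(), StandardCharsets.UTF_8);
        token.reset();
        return s;
    }

    static void add(Map<String, List<String>> out, String key, String value) {
        out.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        // ef=g<0xCE>&ab=cd
        byte[] middle = {101, 102, 61, 103, -50, 38, 97, 98, 61, 99, 100};
        System.out.println(decode(middle));
    }
}
```

With the second test array above (ef=g<0xCE>&ab=cd), this sketch keeps both
keys: 'ef' maps to 'g' followed by U+FFFD, and 'ab' maps to 'cd'.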
-- Ugo
On Sat, Mar 22, 2014 at 12:49 AM, Joakim Erdfelt <[email protected]> wrote:
> Can you provide some examples?
>
> What precise bytes / values do you consider an incomplete UTF8 sequence?
> Please include an example of what you consider an incomplete UTF8 sequence
> "in the middle" and another example as to the problem at the last part of
> the sequence.
>
>
> --
> Joakim Erdfelt <[email protected]>
> webtide.com <http://www.webtide.com/> - intalio.com/jetty
> Expert advice, services and support from the Jetty & CometD experts
> eclipse.org/jetty - cometd.org
>
>
> On Fri, Mar 21, 2014 at 3:44 PM, Ugo Scaiella <[email protected]> wrote:
>
>> I don't understand the behavior of the
>> org.eclipse.jetty.util.UrlEncoded.decodeUtf8To methods. Maybe I'm missing
>> some points, but IMHO there are several inconsistent behaviors when
>> request data is not correctly encoded. I'm currently using v9.1.0 (but I
>> cannot see any change in the latest v9.1.3) and I'm using UTF-8 as the
>> charset for decoding request data.
>>
>> The strange behaviors I noticed are:
>>
>> A) when parsing query string parameters
>> A.1) if the last value of the query string is an incomplete UTF8
>> sequence, the value is added to the map by replacing the last character
>> with Utf8Appendable.REPLACEMENT (in my opinion this is the correct behavior)
>> A.2) if a token (i.e. a value or a key) in the middle of the query string
>> is an incomplete UTF8 sequence, that token is completely ignored and will
>> never be added to the map. You'll just get a warn-level log message.
>>
>> B) when parsing a form-urlencoded body of a POST or PUT request
>> B.1) if the last value of post data is an incomplete UTF8 sequence,
>> a Utf8Appendable.NotUtf8Exception exception is raised and it bubbles up to,
>> for instance, Request.getParameter(). And that is a RuntimeException...
>> B.2) if a token (ie a value or a key) in the middle of the body is an
>> incomplete UTF8 sequence, that token is ignored, just like point (A.2)
>> above.
>>
>> I think that there are several issues in the two overloaded methods
>> org.eclipse.jetty.util.UrlEncoded.decodeUtf8To.
>> We have two overloaded decodeUtf8To methods in the UrlEncoded class: the
>> first one accepts a byte array as its first parameter, while the second
>> takes an InputStream. The first one is used in scenario (A) and the
>> second one in scenario (B).
>>
>> Both of them use a Utf8StringBuilder to temporarily store the token
>> currently being parsed. But when the token is converted into a String we
>> always call buffer.toString(), which can throw that exception if the bytes
>> are not a valid UTF8 sequence.
>> In (A.2) and (B.2), that call is inside a try-catch, but the catch block
>> does nothing, so the buffer is not reset and the value is not added to the
>> map.
>> In (B.1), the call to toString() is outside the try-catch, so the
>> exception bubbles up.
>> Scenario (A.1) is fine, because in that case (and only there) we use
>> buffer.toReplacedString() that has a much safer behavior: if the last
>> character is not a valid UTF8 sequence, the Utf8Appendable.REPLACEMENT is
>> appended, the exception is logged (but not thrown) and the resulting string
>> is returned.
>>
>> IMHO, this is the correct behavior, so in the
>> org.eclipse.jetty.util.UrlEncoded.decodeUtf8To methods, we should replace
>> Utf8StringBuilder.toString calls with Utf8StringBuilder.toReplacedString.
>> Or am I missing something?
>>
>> -- Ugo
>>
>> _______________________________________________
>> jetty-users mailing list
>> [email protected]
>> https://dev.eclipse.org/mailman/listinfo/jetty-users
>>
>>
--
-- Ugo