Hi Paul,

I'm back from vacation.

You're right, but such an error is also expected. The original design never tried to outdo java.net.URL: if a system ID cannot be parsed by URL, it should result in an exception.

The patch that supplied the extra encoding was provided to both Sun and Apache, and applied to Sun sources. However, it never went into the Apache code base (refer to https://issues.apache.org/jira/browse/XERCESJ-1156). I thought of removing the patch, bringing our source in sync with that of Apache. But then I feared that we might get a regression since the patch has been in the source for so many years.

Hence this ugly solution (removing the patch would be prettier): leave the old change as is, but use java.net.URL in all other cases.

By the way, we can only consider this one for 7u8 now.

Thanks,
Joe


On 6/26/2012 11:51 PM, Paul Sandoz wrote:
Hi,

On Jun 26, 2012, at 11:59 PM, Joe Wang wrote:

Hi Paul,

That method was contributed by engineers from Korea and was intended to handle 
paths that contain international characters, which is how it got its name.  
It was an extra processing step. Outside of that scenario, we'd want to skip 
it and go back to letting URL handle the input, whether the system ID 
contains a space, '[', etc.

Your fix will fail if there is an IPv6 encoded address for the host part and 
there are non-ASCII characters present in, for example, the path part.

If the intent is to *never* percent encode ASCII characters you should change 
the following (and JavaDoc) to be consistent:

        // for each byte
        for (i = 0; i < len; i++) {
            b = bytes[i];
            // for non-ascii character: make it positive, then escape
            if (b < 0) {
                ch = b + 256;
                buffer.append('%');
                buffer.append(gHexChs[ch >> 4]);
                buffer.append(gHexChs[ch & 0xf]);
            }
            else if (b != '%' && b != '#' && gNeedEscaping[b]) {  // <--- remove this block
                buffer.append('%');
                buffer.append(gAfterEscaping1[b]);
                buffer.append(gAfterEscaping2[b]);
            }
            else {
                buffer.append((char)b);
            }
        }
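
For illustration only, here is a minimal standalone sketch of what the loop might look like with that block removed, so that only non-ASCII bytes are percent-encoded. The class name NonAsciiEscaper, the method name escapeNonAsciiOnly, the HEX table, and the choice of UTF-8 are assumptions made for this sketch; it is not the actual Xerces/JAXP code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the JDK's escapeNonUSAscii.
class NonAsciiEscaper {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Percent-encodes only the UTF-8 bytes of non-ASCII characters.
    // Every ASCII character, including '[', ']' and ' ', is copied unchanged.
    static String escapeNonAsciiOnly(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // UTF-8 is an assumption
        StringBuilder buffer = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            if (b < 0) {
                // non-ASCII byte: make it positive, then escape
                int ch = b + 256;
                buffer.append('%');
                buffer.append(HEX[ch >> 4]);
                buffer.append(HEX[ch & 0xf]);
            } else {
                buffer.append((char) b);
            }
        }
        return buffer.toString();
    }
}
```

With this variant, the brackets of an IPv6 literal survive even when the path contains non-ASCII characters, at the cost of no longer escaping spaces or other ASCII characters.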


Thankfully java.net.URL is much more forgiving (a polite way of saying buggy!) 
than java.net.URI and accepts unencoded reserved ASCII characters as part of 
the URI encoded characters.
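
A small sketch of that difference in strictness, using an unencoded space in the path as the test case. The class and method names and the example.com address are made up for illustration; only the java.net.URL and java.net.URI constructors are real API.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical demo class comparing how the two parsers treat the same string.
@SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs
class UrlVersusUri {
    // True if java.net.URL parses the string without complaint.
    static boolean urlAccepts(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // True if java.net.URI parses the string without complaint.
    static boolean uriAccepts(String spec) {
        try {
            new URI(spec);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String spec = "http://example.com/my file.xml"; // unencoded space
        System.out.println("URL accepts: " + urlAccepts(spec));
        System.out.println("URI accepts: " + uriAccepts(spec));
    }
}
```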

Something does not smell right here. Arguably the system ID should be a 
correctly encoded URI to begin with; otherwise an error should result.

Paul.

-Joe

On 6/25/2012 9:13 AM, Paul Sandoz wrote:
Hi Joe,

What happens if there is a space character or other characters in the string 
that should be encoded?

   http://greenbytes.de/tech/webdav/rfc2396.html#rfc.section.2.4.3

I suspect "escapeNonUSAscii" is slightly misleading; it should really be called something 
like "escapeCharactersInUriString".

Note that '[' and ']' are not valid URI characters outside of an IPv6 encoded 
address.

Paul.

On Jun 23, 2012, at 1:09 AM, Joe Wang wrote:

Hi,

This is a patch to fix the IPv6 issue.

In a previous patch to fix an issue with system IDs containing international 
characters, an extra character-escaping step was added so that any URL passed to the 
parser goes through the method escapeNonUSAscii before it is used to construct a 
URL.

However, literal IPv6 addresses are enclosed in square brackets. Once escapeNonUSAscii has 
encoded the brackets, URL no longer recognizes the address and throws a java.net.MalformedURLException.  
An address such as http://[fe80::la03:73ff:fead:f7b0]/note.xml is encoded as 
http://%5Bfe80::la03:73ff:fead:f7b0%5D/note.xml, resulting in 
java.net.MalformedURLException: For input string: ":la03:73ff:fead:f7b0%5D".
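
The failure can be reproduced with a short sketch along these lines. The class name Ipv6UrlDemo, the helper hostOrError, and the sample link-local address (chosen to be a syntactically valid IPv6 literal) are assumptions for the example.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo: a bracketed IPv6 literal parses, the escaped form does not.
@SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs
class Ipv6UrlDemo {
    // Returns the parsed host, or a description of the parse failure.
    static String hostOrError(String spec) {
        try {
            return new URL(spec).getHost();
        } catch (MalformedURLException e) {
            return "MalformedURLException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Brackets intact: URL recognizes the IPv6 literal.
        System.out.println(hostOrError("http://[fe80::1a03:73ff:fead:f7b0]/note.xml"));
        // Brackets escaped as %5B and %5D: the first ':' is taken as the port
        // separator and the rest fails to parse as a port number.
        System.out.println(hostOrError("http://%5Bfe80::1a03:73ff:fead:f7b0%5D/note.xml"));
    }
}
```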

This patch skips the encoding and returns the system ID as is if it contains no 
non-ASCII characters.
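
A rough sketch of that check, not the code in the webrev: the names AsciiShortCircuit, isAllAscii and escapeIfNeeded are invented here, and the existing escaping routine is represented by a function parameter rather than reproduced.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch of "skip the escaping when the input is pure ASCII".
class AsciiShortCircuit {
    // True if every char in the string is in the US-ASCII range (0..127).
    static boolean isAllAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 128) {
                return false;
            }
        }
        return true;
    }

    // Returns the system ID untouched when it is pure ASCII; otherwise hands
    // it to the supplied escaper (a stand-in for the existing routine).
    static String escapeIfNeeded(String systemId, UnaryOperator<String> escaper) {
        return isAllAscii(systemId) ? systemId : escaper.apply(systemId);
    }
}
```

Under this scheme a pure-ASCII IPv6 address reaches URL with its brackets intact, while any input that contains a non-ASCII character still goes through the full escaping.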

webrev: http://cr.openjdk.java.net/~joehw/7u6/7166896/webrev/

Please review.

Thanks,
Joe
