Hi Paul,

thanks for the code! Looks good. A few minor things:

- "public static class"?
- In the comment, it should be http://www.apps.ietf.org/rfc/rfc3987.html, not http://www.apps.ietf.org/rfc/rfc3986.html :-)
- I think Integer.toHexString(0x00FF & (int)b) may generate one-character codes for byte values below 0x10, e.g. "%9" instead of "%09".
- If you're extremely performance-conscious, you could avoid creating a few temporary objects, and use a decision table for many characters, let's say < 0x10000.
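To illustrate the toHexString point: Integer.toHexString(9) returns "9", so a tab would be written as "%9". Below is a minimal sketch of one way to always emit two hex digits, using a small lookup array. The class name PercentEncodeSketch and the HEX table are made up for this example and are not part of the code under review.

```java
import java.nio.charset.StandardCharsets;

class PercentEncodeSketch {
    // Hypothetical helper table: hex digit for each 4-bit value.
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Appends the UTF-8 bytes of one code point as %XX, always two hex digits per byte.
    static void percentEncode(StringBuilder out, int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            int v = b & 0xFF;               // treat the byte as unsigned
            out.append('%')
               .append(HEX[v >>> 4])        // high nibble
               .append(HEX[v & 0x0F]);      // low nibble
        }
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        percentEncode(out, 0x09);    // tab: one byte below 0x10
        percentEncode(out, 0x20AC);  // euro sign: three UTF-8 bytes
        System.out.println(out);
    }
}
```

Using StandardCharsets.UTF_8 (Java 7 and later) also removes the need to catch UnsupportedEncodingException.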
I didn't check the codes >= 0xA0, I trust you copied them correctly. ;-)

As for the special ASCII chars, yes, those are the ones that are allowed by RFC 3986. Its predecessor http://www.apps.ietf.org/rfc/rfc2396.html#sec-2.4.3 explains nicely why the others are forbidden.

I compiled a blacklist below. I added 'ok' to those that IMHO clearly must be escaped. For some, I don't really see a reason why they are forbidden (I wrote 'don't know' below), but they are pretty rare and we certainly should escape them. For others, I still don't see a reason, but MediaWiki doesn't use them so we shouldn't either [2] (I wrote 'ok for wikipedia'). We shouldn't even generate any http://dbpedia.org/resource/... URIs containing any of these characters, escaped or not, subject or object. (They may occur in http://dbpedia.org/property/... URIs though, where we must escape them.)

The one character that we should not escape is the slash. MediaWiki actually unescapes it when it is used in an internal link. [1] But that's an implementation detail. It may be cleaner to have a generic IRIEscaper and afterwards just do uri.replace("%2F", "/").

forbidden characters:

"  - don't know
#  - ok
%  - ok
/  - not ok
<  - ok
>  - ok
?  - ok
[  - ok for wikipedia
\  - don't know
]  - ok for wikipedia
^  - don't know
`  - don't know
{  - ok for wikipedia
|  - ok for wikipedia
}  - ok for wikipedia

Regards,
JC

[1] http://en.wikipedia.org/wiki/User:Chrisahn/Sandbox
[2] http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)#Forbidden_characters

On Tue, Mar 6, 2012 at 02:09, Paul A. Houle <[email protected]> wrote:
> On 3/5/2012 7:14 PM, Jona Christopher Sahnwaldt wrote:
>> Dear all,
>>
>> I just checked a few specs to figure out what would be the best policy
>> for DBpedia regarding URI encoding.
>>
>> In summary, I think DBpedia should encode as few characters as
>> possible, e.g. use '&', not '%26'.
>
> I came up with the following encoding function for the path
> component of a URI based on a close reading of RFC 2397. Any disagreements?
>
> public static class IRIEscaper {
>     StringBuffer out;
>
>     public String escape(String key){
>         out=new StringBuffer();
>         final int length = key.length();
>         for (int offset = 0; offset < length; ) {
>             final int codepoint = key.codePointAt(offset);
>             transformChar(codepoint);
>             offset += Character.charCount(codepoint);
>         }
>
>         return out.toString();
>     }
>
>     private void transformChar(int cp) {
>         char[] rawChars=Character.toChars(cp);
>         if(acceptChar(rawChars,cp)) {
>             out.append(Character.toChars(cp));
>         } else {
>             percentEncode(rawChars);
>         }
>     }
>
>     private void percentEncode(char[] rawChars) {
>         try {
>             byte[] bytes=new String(rawChars).getBytes("UTF-8");
>             for(byte b:bytes) {
>                 out.append('%');
>                 out.append(Integer.toHexString(0x00FF & (int) b).toUpperCase());
>             }
>         } catch(UnsupportedEncodingException ex) {
>             throw new RuntimeException(ex);
>         }
>     }
>
>     //
>     // this code should implement the 'ipchar' production from
>     //
>     // http://www.apps.ietf.org/rfc/rfc3986.html
>     //
>
>     private boolean acceptChar(char[] chars,int cp) {
>         if(chars.length==1) {
>             char c=chars[0];
>             if(Character.isLetterOrDigit(c))
>                 return true;
>
>             if(c=='-' || c=='.' || c=='_' || c=='~')
>                 return true;
>
>             if(c=='!' || c=='$' || c=='&' || c=='\'' || c=='(' || c==')'
>                 || c=='*' || c=='+' || c==',' || c==';' || c=='='
>                 || c== ':' || c=='@')
>                 return true;
>
>             if (cp<0xA0)
>                 return false;
>         }
>
>         if(cp>=0xA0 && cp<=0xD7FF)
>             return true;
>
>         if(cp>=0xF900 && cp<=0xFDCF)
>             return true;
>
>         if(cp>=0xFDF0 && cp<=0xFFEF)
>             return true;
>
>         if (cp>=0x10000 && cp<=0x1FFFD)
>             return true;
>
>         if (cp>=0x20000 && cp<=0x2FFFD)
>             return true;
>
>         if (cp>=0x30000 && cp<=0x3FFFD)
>             return true;
>
>         if (cp>=0x40000 && cp<=0x4FFFD)
>             return true;
>
>         if (cp>=0x50000 && cp<=0x5FFFD)
>             return true;
>
>         if (cp>=0x60000 && cp<=0x6FFFD)
>             return true;
>
>         if (cp>=0x70000 && cp<=0x7FFFD)
>             return true;
>
>         if (cp>=0x80000 && cp<=0x8FFFD)
>             return true;
>
>         if (cp>=0x90000 && cp<=0x9FFFD)
>             return true;
>
>         if (cp>=0xA0000 && cp<=0xAFFFD)
>             return true;
>
>         if (cp>=0xB0000 && cp<=0xBFFFD)
>             return true;
>
>         if (cp>=0xC0000 && cp<=0xCFFFD)
>             return true;
>
>         if (cp>=0xD0000 && cp<=0xDFFFD)
>             return true;
>
>         if (cp>=0xE1000 && cp<=0xEFFFD)
>             return true;
>
>         return false;
>     }
> }
>
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
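For illustration, the two-step idea from the reply (a generic escaper, followed by uri.replace("%2F", "/")) combined with a decision table could look roughly like the sketch below. It is only a sketch under assumptions: the class and method names are hypothetical, the table covers ASCII only rather than everything below 0x10000, ASCII letters and digits are listed explicitly instead of calling Character.isLetterOrDigit, and the non-ASCII ranges are the ones from the quoted code.

```java
import java.nio.charset.StandardCharsets;

class SlashKeepingEscaperSketch {
    private static final String HEX = "0123456789ABCDEF";

    // Decision table for ASCII: true means "copy unchanged".
    private static final boolean[] ASCII_OK = new boolean[0x80];
    static {
        for (char c = 'a'; c <= 'z'; c++) ASCII_OK[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) ASCII_OK[c] = true;
        for (char c = '0'; c <= '9'; c++) ASCII_OK[c] = true;
        for (char c : "-._~!$&'()*+,;=:@".toCharArray()) ASCII_OK[c] = true;
    }

    // Non-ASCII ranges as listed in the quoted acceptChar method.
    static boolean isAllowedNonAscii(int cp) {
        if (cp >= 0xA0 && cp <= 0xD7FF) return true;
        if (cp >= 0xF900 && cp <= 0xFDCF) return true;
        if (cp >= 0xFDF0 && cp <= 0xFFEF) return true;
        if (cp >= 0xE1000 && cp <= 0xEFFFD) return true;
        // Planes 1 to 13: everything except the last two code points of each plane.
        return cp >= 0x10000 && cp <= 0xDFFFD && (cp & 0xFFFF) <= 0xFFFD;
    }

    // Step 1: generic escaping; the slash is escaped like any other forbidden character.
    static String genericEscape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            boolean ok = cp < 0x80 ? ASCII_OK[cp] : isAllowedNonAscii(cp);
            if (ok) {
                out.appendCodePoint(cp);
            } else {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    int v = b & 0xFF;
                    out.append('%').append(HEX.charAt(v >>> 4)).append(HEX.charAt(v & 0x0F));
                }
            }
        }
        return out.toString();
    }

    // Step 2: restore slashes. Because '%' itself becomes "%25", every "%2F"
    // in the escaped string comes from a literal slash in the input.
    static String escapeKeepingSlash(String s) {
        return genericEscape(s).replace("%2F", "/");
    }

    public static void main(String[] args) {
        System.out.println(escapeKeepingSlash("AC/DC"));
        System.out.println(escapeKeepingSlash("a\"b [c] 100%"));
    }
}
```

Keeping the slash rule outside the generic escaper means the escaper itself stays usable wherever slashes must be escaped.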
