Hi Paul,

thanks for the code! Looks good. A few minor things:
- "public static class"?
- In the comment, it should be
http://www.apps.ietf.org/rfc/rfc3987.html ,
not http://www.apps.ietf.org/rfc/rfc3986.html :-)
- I think Integer.toHexString(0x00FF & (int)b)
may generate one-character codes for bytes
below 0x10 (e.g. "%A" instead of "%0A").
- If you're extremely performance-conscious,
you could avoid creating a few temporary
objects, and use a decision table for many
characters, let's say < 0x10000.
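
To illustrate the one-character problem from the list above, here is a minimal sketch of a percent-encoder that always emits two hex digits per byte. The class name PercentEncodeSketch and the HEX lookup array are made up for this example and are not part of Paul's code. It also uses StringBuilder and StandardCharsets (Java 7) to avoid the checked exception and a few temporary objects.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original IRIEscaper.
final class PercentEncodeSketch {
    // Lookup table for hex digits, so no temporary Strings are created per byte.
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Appends '%' plus exactly two uppercase hex digits for each UTF-8 byte of s.
    static void percentEncode(String s, StringBuilder out) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            out.append('%');
            out.append(HEX[(b >> 4) & 0x0F]); // high nibble
            out.append(HEX[b & 0x0F]);        // low nibble
        }
    }

    static String percentEncode(String s) {
        StringBuilder out = new StringBuilder();
        percentEncode(s, out);
        return out.toString();
    }

    public static void main(String[] args) {
        byte b = 0x0A;
        // The expression from the original code drops the leading zero:
        System.out.println(Integer.toHexString(0x00FF & (int) b).toUpperCase()); // prints "A"
        // The table-based version keeps it:
        System.out.println(percentEncode("\n"));     // prints "%0A"
        System.out.println(percentEncode("\u00E9")); // prints "%C3%A9"
    }
}
```

Padding the result of Integer.toHexString with a leading zero would work as well; the table just avoids the extra String objects.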

I didn't check the codes >= 0xA0, I trust you
copied them correctly. ;-)

As for the special ASCII chars, yes, those
are the ones that are allowed by RFC 3986.
Its predecessor
http://www.apps.ietf.org/rfc/rfc2396.html#sec-2.4.3
explains nicely why the others are forbidden.
I compiled a blacklist below. I added 'ok' to
those that IMHO clearly must be escaped.

For some, I don't really see a reason why
they are forbidden (I wrote 'don't know' below),
but they are pretty rare and we sure should
escape them.

For others, I still don't see a reason, but
MediaWiki doesn't use them so we shouldn't
either [2] (I wrote 'ok for wikipedia'). We shouldn't
even generate any http://dbpedia.org/resource/...
URIs containing any of these characters, escaped
or not, subject or object. (They may occur in
http://dbpedia.org/property/... URIs though,
where we must escape them.)

The one character that we should not escape
is the slash. MediaWiki actually unescapes it
when it is used in an internal link. [1] But that's
an implementation detail. It may be cleaner
to have a generic IRIEscaper and afterwards
just do uri.replace("%2F", "/").
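
A rough sketch of that two-step idea, under some assumptions: escapeGeneric below is only a stand-in for a real generic IRIEscaper (it escapes just the ASCII characters from the forbidden list in this mail and passes everything else through), and the names SlashRestoreSketch, escapeGeneric and escapeForResource are invented for the example.

```java
// Hypothetical sketch: generic escaping first, then restoring the slash.
final class SlashRestoreSketch {
    // ASCII characters from the "forbidden characters" list in this mail.
    private static final String FORBIDDEN = "\"#%/<>?[\\]^`{|}";

    // Stand-in for a generic escaper: percent-encodes only the characters
    // in FORBIDDEN and copies everything else unchanged.
    static String escapeGeneric(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (FORBIDDEN.indexOf(c) >= 0) {
                out.append(String.format("%%%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Generic escaping followed by the slash special case.
    static String escapeForResource(String title) {
        return escapeGeneric(title).replace("%2F", "/");
    }

    public static void main(String[] args) {
        System.out.println(escapeGeneric("AC/DC"));      // prints "AC%2FDC"
        System.out.println(escapeForResource("AC/DC"));  // prints "AC/DC"
        System.out.println(escapeForResource("100%2F")); // prints "100%252F"
    }
}
```

Because the escaper always turns '%' into "%25", a literal "%2F" in the input comes out as "%252F" and is not touched by the final replace, so only slashes that were really in the input are restored.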

forbidden characters:

" - don't know
# - ok
% - ok
/ - not ok
< - ok
> - ok
? - ok
[ - ok for wikipedia
\ - don't know
] - ok for wikipedia
^ - don't know
` - don't know
{ - ok for wikipedia
| - ok for wikipedia
} - ok for wikipedia
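
For the ASCII range, the list above could be turned into the kind of decision table mentioned earlier. This is only a sketch: the class and method names are hypothetical, the table covers just code points below 128 (it could be grown towards 0x10000 in the same way), and it treats the slash as allowed, as proposed above.

```java
// Hypothetical decision table for ASCII code points.
final class AsciiDecisionTableSketch {
    // ALLOWED[cp] is true if the ASCII code point may appear unescaped.
    private static final boolean[] ALLOWED = new boolean[128];

    static {
        for (char c = 'A'; c <= 'Z'; c++) ALLOWED[c] = true;
        for (char c = 'a'; c <= 'z'; c++) ALLOWED[c] = true;
        for (char c = '0'; c <= '9'; c++) ALLOWED[c] = true;
        // The punctuation accepted by acceptChar in the quoted code, plus '/'.
        for (char c : "-._~!$&'()*+,;=:@/".toCharArray()) ALLOWED[c] = true;
    }

    // Returns true only for allowed ASCII code points. Anything outside
    // ASCII returns false here and still needs the range checks.
    static boolean isAllowedAscii(int cp) {
        return cp >= 0 && cp < 128 && ALLOWED[cp];
    }

    public static void main(String[] args) {
        // Every character from the forbidden list except '/' should print false.
        for (char c : "\"#%/<>?[\\]^`{|}".toCharArray()) {
            System.out.println(c + " -> " + isAllowedAscii(c));
        }
    }
}
```

A single array lookup replaces the chain of comparisons for the common case; whether that matters in practice would have to be measured.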


Regards,
JC

[1] http://en.wikipedia.org/wiki/User:Chrisahn/Sandbox
[2] http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)#Forbidden_characters


On Tue, Mar 6, 2012 at 02:09, Paul A. Houle <[email protected]> wrote:
>  On 3/5/2012 7:14 PM, Jona Christopher Sahnwaldt wrote:
>> Dear all,
>>
>> I just checked a few specs to figure out what would be the best policy
>> for DBpedia regarding URI encoding.
>>
>> In summary, I think DBpedia should encode as few characters as
>> possible, e.g. use '&', not '%26'.
>
>     I came up with the following encoding function for the path
> component of a URI based on a close reading of RFC 2397.  Any disagreements?
>
>     public static class IRIEscaper {
>         StringBuffer out;
>
>         public String escape(String key){
>             out=new StringBuffer();
>             final int length = key.length();
>             for (int offset = 0; offset < length; ) {
>                final int codepoint = key.codePointAt(offset);
>                transformChar(codepoint);
>                offset += Character.charCount(codepoint);
>             }
>
>             return out.toString();
>         }
>
>         private void transformChar(int cp) {
>             char[] rawChars=Character.toChars(cp);
>             if(acceptChar(rawChars,cp)) {
>                 out.append(Character.toChars(cp));
>             } else {
>                 percentEncode(rawChars);
>             }
>         }
>
>         private void percentEncode(char[] rawChars) {
>             try {
>                 byte[] bytes=new String(rawChars).getBytes("UTF-8");
>                 for(byte b:bytes) {
>                     out.append('%');
>                     out.append(Integer.toHexString(0x00FF & (int) b).toUpperCase());
>                 }
>             } catch(UnsupportedEncodingException ex) {
>                 throw new RuntimeException(ex);
>             }
>         }
>
>         //
>         // this code should implement the 'ipchar' production from
>         //
>         // http://www.apps.ietf.org/rfc/rfc3986.html
>         //
>
>         private boolean acceptChar(char[] chars,int cp) {
>             if(chars.length==1) {
>                 char c=chars[0];
>                 if(Character.isLetterOrDigit(c))
>                     return true;
>
>                 if(c=='-' || c=='.' || c=='_' || c=='~')
>                     return true;
>
>                 if(c=='!' || c=='$' || c=='&' || c=='\'' || c=='(' || c==')'
>                     || c=='*' || c=='+' || c==',' || c==';' || c=='='
>                     || c==':' || c=='@')
>                     return true;
>
>                 if (cp<0xA0)
>                     return false;
>             }
>
>             if(cp>=0xA0 && cp<=0xD7FF)
>                 return true;
>
>             if(cp>=0xF900 && cp<=0xFDCF)
>                 return true;
>
>             if(cp>=0xFDF0 && cp<=0xFFEF)
>                 return true;
>
>             if (cp>=0x10000 && cp<=0x1FFFD)
>                 return true;
>
>             if (cp>=0x20000 && cp<=0x2FFFD)
>                 return true;
>
>             if (cp>=0x30000 && cp<=0x3FFFD)
>                 return true;
>
>             if (cp>=0x40000 && cp<=0x4FFFD)
>                 return true;
>
>             if (cp>=0x50000 && cp<=0x5FFFD)
>                 return true;
>
>             if (cp>=0x60000 && cp<=0x6FFFD)
>                 return true;
>
>             if (cp>=0x70000 && cp<=0x7FFFD)
>                 return true;
>
>             if (cp>=0x80000 && cp<=0x8FFFD)
>                 return true;
>
>             if (cp>=0x90000 && cp<=0x9FFFD)
>                 return true;
>
>             if (cp>=0xA0000 && cp<=0xAFFFD)
>                 return true;
>
>             if (cp>=0xB0000 && cp<=0xBFFFD)
>                 return true;
>
>             if (cp>=0xC0000 && cp<=0xCFFFD)
>                 return true;
>
>             if (cp>=0xD0000 && cp<=0xDFFFD)
>                 return true;
>
>             if (cp>=0xE1000 && cp<=0xEFFFD)
>                 return true;
>
>             return false;
>         }
>     }
>
>
>
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
