Hi, we try to be as close as possible to the Wikipedia title encoding scheme. The previous %-encoding of comma and ampersand is a bug that will be corrected in the next release.
The current behavior is as follows:

- The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
- The special characters ".", "-", "*", "/", "&", ":", "_" and "," remain the same (some of them, including the comma, only in the upcoming release).
- The space character " " is converted into an underscore "_".
- All other characters are unsafe and are first converted into one or more bytes using UTF-8 encoding. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte.
- Furthermore, multiple underscores are collapsed into one.

The class org.dbpedia.extraction.util.WikiUtil.scala in the framework might also give pointers on how to deal with this issue.

Best,
Max

On Wed, Oct 13, 2010 at 11:29 PM, Paul Houle <[email protected]> wrote:
> I notice lines in the dbpedia dumps that look like
>
> <http://dbpedia.org/resource/Boston%2C_MA>
> <http://dbpedia.org/property/redirect> <http://dbpedia.org/resource/Boston>
> .
>
> Note the URL encoded %2C=",".
>
> Anyhow, if I go to
>
> http://dbpedia.org/page/Boston%2C_MA
>
> I see two redirects [one of which unescapes the comma] and ultimately
> end up at
>
> http://dbpedia.org/page/Boston
>
> If I go to Wikipedia
>
> http://wikipedia.org/page/Boston%2C_MA
>
> I get redirected to
>
> http://wikipedia.org/page/Boston,_MA
>
> which, oddly, displays the same content as "Boston" [rather than 301
> redirecting...]
>
> When I do
>
> curl -H "Accept: application/rdf+xml" http://dbpedia.org/data/Boston.xml
>
> I see stuff like
>
> <rdf:Description
> rdf:about="http://dbpedia.org/resource/Harvey_Mason%2C_Jr."><dbpedia-owl:birthPlace
> xmlns:dbpedia-owl="http://dbpedia.org/ontology/"
> rdf:resource="http://dbpedia.org/resource/Boston"/></rdf:Description>
>
> Now if I run the SPARQL query
>
> select ?Predicate where {<http://dbpedia.org/resource/Harvey_Mason,_Jr.>
> ?Predicate <http://dbpedia.org/resource/Boston> }
>
> I get nothing, but if I run
>
> select ?Predicate where {<http://dbpedia.org/resource/Harvey_Mason%2C_Jr.>
> ?Predicate <http://dbpedia.org/resource/Boston> }
>
> I get
>
> http://dbpedia.org/ontology/birthPlace
>
> So it looks like the %-encoded URI is the "real URI" in dbpedia.
> Obviously I ought to keep it around in case I want to run a SPARQL query now
> and then. Also, dbpedia encodes wikipedia this way as well,
>
> <http://en.wikipedia.org/wiki/Harvey_Mason%2C_Jr.>
> <http://xmlns.com/foaf/0.1/primaryTopic>
> <http://dbpedia.org/resource/Harvey_Mason%2C_Jr.> .
>
> ------
>
> I took a look at some standards docs and found:
>
> http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-URI-reference
>
> I see that we encode UTF-8 text as octets, and if the octets aren't
> US-ASCII characters, we %-encode them. However, the spec also says
> that
>
> "Note: Because of the risk of confusion between RDF URI references that
> would be equivalent if dereferenced, the use of %-escaped characters in RDF
> URI references is strongly discouraged."
>
> ------
>
> Now the problem I've got with the Ookaboo API is that I know people are
> going to punch in
>
> http://wikipedia.org/page/Boston,_MA
>
> and I need to turn this into the right dbpedia URL.
> My plan for dealing with this is to
>
> (i) store the exact URI I get out of dbpedia,
> (ii) always give people the exact URI out of dbpedia (if I publish RDFa or
> JSON data),
> (iii) give the same URI for wikipedia that dbpedia gives (in HTML, RDFa,
> etc.)
> (iv) if I get a query, apply the same canonicalization rules that dbpedia
> uses...
>
> Which begs the question of what exactly those rules are. What are they?

_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
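
P.S. For illustration, the encoding rules listed at the top of this message could be sketched in Python roughly as follows. This is not the code from WikiUtil.scala, only a reading of the rules as written. The names SAFE_CHARS, encode_title and resource_uri are made up for this example, and the use of upper-case hex digits is an assumption based on the "%2C" seen in the dumps.

```python
import re

# Characters that stay unchanged according to the rules in this message.
SAFE_CHARS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    ".-*/&:_,"
)


def encode_title(title: str) -> str:
    """Encode a page title following the rules listed in this message.

    Hypothetical helper, not the framework's actual implementation.
    """
    parts = []
    for ch in title:
        if ch == " ":
            # Spaces become underscores.
            parts.append("_")
        elif ch in SAFE_CHARS:
            parts.append(ch)
        else:
            # Everything else: UTF-8 bytes, each written as %XY.
            # Upper-case hex is an assumption, see the note above.
            parts.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    # Collapse runs of underscores into a single one.
    return re.sub(r"_+", "_", "".join(parts))


def resource_uri(title: str) -> str:
    """Build a resource URI from a title (hypothetical helper)."""
    return "http://dbpedia.org/resource/" + encode_title(title)
```

Under these rules the title "Boston, MA" comes out as Boston,_MA, which matches the upcoming behavior where the comma is no longer %-encoded. Data from the current release still uses Boston%2C_MA, so a lookup against it would need the old form.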
