Hi all,

As some of you may know, a Serbian version of DBpedia is currently in the works. Now, Serbian, unlike any other language in Europe, is digraphic in nature, officially supporting both the (Serbian) Cyrillic and (Gaj's) Latin alphabets. This is absolutely fine for storing information in any modern knowledge base, but it can often be a major obstacle for information retrieval.
For instance, most Serbs rely on the Latin alphabet for communication/interaction on the Web. That means a large portion of the information is written (and is often expected to be written) in Latin script, e.g. encoded in ISO 8859-2 (Latin-2). And yet 99% of the information in the Serbian Wikipedia dumps is written in Cyrillic script (ISO 15924 code Cyrl). So, unless your software performs romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice versa) on the fly at retrieval time (Wikipedia appears to be doing this), many attempts at information extraction will be doomed to fail. This directly affects common tasks such as keyword search, label-based SPARQL querying, named entity recognition, etc.

What I would like to do is improve some of the existing DBpedia extractors, or develop new ones, so that they take this problem into consideration and romanize the Wikipedia dumps, outputting information in *both* scripts. Now, I know storing the same information twice might not be the most elegant solution, but unless someone includes romanization/cyrillization features in the next version of SPARQL, I don't see a better solution at the moment. Of course, there is also the matter of perspective: one could argue that although the information is the same, the very fact that different character sequences are needed to describe the same piece of knowledge makes this problem fall into the domain of multilinguality.

So, the general idea is to use a single IRI per resource, but have two separate triples for any literal originally written in Cyrillic. For example:

    <http://sr.dbpedia.org/resource/Парсер> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
    <http://sr.dbpedia.org/resource/Парсер> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .

The above language tags are as per the IANA Language Subtag Registry [1], which lists them as redundant, though, so a plain "sr" tag could be enough for both.
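To make the idea concrete, here is a minimal Python sketch of the romanization step and the two-triple output described above. It is only an illustration under stated assumptions, not existing DBpedia extractor code: the names `CYR_TO_LAT`, `romanize`, and `label_triples` are hypothetical, the input literal is assumed to be written in Serbian Cyrillic, and the casing rule for the digraph letters (Љ, Њ, Џ) is a simple heuristic.

```python
# Serbian Cyrillic -> Gaj's Latin, lowercase letters only.
# Uppercase forms are derived in romanize() so that digraphs are cased sensibly.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def romanize(text: str) -> str:
    """Convert Serbian Cyrillic characters to Latin; leave everything else as-is."""
    out = []
    for i, ch in enumerate(text):
        low = ch.lower()
        if low not in CYR_TO_LAT:
            out.append(ch)  # digits, punctuation, Latin text, other scripts
            continue
        lat = CYR_TO_LAT[low]
        if ch != low:  # the source character was uppercase
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prv = text[i - 1] if i > 0 else ""
            # Heuristic: inside an all-caps run use "LJ", otherwise "Lj".
            in_caps_run = nxt.isupper() or (not nxt.isalpha() and prv.isupper())
            lat = lat.upper() if in_caps_run else lat.capitalize()
        out.append(lat)
    return "".join(out)


def _escape(literal: str) -> str:
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"')


def label_triples(subject_iri: str, cyrillic_label: str) -> list[str]:
    """Return two rdfs:label triples, one per script, for a single resource IRI."""
    latin_label = romanize(cyrillic_label)
    return [
        f'<{subject_iri}> <{RDFS_LABEL}> "{_escape(cyrillic_label)}"@sr-Cyrl .',
        f'<{subject_iri}> <{RDFS_LABEL}> "{_escape(latin_label)}"@sr-Latn .',
    ]


if __name__ == "__main__":
    for line in label_triples("http://sr.dbpedia.org/resource/Парсер", "Парсер"):
        print(line)
```

Only the Cyrillic-to-Latin direction is sketched here because it is a straightforward per-character mapping. The reverse direction is harder: a Latin "nj" is usually њ but can also be н followed by ј (as in "injekcija"), so cyrillization would likely need a word list or similar exceptions.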
I'm no DBpedia core expert, so any tips, ideas, directions, or other information that would help me get started would be much appreciated!

Best,
Uros

[1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
