Hi all,

As some of you may know, a Serbian version of DBpedia is currently in the
works. Serbian, unlike most other European languages, is digraphic: it
is officially written in both the (Serbian) Cyrillic and (Gaj's) Latin
alphabets. This is absolutely fine for storing information in any modern
knowledge base, but it can be a major obstacle for information retrieval.

For instance, most Serbs rely on the Latin alphabet for
communication/interaction on the Web. That means a large portion of the
information is written (and is often expected to be written) in Latin
script, traditionally encoded as ISO 8859-2 (i.e. Latin-2). And yet 99%
of the information in the Serbian Wikipedia dumps is written in Cyrillic
script (ISO 15924 code "Cyrl"). So, unless your software performs
romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice
versa) on the fly, at retrieval time (Wikipedia appears to be doing this),
many attempts at information extraction are doomed to fail. This directly
affects common tasks such as keyword search, label-based SPARQL querying,
named entity recognition, etc.
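
To make the romanization step concrete, here is a minimal Python sketch,
assuming the standard one-to-one mapping from Serbian Cyrillic to Gaj's
Latin alphabet. The names CYRILLIC, LATIN and romanize are made up for
illustration and are not part of the DBpedia extraction framework, and the
handling of all-caps digraphs (LJ, NJ, DŽ) is only a simple heuristic.

```python
# Serbian Cyrillic letters and their Gaj's Latin counterparts, in the same order.
CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
LATIN = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l",
         "lj", "m", "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h",
         "c", "č", "dž", "š"]
CYR_TO_LAT = dict(zip(CYRILLIC, LATIN))


def romanize(text: str) -> str:
    """Convert Serbian Cyrillic characters to Latin; leave everything else as is."""
    out = []
    for i, ch in enumerate(text):
        lower = ch.lower()
        lat = CYR_TO_LAT.get(lower)
        if lat is None:
            out.append(ch)   # not a Serbian Cyrillic letter: pass through
        elif ch == lower:
            out.append(lat)  # lowercase letter
        else:
            # Uppercase letter. Heuristic: write a digraph fully in capitals
            # ("LJ") when the neighbouring letter is also uppercase,
            # otherwise capitalise only its first character ("Lj").
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            all_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            out.append(lat.upper() if all_caps else lat.capitalize())
    return "".join(out)


print(romanize("Парсер"))
```

Going the other way (cyrillization) is harder to do with a plain lookup
table, because Latin digraphs are ambiguous: "nj" usually stands for a
single Cyrillic letter, but in a word like "injekcija" it stands for two.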

What I would like to do is improve some of the existing DBpedia
extractors, or develop new ones, so that they take this problem into
account and romanize the Wikipedia dumps, outputting information in
*both* scripts. I know that storing the same information twice might not
be the most elegant solution, but unless romanization/cyrillization
features are added to the next version of SPARQL, I don't see a better
one at the moment. Of course, there is also the matter of perspective:
one could argue that, although the information is the same, the very fact
that different character sequences are needed to describe the same piece
of knowledge places this problem in the domain of multilinguality.

So, the general idea is to use a single IRI per resource, but have two
separate triples for any literal originally written in Cyrillic. For
example:

<http://sr.dbpedia.org/resource/Парсер>
<http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
<http://sr.dbpedia.org/resource/Парсер>
<http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .

The language tags above follow the IANA Language Subtag Registry [1].
The registry lists both as redundant, though, so a plain "sr" tag might
be enough for both.
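
As a rough illustration of the "one IRI, two label triples" idea, here is
a small Python sketch that serializes both labels as N-Triples lines. The
function names (escape_literal, label_triples) are hypothetical and not
taken from the DBpedia code base, and the Latin label is simply passed in
by the caller, however it was produced.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def label_triples(iri: str, label_cyrl: str, label_latn: str) -> list:
    """Return two rdfs:label triples for one resource, one per script."""
    return [
        f'<{iri}> <{RDFS_LABEL}> "{escape_literal(label_cyrl)}"@sr-Cyrl .',
        f'<{iri}> <{RDFS_LABEL}> "{escape_literal(label_latn)}"@sr-Latn .',
    ]


for line in label_triples("http://sr.dbpedia.org/resource/Парсер",
                          "Парсер", "Parser"):
    print(line)
```

If a plain "sr" tag turns out to be preferable, only the two tag suffixes
in label_triples would need to change.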

I'm no DBpedia core expert, so some tips, ideas, directions or any other
information that would help me get started would be much appreciated!

Best,
Uros

[1]
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry




_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
