Hello Uros,
that's a really interesting problem :)
I am no expert either, but probably the best approach would be to
"duplicate" triples at the point where they are written (e.g. in the
destinations package), instead of modifying the extractors.
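
To make the idea a bit more concrete, here is a rough sketch of what
duplicating at write time could look like. Everything in it is an
assumption for illustration: DualScriptLabels and labelTriples are made-up
names, not classes from the extraction framework, and the romanizer is
passed in as a plain function so any transliteration tool can be plugged
in. Falling back to a plain "sr" tag when romanization changes nothing is
also just my guess at a sensible default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper for illustration; not a class from the DBpedia framework.
class DualScriptLabels {
    static final String RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label";

    // Minimal escaping: backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Format one rdfs:label statement as an N-Triples line.
    static String triple(String subject, String label, String langTag) {
        return "<" + subject + "> <" + RDFS_LABEL + "> \""
                + escape(label) + "\"@" + langTag + " .";
    }

    // One IRI per resource; two label triples when the literal can be romanized.
    static List<String> labelTriples(String subject, String label,
                                     UnaryOperator<String> romanize) {
        List<String> out = new ArrayList<>();
        String latin = romanize.apply(label);
        if (latin.equals(label)) {
            // Nothing to romanize: emit once with the plain "sr" tag (assumption).
            out.add(triple(subject, label, "sr"));
        } else {
            out.add(triple(subject, label, "sr-Cyrl"));
            out.add(triple(subject, latin, "sr-Latn"));
        }
        return out;
    }

    public static void main(String[] args) {
        // The lambda stands in for a real transliterator.
        for (String line : labelTriples("http://sr.dbpedia.org/resource/Парсер",
                                        "Парсер", s -> "Parser")) {
            System.out.println(line);
        }
    }
}
```

The extractors stay untouched this way; only the code that serializes
literals needs to know about the second script.
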
As for which tool to use, it looks like icu4j's Transliterator suits your
needs, e.g.

    Transliterator.getInstance("Serbian-Latin/BGN").transliterate(
        "Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")

results in

    Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova zvezda.
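
If pulling in icu4j turns out to be a problem, the mapping is small enough
to write by hand. The sketch below is a dependency-free stand-in, not
ICU's rule set: SerbianRomanizer is a made-up name, the table covers only
the 30 letters of the Serbian Cyrillic alphabet, and the handling of
upper-case digraphs (Lj vs. LJ) is a simple neighbour check that I am
assuming is good enough for labels.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical hand-rolled romanizer; a stand-in for ICU, not its rules.
class SerbianRomanizer {
    // The 30 lower-case Serbian Cyrillic letters and their Gaj's Latin forms.
    private static final String CYR = "абвгдђежзијклљмнњопрстћуфхцчџш";
    private static final String[] LAT = {
        "a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i",
        "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r",
        "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š"
    };
    private static final Map<Character, String> LOWER = new HashMap<>();
    static {
        for (int i = 0; i < CYR.length(); i++) {
            LOWER.put(CYR.charAt(i), LAT[i]);
        }
    }

    static String romanize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String lat = LOWER.get(Character.toLowerCase(c));
            if (lat == null) {
                out.append(c); // not Serbian Cyrillic: copy through unchanged
                continue;
            }
            if (Character.isUpperCase(c)) {
                // Inside an all-caps run a digraph becomes "LJ", otherwise "Lj".
                boolean nextUpper = i + 1 < text.length()
                        && Character.isUpperCase(text.charAt(i + 1));
                boolean prevUpper = i > 0
                        && Character.isUpperCase(text.charAt(i - 1));
                lat = (nextUpper || prevUpper)
                        ? lat.toUpperCase(Locale.ROOT)
                        : Character.toUpperCase(lat.charAt(0)) + lat.substring(1);
            }
            out.append(lat);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(romanize("Парсер")); // prints "Parser"
    }
}
```

Text that is already in Latin script, digits and punctuation pass through
untouched, so mixed strings like the example sentence above are safe to
feed in as they are.
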
What do you think?
Cheers
Andrea
2013/11/29 Uros Milosevic <[email protected]>
> Hi all,
>
> As some of you may know, a Serbian version of DBpedia is currently in the
> works. Now, Serbian, unlike any other language in Europe, is digraphic in
> nature, officially supporting both (Serbian) Cyrillic and (Gaj's) Latin
> alphabet. This is absolutely fine for storing information in any modern
> knowledge base, but can often be a major obstacle for information
> retrieval.
>
> For instance, most Serbs rely on the Latin alphabet for
> communication/interaction on the Web. That means a large portion of the
> information is (and, often, expected to be) encoded in ISO 8859-2 (i.e.
> Latin-2). And yet, 99% of the information in the Serbian Wikipedia dumps
> is written in the Cyrillic script (ISO 15924 code Cyrl). So, unless your
> software performs
> romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice
> versa) on-the-fly, at retrieval time (Wikipedia appears to be doing this),
> many attempts at information extraction will be doomed to fail. This
> directly affects common tasks such as keyword search, label-based SPARQL
> querying, named entity recognition, etc.
>
> What I would like to do is improve some of the existing DBpedia
> extractors, or develop new ones, that would take this problem into
> consideration and perform romanization of Wikipedia dumps so as to output
> information encoded in *both* scripts. Now, I know storing the same
> information twice might not be the most elegant solution, but unless
> someone is to include romanization/cyrillization features in the next
> version of SPARQL, I don't see a better solution at the moment. Of course,
> there is also the matter of perspective - one could argue that although
> the information is the same, the very fact that different character
> sequences are needed to describe the same piece of knowledge makes this
> problem fall into the domain of multilinguality.
>
> So, the general idea is to use a single IRI per resource, but have two
> separate triples for any literal originally written in Cyrillic. For
> example:
>
> <http://sr.dbpedia.org/resource/Парсер>
> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
> <http://sr.dbpedia.org/resource/Парсер>
> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>
> The above language tags are as per IANA Language Subtag Registry [1],
> which lists them as redundant, though, so a "sr" tag, instead, could be
> enough for both.
>
> I'm no DBpedia core expert, so some tips, ideas, directions or any other
> information that would help me get started would be much appreciated!
>
> Best,
> Uros
>
> [1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>
> _______________________________________________
> Dbpedia-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>