On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna <[email protected]> wrote:
> Hello Uros,
>
> that's a really interesting problem :)
> I am no expert either but probably the best approach would be to
> "duplicate" triples when they are going to be written (e.g. in the
> destinations package), instead of modifying the extractors.
>
I agree. I'd suggest we extend the UriPolicy [1] functionality to also
transform string objects, i.e. literals (it currently applies only to
URIs / IRIs), and use the configuration files to select the desired output [2].
Uros, do you want to give it a shot? You can always ask for help here ;)
[1]
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
[2]
https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
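
To make the "duplicate the literal at write time" idea concrete, here is a rough, self-contained sketch in plain Java. It is not the UriPolicy API and not icu4j: the class DigraphicLabels and the methods romanize and labelTriples are hypothetical names made up for illustration, and the hand-written table only covers the 30 letters of the Serbian Cyrillic alphabet. Literal escaping is left out, and falling back to a plain "sr" tag when nothing changes is just an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of the DBpedia extraction framework.
class DigraphicLabels {
    // Serbian Cyrillic lowercase letters and their Gaj's Latin equivalents.
    private static final String[][] PAIRS = {
        {"а", "a"}, {"б", "b"}, {"в", "v"}, {"г", "g"}, {"д", "d"}, {"ђ", "đ"},
        {"е", "e"}, {"ж", "ž"}, {"з", "z"}, {"и", "i"}, {"ј", "j"}, {"к", "k"},
        {"л", "l"}, {"љ", "lj"}, {"м", "m"}, {"н", "n"}, {"њ", "nj"}, {"о", "o"},
        {"п", "p"}, {"р", "r"}, {"с", "s"}, {"т", "t"}, {"ћ", "ć"}, {"у", "u"},
        {"ф", "f"}, {"х", "h"}, {"ц", "c"}, {"ч", "č"}, {"џ", "dž"}, {"ш", "š"}
    };

    private static final Map<Character, String> LATIN = new HashMap<>();
    static {
        for (String[] pair : PAIRS) {
            LATIN.put(pair[0].charAt(0), pair[1]);
        }
    }

    // Replace every Serbian Cyrillic letter; leave all other characters as they are.
    static String romanize(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String latin = LATIN.get(Character.toLowerCase(c));
            if (latin == null) {
                out.append(c); // not a Serbian Cyrillic letter
            } else if (Character.isUpperCase(c)) {
                // Capitalize only the first letter of a digraph (Љ -> Lj).
                out.append(Character.toUpperCase(latin.charAt(0)))
                   .append(latin.substring(1));
            } else {
                out.append(latin);
            }
        }
        return out.toString();
    }

    // Build the rdfs:label triple(s) for one resource as N-Triples-style lines.
    // Escaping of quotes and backslashes in the literal is omitted for brevity.
    static List<String> labelTriples(String subjectIri, String label) {
        String prefix = "<" + subjectIri + "> "
            + "<http://www.w3.org/2000/01/rdf-schema#label> \"";
        String latin = romanize(label);
        List<String> triples = new ArrayList<>();
        if (latin.equals(label)) {
            // Nothing to romanize: keep one triple with the plain "sr" tag (assumption).
            triples.add(prefix + label + "\"@sr .");
        } else {
            triples.add(prefix + label + "\"@sr-Cyrl .");
            triples.add(prefix + latin + "\"@sr-Latn .");
        }
        return triples;
    }

    public static void main(String[] args) {
        for (String t : labelTriples("http://sr.dbpedia.org/resource/Парсер", "Парсер")) {
            System.out.println(t);
        }
    }
}
```

In the framework itself, the same transformation would presumably hook in wherever the formatter writes literals, and a real implementation would more likely delegate to the icu4j Transliterator suggested below than maintain a table by hand.
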
> As for which tools to use, it looks like the icu4j Transliterator
> suits your needs, e.g.
>
> Transliterator.getInstance("Serbian-Latin/BGN").transliterate("Малакор 5
> (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>
> results in
>
> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova
> zvezda.
>
> What do you think?
>
> Cheers
> Andrea
>
>
> 2013/11/29 Uros Milosevic <[email protected]>
>
>> Hi all,
>>
>> As some of you may know, a Serbian version of DBpedia is currently in the
>> works. Now, Serbian, unlike any other language in Europe, is digraphic in
>> nature, officially supporting both (Serbian) Cyrillic and (Gaj's) Latin
>> alphabet. This is absolutely fine for storing information in any modern
>> knowledge base, but can often be a major obstacle for information
>> retrieval.
>>
>> For instance, most Serbs rely on the Latin alphabet for
>> communication/interaction on the Web. That means a large portion of the
>> information is (and, often, is expected to be) written in the Latin
>> script (e.g. encoded in ISO 8859-2, i.e. Latin-2). And, yet, 99% of the
>> information in the Serbian Wikipedia dumps is written in Cyrillic (ISO
>> 15924 script code Cyrl). So, unless your software performs
>> romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice
>> versa) on-the-fly, at retrieval time (Wikipedia appears to be doing this),
>> many attempts at information extraction will be doomed to fail. This
>> directly affects common tasks such as keyword search, label-based SPARQL
>> querying, named entity recognition, etc.
>>
>> What I would like to do is improve some of the existing DBpedia
>> extractors, or develop new ones, that would take this problem into
>> consideration and perform romanization of Wikipedia dumps so as to output
>> information encoded in *both* scripts. Now, I know storing the same
>> information twice might not be the most elegant solution, but unless
>> someone is to include romanization/cyrillization features in the next
>> version of SPARQL, I don't see a better solution at the moment. Of course,
>> there is also the matter of perspective - one could argue that although
>> the information is the same, the very fact that different character
>> sequences are needed to describe the same piece of knowledge makes this
>> problem fall into the domain of multilinguality.
>>
>> So, the general idea is to use a single IRI per resource, but have two
>> separate triples for any literal originally written in Cyrillic. For
>> example:
>>
>> <http://sr.dbpedia.org/resource/Парсер>
>> <http://www.w3.org/2000/01/rdf-schema#label>
>> "Парсер"@sr-Cyrl .
>> <http://sr.dbpedia.org/resource/Парсер>
>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>>
>> The above language tags are as per the IANA Language Subtag Registry [1],
>> which lists them as redundant, though, so a plain "sr" tag could be
>> enough for both.
>>
>> I'm no DBpedia core expert, so some tips, ideas, directions or any other
>> information that would help me get started would be much appreciated!
>>
>> Best,
>> Uros
>>
>> [1]
>>
>> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>>
>> ------------------------------------------------------------------------------
>> Rapidly troubleshoot problems before they affect your business. Most IT
>> organizations don't have a clear picture of how application performance
>> affects their revenue. With AppDynamics, you get 100% visibility into your
>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics
>> Pro!
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Dbpedia-developers mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>
>
>
--
Kontokostas Dimitris