Maybe something like:
script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
where, for one language, you specify a list of languageTag:transliterator
pairs separated by ';'?
The transliterator could be either "identity" (no transformation) or an
icu4j transliterator ID.
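
To make the idea concrete, here is a rough sketch of how such a property value could be parsed into an ordered map from language tag to transliterator ID. The class and method names (ScriptConfig, parse) are made up for illustration and are not part of the extraction framework, and the "script.sr" key is only the proposal above, not an existing option.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the DBpedia extraction framework.
final class ScriptConfig {

    // Parses a value such as "sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN"
    // into an insertion-ordered map: language tag -> transliterator ID.
    static Map<String, String> parse(String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : value.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing ';'
            }
            // Split on the first ':' only; the ID itself may contain '/'.
            int colon = trimmed.indexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                    "Expected languageTag:transliterator, got: " + trimmed);
            }
            String languageTag = trimmed.substring(0, colon).trim();
            String transliteratorId = trimmed.substring(colon + 1).trim();
            result.put(languageTag, transliteratorId);
        }
        return result;
    }
}
```

One caveat with this format: compound icu4j IDs are themselves separated by ';' and filters can contain ':', so the sketch only works for simple IDs like the one above. If we ever need compound IDs we would have to pick different separators.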
As Dimitris said, Uros, please feel free to ask if you need help!
Cheers
Andrea
2013/11/30 Dimitris Kontokostas
>
>
>
> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna wrote:
>
>> Hello Uros,
>>
>> that's a really interesting problem :)
>> I am no expert either but probably the best approach would be to
>> "duplicate" triples when they are going to be written (e.g. in the
>> destinations package), instead of modifying the extractors.
>>
>
> I agree, I'd suggest we extend the UriPolicy [1] functionality to also
> transform string literal objects (currently it only applies to URIs / IRIs)
> and use the configuration files to select the desired output [2].
> Uros, do you want to give it a shot? You can always ask for help here ;)
>
> [1]
> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
> [2]
> https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>
>
>> As for which tools to use, it looks like the icu4j Transliterator
>> suits your needs, e.g.
>>
>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate(
>>   "Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>>
>> results in
>>
>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova
>> zvezda.
>>
>> What do you think?
>>
>> Cheers
>> Andrea
>>
>>
>> 2013/11/29 Uros Milosevic
>>
>>> Hi all,
>>>
>>> As some of you may know, a Serbian version of DBpedia is currently in the
>>> works. Now, Serbian, unlike any other language in Europe, is digraphic in
>>> nature, officially supporting both (Serbian) Cyrillic and (Gaj's) Latin
>>> alphabet. This is absolutely fine for storing information in any modern
>>> knowledge base, but can often be a major obstacle for information
>>> retrieval.
>>>
>>> For instance, most Serbs rely on the Latin alphabet for
>>> communication/interaction on the Web. That means a large portion of the
>>> information is (and is often expected to be) written in the Latin script
>>> (e.g. encoded in ISO 8859-2, i.e. Latin-2). And yet, 99% of the
>>> information in the Serbian Wikipedia dumps is written in Cyrillic (ISO
>>> 15924 script code Cyrl). So, unless your software performs romanization
>>> (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice versa)
>>> on the fly, at retrieval time (Wikipedia appears to be doing this), many
>>> attempts at information extraction will be doomed to fail. This directly
>>> affects common tasks such as keyword search, label-based SPARQL querying,
>>> named entity recognition, etc.
>>>
>>> What I would like to do is improve some of the existing DBpedia
>>> extractors, or develop new ones, that would take this problem into
>>> consideration and perform romanization of Wikipedia dumps so as to output
>>> information encoded in *both* scripts. Now, I know storing the same
>>> information twice might not be the most elegant solution, but unless
>>> someone is to include romanization/cyrillization features in the next
>>> version of SPARQL, I don't see a better solution at the moment. Of course,
>>> there is also the matter of perspective - one could argue that although
>>> the information is the same, the very fact that different character
>>> sequences are needed to describe the same piece of knowledge makes this
>>> problem fall into the domain of multilinguality.
>>>
>>> So, the general idea is to use a single IRI per resource, but have two
>>> separate triples for any literal originally written in Cyrillic. For
>>> example:
>>>
>>> <http://sr.dbpedia.org/resource/Парсер>
>>> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
>>> <http://sr.dbpedia.org/resource/Парсер>
>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>>>
>>> The above language tags are as per IANA Language Subtag Registry [1],
>>> which lists them as redundant, though, so a "sr" tag, instead, could be
>>> enough for both.
>>>
>>> I'm no DBpedia core expert, so some tips, ideas, directions or any other
>>> information that would help me get started would be much appreciated!
>>>
>>> Best,
>>> Uros
>>>
>>> [1]
>>>
>>> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>>>
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Rapidly troubleshoot problems before they affect your business. Most IT
>>> organizations don't have a clear picture of how application performance
>>> affects their revenue. With AppDynamics, you get 100% visibility into
>>> your
>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>>> AppDynamics Pro!
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Dbpedia-developers mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>>
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>