Re: [Dbpedia-developers] Support for digraphia

Uros Milosevic Mon, 02 Dec 2013 07:51:34 -0800

Hi Andrea/Dimitris,

Thanks for the tips. Actually, when I said I was no core expert, I meant I
was an absolute beginner. :) I wanted to go with an extractor because that
seemed simpler (and safer) than meddling with the core. Most of the stuff
in there still seems rather confusing, but I'll look into it.


So, is the UriPolicy code where the triples get written (pointer to the
exact line, anyone?), or is it simply where you'd like to place the new
code? Also, would "UriPolicy" remain an adequate name for the class, then?

Best,
Uros


> Maybe something like:
>
> script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>
> where you specify a list of (languageTag:transliterator) pairs separated
> by ';' for one language?
> The transliterator could be either "identity" (no transformation) or an
> icu4j transliterator ID.
>
> As Dimitris said, Uros please feel free to ask if you need help!
>
> Cheers
> Andrea
>
>
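A minimal sketch of how a value in the proposed format could be parsed, assuming the syntax stays exactly as suggested above. The class name ScriptConfig and the method parse are hypothetical and are not part of the extraction framework; the transliterator ID is kept as an opaque string and is not validated against icu4j here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses "sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN". */
final class ScriptConfig {

    /**
     * Returns languageTag -> transliterator ID, in declaration order.
     * "identity" is passed through like any other ID; the caller decides
     * that it means "no transformation".
     */
    static Map<String, String> parse(String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : value.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate empty entries such as "a:b;;c:d"
            }
            // Split on the first ':' only, so an ID containing ':' stays intact.
            int colon = trimmed.indexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                        "Expected languageTag:transliterator, got: " + entry);
            }
            result.put(trimmed.substring(0, colon).trim(),
                       trimmed.substring(colon + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> scripts =
                parse("sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN");
        scripts.forEach((tag, id) -> System.out.println(tag + " -> " + id));
    }
}
```

One thing to keep in mind with this syntax: compound ICU transliterator IDs use ';' themselves, so a different list separator might be needed if such IDs are ever to be supported.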
> 2013/11/30 Dimitris Kontokostas <[email protected]>
>
>>
>>
>>
>> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna
>> <[email protected]> wrote:
>>
>>> Hello Uros,
>>>
>>> that's a really interesting problem :)
>>> I am no expert either but probably the best approach would be to
>>> "duplicate" triples when they are going to be written (e.g. in the
>>> destinations package), instead of modifying the extractors.
>>>
>>
>> I agree, I'd suggest we extend the UriPolicy [1] functionality to do
>> string object transformations (now it only applies to URIs / IRIs)
>> and use the configuration files to select the desired output [2].
>> Uros, do you want to give it a shot? You can always ask for help here ;)
>>
>> [1]
>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>> [2]
>> https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>>
>>
>>> As for which tools to use, it looks like the icu4j Transliterator
>>> suits your needs, e.g.
>>>
>>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate(
>>>     "Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>>>
>>> results in
>>>
>>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova
>>> zvezda.
>>>
>>> What do you think?
>>>
>>> Cheers
>>>  Andrea
>>>
>>>
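For readers without icu4j at hand, the romanization step itself can be illustrated with a plain lookup table. This is only a stdlib stand-in, not a replacement for the icu4j Transliterator: the class SerbianRomanizer is hypothetical, it maps Serbian Cyrillic letters to Gaj's Latin alphabet one character at a time, and it always title-cases the digraphs (Љ becomes Lj, never LJ), so its output may differ from ICU's on all-caps text and other edge cases.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical table-based Serbian Cyrillic -> Latin romanizer (sketch). */
final class SerbianRomanizer {

    // The 30 lowercase Serbian Cyrillic letters and their Latin counterparts,
    // position by position.
    private static final String CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш";
    private static final String[] LATIN = {
        "a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj",
        "m", "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č",
        "dž", "š"
    };

    private static final Map<Character, String> TABLE = new HashMap<>();

    static {
        for (int i = 0; i < LATIN.length; i++) {
            char lower = CYRILLIC.charAt(i);
            String latin = LATIN[i];
            TABLE.put(lower, latin);
            // Uppercase letters map to a title-cased Latin form ("Lj", "Dž").
            String capitalized =
                    latin.substring(0, 1).toUpperCase(Locale.ROOT) + latin.substring(1);
            TABLE.put(Character.toUpperCase(lower), capitalized);
        }
    }

    /** Replaces every mapped Cyrillic letter; all other characters pass through. */
    static String romanize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(TABLE.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(romanize("Ратова звезда"));
    }
}
```

Text that is already in Latin script (such as "Malachor V" in the sentence above) is left untouched, which is the behaviour wanted for mixed-script abstracts.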
>>> 2013/11/29 Uros Milosevic <[email protected]>
>>>
>>>> Hi all,
>>>>
>>>> As some of you may know, a Serbian version of DBpedia is currently in the
>>>> works. Now, Serbian, unlike any other language in Europe, is digraphic in
>>>> nature, officially supporting both the (Serbian) Cyrillic and (Gaj's)
>>>> Latin alphabets. This is absolutely fine for storing information in any
>>>> modern knowledge base, but can often be a major obstacle for information
>>>> retrieval.
>>>>
>>>> For instance, most Serbs rely on the Latin alphabet for
>>>> communication/interaction on the Web. That means a large portion of the
>>>> information is (and, often, is expected to be) in the Latin script (e.g.
>>>> encoded in ISO 8859-2, i.e. Latin-2). And yet, 99% of the information in
>>>> the Serbian Wikipedia dumps is in the Cyrillic script (ISO 15924 code
>>>> Cyrl). So, unless your software performs romanization (i.e. converts
>>>> Cyrillic to Latin) or cyrillization (i.e. vice versa) on the fly, at
>>>> retrieval time (Wikipedia appears to be doing this), many attempts at
>>>> information extraction will be doomed to fail. This directly affects
>>>> common tasks such as keyword search, label-based SPARQL querying, named
>>>> entity recognition, etc.
>>>>
>>>> What I would like to do is improve some of the existing DBpedia
>>>> extractors, or develop new ones, that would take this problem into
>>>> consideration and perform romanization of Wikipedia dumps so as to output
>>>> information encoded in *both* scripts. Now, I know storing the same
>>>> information twice might not be the most elegant solution, but unless
>>>> someone is to include romanization/cyrillization features in the next
>>>> version of SPARQL, I don't see a better solution at the moment. Of course,
>>>> there is also the matter of perspective - one could argue that although
>>>> the information is the same, the very fact that different character
>>>> sequences are needed to describe the same piece of knowledge makes this
>>>> problem fall into the domain of multilinguality.
>>>>
>>>> So, the general idea is to use a single IRI per resource, but have two
>>>> separate triples for any literal originally written in Cyrillic. For
>>>> example:
>>>>
>>>> <http://sr.dbpedia.org/resource/Парсер>
>>>> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
>>>> <http://sr.dbpedia.org/resource/Парсер>
>>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>>>>
>>>> The above language tags are as per the IANA Language Subtag Registry [1],
>>>> which lists them as redundant, though, so a "sr" tag, instead, could be
>>>> enough for both.
>>>>
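The "one IRI, two literals" idea can be sketched as follows. This is an illustration under stated assumptions, not how the extraction framework writes triples: the class DigraphicLiterals, its methods, and the romanizer parameter are all hypothetical, and the romanizer is passed in as a plain function so that any transliteration backend (icu4j or otherwise) could be plugged in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical helper: emits a Cyrillic literal plus its romanized twin. */
final class DigraphicLiterals {

    /** Escapes backslash, double quote, and line breaks for an N-Triples literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    /**
     * Returns two N-Triples lines sharing subject and predicate:
     * the original literal tagged sr-Cyrl and the romanized one tagged sr-Latn.
     */
    static List<String> labelTriples(String subjectIri, String predicateIri,
                                     String cyrillic, UnaryOperator<String> romanizer) {
        List<String> out = new ArrayList<>();
        out.add(triple(subjectIri, predicateIri, cyrillic, "sr-Cyrl"));
        out.add(triple(subjectIri, predicateIri, romanizer.apply(cyrillic), "sr-Latn"));
        return out;
    }

    private static String triple(String s, String p, String literal, String lang) {
        return "<" + s + "> <" + p + "> \"" + escape(literal) + "\"@" + lang + " .";
    }

    public static void main(String[] args) {
        // Stub romanizer for the demo only; a real one would transliterate any input.
        UnaryOperator<String> stub = s -> s.equals("Парсер") ? "Parser" : s;
        labelTriples("http://sr.dbpedia.org/resource/Парсер",
                     "http://www.w3.org/2000/01/rdf-schema#label",
                     "Парсер", stub)
                .forEach(System.out::println);
    }
}
```

Keeping the romanizer as a parameter means the duplication logic does not care which script conversion is used, which would also cover the reverse (cyrillization) case with the tags swapped.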
>>>> I'm no DBpedia core expert, so some tips, ideas, directions or any other
>>>> information that would help me get started would be much appreciated!
>>>>
>>>> Best,
>>>> Uros
>>>>
>>>> [1]
>>>>
>>>> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Rapidly troubleshoot problems before they affect your business. Most
>>>> IT
>>>> organizations don't have a clear picture of how application
>>>> performance
>>>> affects their revenue. With AppDynamics, you get 100% visibility into
>>>> your
>>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>>>> AppDynamics Pro!
>>>>
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> Dbpedia-developers mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>>>
>>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>



