Re: [Dbpedia-developers] Support for digraphia

Jona Christopher Sahnwaldt Tue, 03 Dec 2013 07:55:55 -0800

Maybe you could post-process the DBpedia dumps with this tool?
http://www.huge-man-linux.net/man1/recode-sr-latin.html


On 3 December 2013 16:33, Jona Christopher Sahnwaldt <[email protected]> wrote:
> Hi all,
>
> I don't think UriPolicy is a good place to do this...
>
> But anyway, I don't understand the problem yet. :-)
>
> Uros, you wrote about ISO 8859-2 and ISO 15924.
>
> ISO 8859-2 is a character encoding, but I'm pretty sure that Wikipedia
> is not using it, and I know that DBpedia is not using it. I think
> Wikipedia uses UTF-8 all over the place. I know that the Wikipedia XML
> dumps are UTF-8 encoded, and so are the DBpedia dumps.
>
> ISO 15924 is not a character encoding, but a way to specify the names
> of scripts. See https://en.wikipedia.org/wiki/ISO_15924
>
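To illustrate the distinction drawn above between a character encoding and a script, here is a small Scala sketch that uses only the JDK's java.nio.charset API. It assumes the JVM ships the ISO-8859-2 charset (common, but not one of the charsets every JVM is required to provide); the object and method names are made up for this example.

```scala
import java.nio.charset.Charset

object EncodingVsScript {
  // True if every character of `text` can be represented in the named charset.
  // Throws if the JVM does not know the charset name.
  def encodable(text: String, charsetName: String): Boolean =
    Charset.forName(charsetName).newEncoder().canEncode(text)
}
```

Under that assumption, UTF-8 accepts both "Парсер" and "Parser", whereas ISO-8859-2 accepts Latin-script Serbian such as "izmišljena" but rejects the Cyrillic "Парсер". The script of a text is a separate question from the encoding of the file that holds it.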
> What would romanization or cyrillization do exactly? Is there a
> one-to-one mapping between letters? Or letter sequences?
>
> Cheers,
> JC
>
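On the question above of whether the mapping is letter-to-letter: a minimal sketch, assuming the standard correspondence between Serbian Cyrillic and Gaj's Latin alphabet, in which 27 of the 30 letters map to a single Latin letter and љ, њ, џ map to the digraphs lj, nj, dž. The object name and the casing rule are choices made for this example; this is not what icu4j or recode-sr-latin do internally.

```scala
object SerbianRomanizer {
  // Lowercase Serbian Cyrillic -> Gaj's Latin (30 letters).
  private val lower: Map[Char, String] = Map(
    'а' -> "a", 'б' -> "b", 'в' -> "v", 'г' -> "g", 'д' -> "d", 'ђ' -> "đ",
    'е' -> "e", 'ж' -> "ž", 'з' -> "z", 'и' -> "i", 'ј' -> "j", 'к' -> "k",
    'л' -> "l", 'љ' -> "lj", 'м' -> "m", 'н' -> "n", 'њ' -> "nj", 'о' -> "o",
    'п' -> "p", 'р' -> "r", 'с' -> "s", 'т' -> "t", 'ћ' -> "ć", 'у' -> "u",
    'ф' -> "f", 'х' -> "h", 'ц' -> "c", 'ч' -> "č", 'џ' -> "dž", 'ш' -> "š"
  )

  // Romanizes letter by letter; characters outside the table pass through.
  // Simplification: an uppercase digraph letter always becomes title case
  // (Љ -> "Lj"), which is not ideal inside all-caps words.
  def romanize(text: String): String = {
    val out = new StringBuilder
    for (c <- text) {
      lower.get(c.toLower) match {
        case Some(latin) if c.isUpper => out ++= latin.capitalize
        case Some(latin)              => out ++= latin
        case None                     => out += c
      }
    }
    out.toString
  }
}
```

Romanization is therefore a simple per-letter substitution. The reverse direction is harder: a Latin "nj" usually stands for њ, but in a word like "injekcija" it stands for the two letters н and ј, so cyrillization cannot be done by table lookup alone.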
> On 3 December 2013 16:02, Dimitris Kontokostas <[email protected]> wrote:
>> Hi Uros,
>>
>> Don't worry, as we said we are here to help if you get stuck;) we all
>> started like this.
>>
>> If you look at the formatters package you will understand what's going on.
>> We have formatters that write a triple based on some policies we define.
>> We parse the policies at runtime, create formatters based on these policies
>> and feed them to destinations.
>>
>> I think we should generalize UriPolicy to TriplePolicy and create a
>> "transliterate" action.
>> I made a change in the UriPolicy code to make it more descriptive [1].
>> Right now we only support URIs, but changing the following types should
>> be a good starting point for your changes:
>>
>>   // String: URI or literal; Boolean: whether it is a URI;
>>   // result String: the output (new URI or transliterated string)
>>   type Policy = (String, Boolean) => String
>>
>>   type PolicyApplicable = (String, Boolean) => Boolean
>>
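A rough sketch of how the two generalized type aliases above could carry a "transliterate" action. Only Policy and PolicyApplicable come from the message; TriplePolicySketch, transliterate, literalsOnly and applyAll are hypothetical names, and the String => String parameter stands in for whatever transliterator ends up being used (for example an icu4j one).

```scala
object TriplePolicySketch {
  // (value, isUri) => transformed value
  type Policy = (String, Boolean) => String
  // (value, isUri) => whether the policy should run at all
  type PolicyApplicable = (String, Boolean) => Boolean

  // Hypothetical "transliterate" action: URIs pass through untouched,
  // literals are rewritten by the supplied function.
  def transliterate(f: String => String): Policy =
    (value, isUri) => if (isUri) value else f(value)

  // Applicability check that selects literals only.
  val literalsOnly: PolicyApplicable = (_, isUri) => !isUri

  // Runs each applicable policy in order over one URI or literal.
  def applyAll(value: String, isUri: Boolean,
               policies: Seq[(PolicyApplicable, Policy)]): String =
    policies.foldLeft(value) { case (v, (applicable, policy)) =>
      if (applicable(v, isUri)) policy(v, isUri) else v
    }
}
```

Keeping the transliterator behind a plain function makes the policy easy to test without pulling in a transliteration library; how this would be wired into the existing formatters is left open here.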
>> I also submitted a feature request [2], you can make a proper description
>> and continue the discussion there
>>
>> Cheers,
>> Dimitris
>>
>>
>> [1] https://github.com/dbpedia/extraction-framework/pull/131
>> [2] https://github.com/dbpedia/extraction-framework/issues/130
>>
>>
>> On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic <[email protected]>
>> wrote:
>>>
>>> Hi Andrea/Dimitris,
>>>
>>> Thanks for the tips. Actually, when I said I was no core expert, I meant I
>>> was an absolute beginner. :) I wanted to go with an extractor because that
>>> seemed simpler (and safer) than meddling with the core. Most of the stuff
>>> in there still seems rather confusing, but I'll look into it.
>>>
>>> So, the UriPolicy code is where the triples get written (pointer to the
>>> exact line, anyone?), or is this simply where you'd like to place the new
>>> code? Also, would "UriPolicy" remain an adequate name for the class, then?
>>>
>>> Best,
>>> Uros
>>>
>>>
>>> > Maybe something like:
>>> >
>>> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>>> >
>>> > where you specify a list of (languageTag:transliterator) pairs,
>>> > separated by ';', for one language?
>>> > The transliterator could be either "identity" (no transformation) or an
>>> > icu4j transliterator ID.
>>> >
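The proposed property value could be parsed along these lines. This is only a sketch of the format suggested above: ScriptConfigSketch, ScriptVariant and parse are hypothetical names, and the error handling is one possible choice.

```scala
object ScriptConfigSketch {
  // One output variant: a language tag plus either "identity" or a
  // transliterator ID such as "Serbian-Latin/BGN".
  final case class ScriptVariant(languageTag: String, transliteratorId: String)

  // Parses e.g. "sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN".
  def parse(value: String): Seq[ScriptVariant] =
    value.split(';').toSeq.map(_.trim).filter(_.nonEmpty).map { entry =>
      // Split on the first ':' only, so the ID itself may contain ':'.
      entry.split(":", 2) match {
        case Array(tag, id) if tag.trim.nonEmpty && id.trim.nonEmpty =>
          ScriptVariant(tag.trim, id.trim)
        case _ =>
          throw new IllegalArgumentException(
            s"Expected languageTag:transliterator, got '$entry'")
      }
    }
}
```

Parsing the example value yields two variants, (sr-Cyrl, identity) and (sr-Latn, Serbian-Latin/BGN), in the order they were written.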
>>> > As Dimitris said, Uros please feel free to ask if you need help!
>>> >
>>> > Cheers
>>> > Andrea
>>> >
>>> >
>>> > 2013/11/30 Dimitris Kontokostas <[email protected]>
>>> >
>>> >>
>>> >>
>>> >>
>>> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna <[email protected]> wrote:
>>> >>
>>> >>> Hello Uros,
>>> >>>
>>> >>> that's a really interesting problem :)
>>> >>> I am no expert either but probably the best approach would be to
>>> >>> "duplicate" triples when they are going to be written (e.g. in the
>>> >>> destinations package), instead of modifying the extractors.
>>> >>>
>>> >>
>>> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to do
>>> >> string object transformations (now it only applies to URIs / IRIs)
>>> >> and use the configuration files to select the desired output [2].
>>> >> Uros, do you want to give it a shot? You can always ask for help here ;)
>>> >>
>>> >> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>>> >> [2] https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>>> >>
>>> >>
>>> >>> As for which tool to use, it looks like the icu4j Transliterator
>>> >>> suits your needs, e.g.
>>> >>>
>>> >>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate(
>>> >>>   "Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>>> >>>
>>> >>> results in
>>> >>>
>>> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova zvezda.
>>> >>>
>>> >>> What do you think?
>>> >>>
>>> >>> Cheers
>>> >>>  Andrea
>>> >>>
>>> >>>
>>> >>> 2013/11/29 Uros Milosevic <[email protected]>
>>> >>>
>>> >>>> Hi all,
>>> >>>>
>>> >>>> As some of you may know, a Serbian version of DBpedia is currently in
>>> >>>> the works. Now, Serbian, unlike any other language in Europe, is
>>> >>>> digraphic in nature, officially supporting both (Serbian) Cyrillic and
>>> >>>> (Gaj's) Latin alphabet. This is absolutely fine for storing
>>> >>>> information in any modern knowledge base, but can often be a major
>>> >>>> obstacle for information retrieval.
>>> >>>>
>>> >>>> For instance, most Serbs rely on the Latin alphabet for
>>> >>>> communication/interaction on the Web. That means a large portion of
>>> >>>> the information is (and, often, expected to be) encoded in ISO 8859-2
>>> >>>> (i.e. Latin-2). And, yet, 99% of the information in the Serbian
>>> >>>> Wikipedia dumps is encoded in ISO 15924 (i.e. Cyrillic). So, unless
>>> >>>> your software performs romanization (i.e. converts Cyrillic to Latin)
>>> >>>> or cyrillization (i.e. vice versa) on-the-fly, at retrieval time
>>> >>>> (Wikipedia appears to be doing this), many attempts at information
>>> >>>> extraction will be doomed to fail. This directly affects common tasks
>>> >>>> such as keyword search, label-based SPARQL querying, named entity
>>> >>>> recognition, etc.
>>> >>>>
>>> >>>> What I would like to do is improve some of the existing DBpedia
>>> >>>> extractors, or develop new ones, that would take this problem into
>>> >>>> consideration and perform romanization of Wikipedia dumps so as to
>>> >>>> output information encoded in *both* scripts. Now, I know storing the
>>> >>>> same information twice might not be the most elegant solution, but
>>> >>>> unless someone is to include romanization/cyrillization features in
>>> >>>> the next version of SPARQL, I don't see a better solution at the
>>> >>>> moment. Of course, there is also the matter of perspective - one could
>>> >>>> argue that although the information is the same, the very fact that
>>> >>>> different character sequences are needed to describe the same piece of
>>> >>>> knowledge makes this problem fall into the domain of multilinguality.
>>> >>>>
>>> >>>> So, the general idea is to use a single IRI per resource, but have two
>>> >>>> separate triples for any literal originally encoded in Cyrillic. For
>>> >>>> example:
>>> >>>>
>>> >>>> <http://sr.dbpedia.org/resource/Парсер>
>>> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
>>> >>>> <http://sr.dbpedia.org/resource/Парсер>
>>> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>>> >>>>
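The two-triple idea above can be sketched as a small helper that emits one N-Triples line per script for the same subject and predicate. DualScriptLabels and labelTriples are hypothetical names, the romanizer is passed in as a plain function, and only the most common literal escapes are handled; this is not code from the extraction framework.

```scala
object DualScriptLabels {
  // Minimal N-Triples literal escaping: backslash first, then quote and
  // line breaks.
  private def escape(s: String): String =
    s.replace("\\", "\\\\")
      .replace("\"", "\\\"")
      .replace("\n", "\\n")
      .replace("\r", "\\r")

  // Returns the Cyrillic triple followed by its romanized counterpart.
  def labelTriples(subject: String, predicate: String, cyrillic: String,
                   romanize: String => String): Seq[String] =
    Seq(
      s"""<$subject> <$predicate> "${escape(cyrillic)}"@sr-Cyrl .""",
      s"""<$subject> <$predicate> "${escape(romanize(cyrillic))}"@sr-Latn ."""
    )
}
```

Both lines share the same IRI, so a label-based lookup matches the resource regardless of which script the query uses.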
>>> >>>> The above language tags are as per IANA Language Subtag Registry [1],
>>> >>>> which lists them as redundant, though, so a "sr" tag, instead, could
>>> >>>> be enough for both.
>>> >>>>
>>> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or any
>>> >>>> other information that would help me get started would be much
>>> >>>> appreciated!
>>> >>>>
>>> >>>> Best,
>>> >>>> Uros
>>> >>>>
>>> >>>> [1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>>> >>>>
>>> >>>> ------------------------------------------------------------------------------
>>> >>>> Rapidly troubleshoot problems before they affect your business. Most
>>> >>>> IT
>>> >>>> organizations don't have a clear picture of how application
>>> >>>> performance
>>> >>>> affects their revenue. With AppDynamics, you get 100% visibility into
>>> >>>> your
>>> >>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>>> >>>> AppDynamics Pro!
>>> >>>>
>>> >>>>
>>> >>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>>> >>>> _______________________________________________
>>> >>>> Dbpedia-developers mailing list
>>> >>>> [email protected]
>>> >>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>> >>>>
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> Kontokostas Dimitris
>>> >>
>>> >
>>>
>>>
>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>>
