Re: [Dbpedia-developers] Support for digraphia

Jona Christopher Sahnwaldt Tue, 03 Dec 2013 07:34:16 -0800

Hi all,

I don't think UriPolicy is a good place to do this...


But anyway, I don't understand the problem yet. :-)

Uros, you wrote about ISO 8859-2 and ISO 15924.

ISO 8859-2 is a character encoding, but I'm pretty sure that Wikipedia
is not using it, and I know that DBpedia is not using it. I think
Wikipedia uses UTF-8 all over the place. I know that the Wikipedia XML
dumps are UTF-8 encoded, and so are the DBpedia dumps.

ISO 15924 is not a character encoding, but a way to specify the names
of scripts. See https://en.wikipedia.org/wiki/ISO_15924
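
To make the difference concrete, here is a minimal sketch in Scala that uses only the JDK. The object and value names are made up for this example, and it assumes the running JDK provides the ISO-8859-2 charset (it is not one of the charsets every JVM is required to ship).

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.util.Locale

// Hypothetical names, for illustration only.
object EncodingVsScript {
  val cyrillicLabel = "Парсер"
  val latinLabel = "Parser"

  // Character encoding: how characters are turned into bytes.
  // UTF-8 can encode both labels; each Cyrillic letter here takes two bytes.
  val utf8Length: Int = cyrillicLabel.getBytes(StandardCharsets.UTF_8).length

  // ISO 8859-2 (Latin-2) contains no Cyrillic letters, so it cannot encode
  // the Cyrillic label. Assumes the JDK provides the ISO-8859-2 charset.
  private def latin2 = Charset.forName("ISO-8859-2").newEncoder()
  val cyrillicFitsLatin2: Boolean = latin2.canEncode(cyrillicLabel)
  val latinFitsLatin2: Boolean = latin2.canEncode(latinLabel)

  // ISO 15924: four-letter script codes, used as subtags in language tags.
  val scriptOfCyrl: String = Locale.forLanguageTag("sr-Cyrl").getScript
  val scriptOfLatn: String = Locale.forLanguageTag("sr-Latn").getScript
}
```

So the encoding (UTF-8) and the script (Cyrl or Latn) are independent properties of the same string.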

What would romanization or cyrillization do exactly? Is there a
one-to-one mapping between letters? Or letter sequences?
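
As far as I can tell, for Serbian the Cyrillic-to-Latin direction is a fixed per-letter table in which three letters (љ, њ, џ) become two-letter sequences. Below is a rough sketch of that idea in Scala. The table is written by hand for illustration and is not the ICU rule set; `SerbianRomanization` and `romanize` are made-up names.

```scala
// Hypothetical, hand-written table for illustration; not the ICU rules.
object SerbianRomanization {
  // Lower-case Serbian Cyrillic -> Gaj's Latin. 27 letters map to a single
  // Latin letter; љ, њ and џ map to the digraphs lj, nj and dž.
  private val lower: Map[Char, String] = Map(
    'а' -> "a", 'б' -> "b", 'в' -> "v", 'г' -> "g", 'д' -> "d",
    'ђ' -> "đ", 'е' -> "e", 'ж' -> "ž", 'з' -> "z", 'и' -> "i",
    'ј' -> "j", 'к' -> "k", 'л' -> "l", 'љ' -> "lj", 'м' -> "m",
    'н' -> "n", 'њ' -> "nj", 'о' -> "o", 'п' -> "p", 'р' -> "r",
    'с' -> "s", 'т' -> "t", 'ћ' -> "ć", 'у' -> "u", 'ф' -> "f",
    'х' -> "h", 'ц' -> "c", 'ч' -> "č", 'џ' -> "dž", 'ш' -> "š"
  )

  // Upper-case letters map to title-cased output, e.g. Љ -> Lj.
  private val table: Map[Char, String] =
    lower ++ lower.map { case (c, s) => c.toUpper -> s.capitalize }

  // Characters outside the table (Latin text, digits, punctuation) pass
  // through unchanged.
  def romanize(text: String): String =
    text.map(c => table.getOrElse(c, c.toString)).mkString
}
```

Two caveats with this sketch: all-caps words come out as "LjUBAV" rather than "LJUBAV", and the reverse direction is not a plain per-letter table, since "nj" is usually њ but is н + ј in a word like инјекција.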

Cheers,
JC

On 3 December 2013 16:02, Dimitris Kontokostas <[email protected]> wrote:
> Hi Uros,
>
> Don't worry, as we said we are here to help if you get stuck;) we all
> started like this.
>
> If you look at the formatters package you will understand what's going on.
> We have formatters that write a triple based on some policies we define.
> We parse the policies at runtime, create formatters based on these policies
> and feed them to destinations.
>
> I think we should generalize URIPolicy to TriplePolicy and create a
> "transliterate" action.
> I made a change in the URIPolicy code to make it more descriptive [1].
> Right now we support only URIs, but changing the following types should
> be a good start for your changes:
>
>   // String: URI or literal; Boolean: whether it is a URI; String: output
>   // (new URI or transliterated string)
>   type Policy = (String, Boolean) => String
>
>   type PolicyApplicable = (String, Boolean) => Boolean
>
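
A rough sketch of how the two proposed types could fit together, in Scala. Only `Policy` and `PolicyApplicable` come from the message above; `transliterate`, `literalsOnly` and `applyAll` are hypothetical helpers invented for this example and are not part of the extraction framework.

```scala
object TriplePolicySketch {
  // Types as proposed above: (value, isUri) => output / applies?
  type Policy = (String, Boolean) => String
  type PolicyApplicable = (String, Boolean) => Boolean

  // Hypothetical helper: wrap any String => String transformation so that
  // it rewrites literals and leaves URIs untouched.
  def transliterate(f: String => String): Policy =
    (value, isUri) => if (isUri) value else f(value)

  // Hypothetical applicability check: only literals qualify.
  val literalsOnly: PolicyApplicable = (_, isUri) => !isUri

  // Apply each policy in order, skipping those that do not apply.
  def applyAll(policies: Seq[(PolicyApplicable, Policy)])(value: String, isUri: Boolean): String =
    policies.foldLeft(value) { case (current, (applicable, policy)) =>
      if (applicable(current, isUri)) policy(current, isUri) else current
    }
}
```

Here the transformation is passed in as a plain function, so the same wrapper would work with an icu4j transliterator or any other String => String mapping.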
> I also submitted a feature request [2], you can make a proper description
> and continue the discussion there
>
> Cheers,
> Dimitris
>
>
> [1] https://github.com/dbpedia/extraction-framework/pull/131
> [2] https://github.com/dbpedia/extraction-framework/issues/130
>
>
> On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic <[email protected]>
> wrote:
>>
>> Hi Andrea/Dimitris,
>>
>> Thanks for the tips. Actually, when I said I was no core expert, I meant I
>> was an absolute beginner. :) I wanted to go with an extractor because that
>> seemed simpler (and safer) than meddling with the core. Most of the stuff
>> in there still seems rather confusing, but I'll look into it.
>>
>> So, the UriPolicy code is where the triples get written (pointer to the
>> exact line, anyone?), or is this simply where you'd like to place the new
>> code? Also, would "UriPolicy" remain an adequate name for the class, then?
>>
>> Best,
>> Uros
>>
>>
>> > Maybe something like:
>> >
>> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>> >
>> > where you specify a list of (languageTag:transliterator) pairs separated
>> > by ';' for one language?
>> > The transliterator could be either "identity" (no transformation) or an
>> > icu4j transliterator ID.
>> >
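
A minimal sketch of parsing such a property value in Scala, assuming the format is exactly as suggested above (pairs separated by ';', tag and transliterator separated by the first ':'). `ScriptConfigSketch`, `ScriptOutput` and `parse` are made-up names, not existing framework code.

```scala
// Hypothetical names; the property format is the one suggested above.
object ScriptConfigSketch {
  final case class ScriptOutput(languageTag: String, transliteratorId: String)

  // Parse e.g. "sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN".
  def parse(value: String): Seq[ScriptOutput] =
    value.split(';').toSeq.map(_.trim).filter(_.nonEmpty).map { entry =>
      // Split on the first ':' only, so the transliterator ID is kept whole.
      entry.split(":", 2) match {
        case Array(tag, id) => ScriptOutput(tag.trim, id.trim)
        case _ =>
          throw new IllegalArgumentException(
            s"expected languageTag:transliterator, got '$entry'")
      }
    }
}
```

Each resulting pair would then tell the formatter which language tag to emit and which transformation to apply to the literal.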
>> > As Dimitris said, Uros please feel free to ask if you need help!
>> >
>> > Cheers
>> > Andrea
>> >
>> >
>> > 2013/11/30 Dimitris Kontokostas <[email protected]>
>> >
>> >>
>> >>
>> >>
>> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna
>> >> <[email protected]> wrote:
>> >>
>> >>> Hello Uros,
>> >>>
>> >>> that's a really interesting problem :)
>> >>> I am no expert either but probably the best approach would be to
>> >>> "duplicate" triples when they are going to be written (e.g. in the
>> >>> destinations package), instead of modifying the extractors.
>> >>>
>> >>
>> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to do
>> >> string object transformations (now it only applies to URIs / IRIs)
>> >> and use the configuration files to select the desired output [2].
>> >> Uros, do you want to give it a shot? You can always ask for help here
>> >> ;)
>> >>
>> >> [1]
>> >>
>> >> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>> >> [2]
>> >>
>> >> https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>> >>
>> >>
>> >>> As for which tools to use, it looks like the icu4j Transliterator
>> >>> suits your needs, e.g.
>> >>>
>> >>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate("Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>> >>>
>> >>> results in
>> >>>
>> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova
>> >>> zvezda.
>> >>>
>> >>> What do you think?
>> >>>
>> >>> Cheers
>> >>>  Andrea
>> >>>
>> >>>
>> >>> 2013/11/29 Uros Milosevic <[email protected]>
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> As some of you may know, a Serbian version of DBpedia is currently in
>> >>>> the works. Now, Serbian, unlike any other language in Europe, is
>> >>>> digraphic in nature, officially supporting both (Serbian) Cyrillic and
>> >>>> (Gaj's) Latin alphabet. This is absolutely fine for storing
>> >>>> information in any modern knowledge base, but can often be a major
>> >>>> obstacle for information retrieval.
>> >>>>
>> >>>> For instance, most Serbs rely on the Latin alphabet for
>> >>>> communication/interaction on the Web. That means a large portion of
>> >>>> the information is (and, often, expected to be) encoded in ISO 8859-2
>> >>>> (i.e. Latin-2). And, yet, 99% of the information in the Serbian
>> >>>> Wikipedia dumps is encoded in ISO 15924 (i.e. Cyrillic). So, unless
>> >>>> your software performs romanization (i.e. converts Cyrillic to Latin)
>> >>>> or cyrillization (i.e. vice versa) on-the-fly, at retrieval time
>> >>>> (Wikipedia appears to be doing this), many attempts at information
>> >>>> extraction will be doomed to fail. This directly affects common tasks
>> >>>> such as keyword search, label-based SPARQL querying, named entity
>> >>>> recognition, etc.
>> >>>>
>> >>>> What I would like to do is improve some of the existing DBpedia
>> >>>> extractors, or develop new ones, that would take this problem into
>> >>>> consideration and perform romanization of Wikipedia dumps so as to
>> >>>> output information encoded in *both* scripts. Now, I know storing the
>> >>>> same information twice might not be the most elegant solution, but
>> >>>> unless someone is to include romanization/cyrillization features in
>> >>>> the next version of SPARQL, I don't see a better solution at the
>> >>>> moment. Of course, there is also the matter of perspective - one could
>> >>>> argue that although the information is the same, the very fact that
>> >>>> different character sequences are needed to describe the same piece of
>> >>>> knowledge makes this problem fall into the domain of multilinguality.
>> >>>>
>> >>>> So, the general idea is to use a single IRI per resource, but have two
>> >>>> separate triples for any literal originally encoded in Cyrillic. For
>> >>>> example:
>> >>>>
>> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
>> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>> >>>>
>> >>>> The above language tags are as per IANA Language Subtag Registry [1],
>> >>>> which lists them as redundant, though, so a "sr" tag, instead, could be
>> >>>> enough for both.
>> >>>>
>> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or any other
>> >>>> information that would help me get started would be much appreciated!
>> >>>>
>> >>>> Best,
>> >>>> Uros
>> >>>>
>> >>>> [1]
>> >>>>
>> >>>>
>> >>>> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> ------------------------------------------------------------------------------
>> >>>> Rapidly troubleshoot problems before they affect your business. Most
>> >>>> IT
>> >>>> organizations don't have a clear picture of how application
>> >>>> performance
>> >>>> affects their revenue. With AppDynamics, you get 100% visibility into
>> >>>> your
>> >>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>> >>>> AppDynamics Pro!
>> >>>>
>> >>>>
>> >>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>> >>>> _______________________________________________
>> >>>> Dbpedia-developers mailing list
>> >>>> [email protected]
>> >>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>> >>>>
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Kontokostas Dimitris
>> >>
>> >
>>
>>
>
>
>
> --
> Kontokostas Dimitris
>
>
