Re: [Dbpedia-developers] Support for digraphia

Jona Christopher Sahnwaldt Tue, 03 Dec 2013 09:02:19 -0800

On 3 December 2013 16:54, Andrea Di Menna <[email protected]> wrote:
> Hi,
>
> I agree with JC that probably UriPolicy is not the best place.


I guess extending UriPolicy looks attractive because modifying
literals has some common needs with modifying URIs. But we should
rather introduce a new class StringLiteralPolicy or so and move some
code from UriPolicy to a common base class. Maybe we can share the
policy parsing code etc. But literals and URIs are quite different and
should probably be handled by different classes.

Maybe we need a new Destination subclass too (or instead). Actually,
if we follow YAGNI and KISS principles we should simply use a
SerbianTransliterationDestination...
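
A minimal sketch of what such a destination could look like. It assumes
simplified stand-ins for the framework's types: the Quad case class, the
Destination trait and BufferDestination below are hypothetical and much
smaller than the real classes. It also assumes every literal tagged "sr" is
Cyrillic, which the dumps only approximate.

```scala
// Hypothetical, simplified stand-ins; NOT the extraction framework's real classes.
final case class Quad(
    subject: String,
    predicate: String,
    value: String,
    language: Option[String] // None for URI objects, Some(tag) for literals
)

trait Destination {
  def write(quads: Seq[Quad]): Unit
}

// Collects quads in memory so the behaviour can be inspected.
final class BufferDestination extends Destination {
  val written = scala.collection.mutable.ArrayBuffer.empty[Quad]
  def write(quads: Seq[Quad]): Unit = written ++= quads
}

// Wraps another destination and emits every "sr" literal twice:
// once unchanged (tagged sr-Cyrl) and once transliterated (tagged sr-Latn).
// Assumption: literals tagged "sr" are Cyrillic in the input.
final class SerbianTransliterationDestination(
    underlying: Destination,
    toLatin: String => String
) extends Destination {
  def write(quads: Seq[Quad]): Unit =
    underlying.write(quads.flatMap {
      case q @ Quad(_, _, value, Some("sr")) =>
        Seq(
          q.copy(language = Some("sr-Cyrl")),
          q.copy(value = toLatin(value), language = Some("sr-Latn"))
        )
      case other => Seq(other) // URIs and other languages pass through
    })
}
```

The transliteration function is passed in, so the wrapper does not care
whether it comes from ICU4J or from a hand-written table.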

> As per Uros's use case, I understand that what he would like to obtain is a
> duplication of quads.
> Probably this should be done in the Formatters or maybe as a post-processing
> operation?
>
> The problem is the following:
> - some languages are officially digraphic, that is, they can use two
> different scripts (e.g. Latin and Cyrillic scripts)
> - Serbian (sr) is a digraphic language (Latin and Cyrillic)
> - Serbian Wikipedia allows users to see articles in Latin and Cyrillic, e.g.
> Cyrillic:
> https://sr.wikipedia.org/sr-ec/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
> Latin:
> https://sr.wikipedia.org/sr-el/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
> - Wikipedia dumps do not contain both versions but only Cyrillic in 99% of
> the cases
> - if you were to extract string objects from the sr dump you would get
> Cyrillic almost everywhere, for labels or for template property values

I just looked at a few pages in the Serbian Wikipedia.

There is a piece of MediaWiki syntax that I hadn't seen before:
wrapping text in -{...}- keeps it from being transliterated. In an
ideal world, we would extend the DBpedia parser to handle this...

There are actually three ways a Serbian Wikipedia page can be
displayed: unchanged, transliterated to Cyrillic, transliterated to
Latin. For example, I put this wiki text on my Serbian Wikipedia user
page:

Unprotected: Test
Protected: -{Test}-
Unprotected: Парсер
Protected: -{Парсер}-

Depending on the URL, it is displayed in different ways:

http://sr.wikipedia.org/wiki/Корисник:Chrisahn or
http://sr.wikipedia.org/sr/Корисник:Chrisahn - unmodified

Unprotected: Test
Protected: Test
Unprotected: Парсер
Protected: Парсер

http://sr.wikipedia.org/sr-ec/Корисник:Chrisahn - transliterated to
Cyrillic unless protected

Унпротецтед: Тест
Протецтед: Test
Унпротецтед: Парсер
Протецтед: Парсер

http://sr.wikipedia.org/sr-el/Корисник:Chrisahn - transliterated to
Latin unless protected

Unprotected: Test
Protected: Test
Unprotected: Parser
Protected: Парсер


>
> Uros is wondering what would happen if a Serbian user searches using, for
> example, the Latin-transliterated version of a Cyrillic label (e.g. with
> SPARQL on Virtuoso).
> Their search would probably fail (unless Virtuoso implements transliteration
> on-the-fly).
>
> Romanization or Cyrillization are transliteration methods which are also
> available through ICU4J
> [http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Transliterator.html]

Looks good, but is there an implementation for Serbian? If there
isn't, this probably won't help us much. Not enough to justify adding
ICU4J as a new dependency, I think.
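
For comparison, a table-based conversion without ICU4J is small. The sketch
below assumes the standard 30-letter Serbian Cyrillic alphabet and Gaj's
Latin; the object name is made up for illustration, and upper-case digraphs
are always title-cased ("Љ" becomes "Lj"), which is a simplification for
all-caps words.

```scala
object SerbianCyrillicToLatin {
  // Lower-case Serbian Cyrillic letters and their Gaj's Latin counterparts.
  // Three letters map to two-letter sequences: љ, њ and џ.
  private val lower: Map[Char, String] = Map(
    'а' -> "a", 'б' -> "b", 'в' -> "v", 'г' -> "g", 'д' -> "d", 'ђ' -> "đ",
    'е' -> "e", 'ж' -> "ž", 'з' -> "z", 'и' -> "i", 'ј' -> "j", 'к' -> "k",
    'л' -> "l", 'љ' -> "lj", 'м' -> "m", 'н' -> "n", 'њ' -> "nj", 'о' -> "o",
    'п' -> "p", 'р' -> "r", 'с' -> "s", 'т' -> "t", 'ћ' -> "ć", 'у' -> "u",
    'ф' -> "f", 'х' -> "h", 'ц' -> "c", 'ч' -> "č", 'џ' -> "dž", 'ш' -> "š"
  )

  // Upper-case letters map to title-cased Latin, e.g. 'Љ' -> "Lj".
  private val upper: Map[Char, String] =
    lower.map { case (c, s) => c.toUpper -> s.capitalize }

  private val table: Map[Char, String] = lower ++ upper

  // Characters outside the table (Latin text, digits, punctuation) are kept.
  def apply(text: String): String =
    text.map(c => table.getOrElse(c, c.toString)).mkString
}
```

Applied to the sentence about Malachor V quoted further down in this thread,
this produces the same Latin text as the ICU4J call shown there. The reverse
direction (Latin to Cyrillic) is harder, because "lj", "nj" and "dž" are not
always digraphs.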

>
> I think it does not make sense to transliterate URIs but only string typed
> values.

I don't know. Wikipedia seems to have some elaborate rules in place as
far as Latin/Cyrillic URLs are concerned. Maybe we should follow these
rules too?

Cheers,
JC

>
> Cheers
> Andrea
>
>
> 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>>
>> Hi all,
>>
>> I don't think UriPolicy is a good place to do this...
>>
>> But anyway, I don't understand the problem yet. :-)
>>
>> Uros, you wrote about ISO 8859-2 and ISO 15924.
>>
>> ISO 8859-2 is a character encoding, but I'm pretty sure that Wikipedia
>> is not using it, and I know that DBpedia is not using it. I think
>> Wikipedia uses UTF-8 all over the place. I know that the Wikipedia XML
>> dumps are UTF-8 encoded, and so are the DBpedia dumps.
>>
>> ISO 15924 is not a character encoding, but a way to specify the names
>> of scripts. See https://en.wikipedia.org/wiki/ISO_15924
>>
>> What would romanization or cyrillization do exactly? Is there a
>> one-to-one mapping between letters? Or letter sequences?
>>
>> Cheers,
>> JC
>>
>> On 3 December 2013 16:02, Dimitris Kontokostas <[email protected]> wrote:
>> > Hi Uros,
>> >
>> > Don't worry, as we said we are here to help if you get stuck ;) we all
>> > started like this.
>> >
>> > If you look at the formatters package you will understand what's going on.
>> > We have formatters that write a triple based on some policies we define.
>> > We parse the policies at runtime, create formatters based on these policies
>> > and feed them to destinations.
>> >
>> > I think we should generalize UriPolicy to TriplePolicy and create a
>> > "transliterate" action.
>> > I made a change in the UriPolicy code to make it more descriptive [1].
>> > Right now we have support only for URIs, but if you change the following it
>> > should be a good start to make your changes:
>> >
>> >   // String: URI or literal; Boolean: is URI or not;
>> >   // result String: output (new URI or transliterated string)
>> >   type Policy = (String, Boolean) => String
>> >
>> >   type PolicyApplicable = (String, Boolean) => Boolean
>> >
>> > I also submitted a feature request [2]; you can make a proper description
>> > and continue the discussion there.
>> >
>> > Cheers,
>> > Dimitris
>> >
>> >
>> > [1] https://github.com/dbpedia/extraction-framework/pull/131
>> > [2] https://github.com/dbpedia/extraction-framework/issues/130
>> >
>> >
>> > On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic <[email protected]>
>> > wrote:
>> >>
>> >> Hi Andrea/Dimitris,
>> >>
>> >> Thanks for the tips. Actually, when I said I was no core expert, I meant I
>> >> was an absolute beginner. :) I wanted to go with an extractor because that
>> >> seemed simpler (and safer) than meddling with the core. Most of the stuff
>> >> in there still seems rather confusing, but I'll look into it.
>> >>
>> >> So, the UriPolicy code is where the triples get written (pointer to the
>> >> exact line, anyone?), or is this simply where you'd like to place the new
>> >> code? Also, would "UriPolicy" remain an adequate name for the class, then?
>> >>
>> >> Best,
>> >> Uros
>> >>
>> >>
>> >> > Maybe something like:
>> >> >
>> >> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>> >> >
>> >> > where you specify a list of (languageTag:transliterator) separated by ';'
>> >> > for one language?
>> >> > The transliterator could be either "identity" (no transformation) or an
>> >> > icu4j transliterator-ID.
>> >> >
>> >> > As Dimitris said, Uros please feel free to ask if you need help!
>> >> >
>> >> > Cheers
>> >> > Andrea
>> >> >
>> >> >
>> >> > 2013/11/30 Dimitris Kontokostas <[email protected]>
>> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna <[email protected]> wrote:
>> >> >>
>> >> >>> Hello Uros,
>> >> >>>
>> >> >>> that's a really interesting problem :)
>> >> >>> I am no expert either but probably the best approach would be to
>> >> >>> "duplicate" triples when they are going to be written (e.g. in the
>> >> >>> destinations package), instead of modifying the extractors.
>> >> >>>
>> >> >>
>> >> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to do
>> >> >> string object transformations (now it only applies to URIs / IRIs)
>> >> >> and use the configuration files to select the desired output [2].
>> >> >> Uros, do you want to give it a shot? You can always ask for help here ;)
>> >> >>
>> >> >> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>> >> >> [2] https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>> >> >>
>> >> >>
>> >> >>> As for which tools to use, it looks like the icu4j Transliterator
>> >> >>> suits your needs, e.g.
>> >> >>>
>> >> >>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate(
>> >> >>>   "Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>> >> >>>
>> >> >>> results in
>> >> >>>
>> >> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova
>> >> >>> zvezda.
>> >> >>>
>> >> >>> What do you think?
>> >> >>>
>> >> >>> Cheers
>> >> >>>  Andrea
>> >> >>>
>> >> >>>
>> >> >>> 2013/11/29 Uros Milosevic <[email protected]>
>> >> >>>
>> >> >>>> Hi all,
>> >> >>>>
>> >> >>>> As some of you may know, a Serbian version of DBpedia is currently in
>> >> >>>> the works. Now, Serbian, unlike any other language in Europe, is
>> >> >>>> digraphic in nature, officially supporting both (Serbian) Cyrillic and
>> >> >>>> (Gaj's) Latin alphabet. This is absolutely fine for storing information
>> >> >>>> in any modern knowledge base, but can often be a major obstacle for
>> >> >>>> information retrieval.
>> >> >>>>
>> >> >>>> For instance, most Serbs rely on the Latin alphabet for
>> >> >>>> communication/interaction on the Web. That means a large portion of the
>> >> >>>> information is (and, often, expected to be) encoded in ISO 8859-2 (i.e.
>> >> >>>> Latin-2). And, yet, 99% of the information in the Serbian Wikipedia dumps
>> >> >>>> is encoded in ISO 15924 (i.e. Cyrillic). So, unless your software
>> >> >>>> performs romanization (i.e. converts Cyrillic to Latin) or cyrillization
>> >> >>>> (i.e. vice versa) on-the-fly, at retrieval time (Wikipedia appears to be
>> >> >>>> doing this), many attempts at information extraction will be doomed to
>> >> >>>> fail. This directly affects common tasks such as keyword search,
>> >> >>>> label-based SPARQL querying, named entity recognition, etc.
>> >> >>>>
>> >> >>>> What I would like to do is improve some of the existing DBpedia
>> >> >>>> extractors, or develop new ones, that would take this problem into
>> >> >>>> consideration and perform romanization of Wikipedia dumps so as to output
>> >> >>>> information encoded in *both* scripts. Now, I know storing the same
>> >> >>>> information twice might not be the most elegant solution, but unless
>> >> >>>> someone is to include romanization/cyrillization features in the next
>> >> >>>> version of SPARQL, I don't see a better solution at the moment. Of
>> >> >>>> course, there is also the matter of perspective - one could argue that
>> >> >>>> although the information is the same, the very fact that different
>> >> >>>> character sequences are needed to describe the same piece of knowledge
>> >> >>>> makes this problem fall into the domain of multilinguality.
>> >> >>>>
>> >> >>>> So, the general idea is to use a single IRI per resource, but have two
>> >> >>>> separate triples for any literal originally encoded in Cyrillic. For
>> >> >>>> example:
>> >> >>>>
>> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label>
>> >> >>>> "Парсер"@sr-Cyrl .
>> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>> >> >>>>
>> >> >>>> The above language tags are as per IANA Language Subtag Registry [1],
>> >> >>>> which lists them as redundant, though, so a "sr" tag, instead, could be
>> >> >>>> enough for both.
>> >> >>>>
>> >> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or any other
>> >> >>>> information that would help me get started would be much appreciated!
>> >> >>>>
>> >> >>>> Best,
>> >> >>>> Uros
>> >> >>>>
>> >> >>>> [1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>> >> >>>>
>> >> >>>> ------------------------------------------------------------------------------
>> >> >>>> Rapidly troubleshoot problems before they affect your business.
>> >> >>>> Most
>> >> >>>> IT
>> >> >>>> organizations don't have a clear picture of how application
>> >> >>>> performance
>> >> >>>> affects their revenue. With AppDynamics, you get 100% visibility
>> >> >>>> into
>> >> >>>> your
>> >> >>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>> >> >>>> AppDynamics Pro!
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>> >> >>>> _______________________________________________
>> >> >>>> Dbpedia-developers mailing list
>> >> >>>> [email protected]
>> >> >>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>> >> >>>>
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Kontokostas Dimitris
>> >> >>
>> >> >
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Kontokostas Dimitris
>> >
>
>
