Re: [Dbpedia-developers] Support for digraphia

Jona Christopher Sahnwaldt Tue, 03 Dec 2013 10:56:13 -0800

On 3 December 2013 18:19, Andrea Di Menna <[email protected]> wrote:
> 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>>
>> On 3 December 2013 16:54, Andrea Di Menna <[email protected]> wrote:
>> > Hi,
>> >
>> > I agree with JC that probably UriPolicy is not the best place.
>>
>> I guess extending UriPolicy looks attractive because modifying
>> literals has some common needs with modifying URIs. But we should
>> rather introduce a new class StringLiteralPolicy or so and move some
>> code from UriPolicy to a common base class. Maybe we can share the
>> policy parsing code etc. But literals and URIs are quite different and
>> should probably be handled by different classes.
>>
>> Maybe we need a new Destination subclass too (or instead). Actually,
>> if we follow YAGNI and KISS principles we should simply use a
>> SerbianTransliterationDestination...
>>
>> > As I understand Uros's use case, what he wants is a duplication of
>> > quads. Probably this should be done in the Formatters, or maybe as a
>> > post-processing operation?
>> >
>> > The problem is the following:
>> > - some languages are officially digraphic, that is, they can use two
>> > different scripts (e.g. Latin and Cyrillic)
>> > - Serbian (sr) is a digraphic language (Latin and Cyrillic)
>> > - Serbian Wikipedia lets users view articles in Latin or Cyrillic, e.g.
>> > Cyrillic:
>> > https://sr.wikipedia.org/sr-ec/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
>> > Latin:
>> > https://sr.wikipedia.org/sr-el/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
>> > - Wikipedia dumps do not contain both versions, only Cyrillic in 99% of
>> > the cases
>> > - if you were to extract string objects from the sr dump you would get
>> > Cyrillic almost everywhere, for labels or for template property values
>>
>> I just looked at a few pages in the Serbian Wikipedia.
>>
>> There is a piece of MediaWiki syntax that I hadn't seen before:
>> wrapping text in -{...}- keeps it from being transliterated. In an
>> ideal world, we would extend the DBpedia parser to handle this...
>>
>> There are actually three ways a Serbian Wikipedia page can be
>> displayed: unchanged, transliterated to Cyrillic, transliterated to
>> Latin. For example, I put this wiki text on my Serbian Wikipedia user
>> page:
>>
>> Unprotected: Test
>> Protected: -{Test}-
>> Unprotected: Парсер
>> Protected: -{Парсер}-
>>
>> Depending on the URL, it is displayed in different ways:
>>
>> http://sr.wikipedia.org/wiki/Корисник:Chrisahn or
>> http://sr.wikipedia.org/sr/Корисник:Chrisahn - unmodified
>>
>> Unprotected: Test
>> Protected: Test
>> Unprotected: Парсер
>> Protected: Парсер
>>
>> http://sr.wikipedia.org/sr-ec/Корисник:Chrisahn - transliterated to
>> Cyrillic unless protected
>>
>> Унпротецтед: Тест
>> Протецтед: Test
>> Унпротецтед: Парсер
>> Протецтед: Парсер
>>
>> http://sr.wikipedia.org/sr-el/Корисник:Chrisahn - transliterated to
>> Latin unless protected
>>
>> Unprotected: Test
>> Protected: Test
>> Unprotected: Parser
>> Protected: Парсер
>>
>
> But still the content in the dumps will be the same, i.e. the wikitext you
> have saved in your page.
> No matter how you render it on the Mediawiki instance which hosts it.
> Correct?


Correct.
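
Because the dump only ever contains that raw wiki text, anything that wants
the Latin rendering has to redo the conversion and respect the -{...}-
markers itself. Below is a minimal sketch of that idea in Scala. Everything
in it is an assumption for illustration: the object name, the hand-written
letter table (a simplified Serbian Cyrillic to Gaj's Latin mapping, not
MediaWiki's actual converter and not ICU4J), and the regex used to find the
markers.

```scala
object ProtectedTransliteration {
  // Hand-written, simplified Serbian Cyrillic -> Latin table (illustrative only).
  private val cyrToLat: Map[Char, String] = Map(
    'а' -> "a", 'б' -> "b", 'в' -> "v", 'г' -> "g", 'д' -> "d", 'ђ' -> "đ",
    'е' -> "e", 'ж' -> "ž", 'з' -> "z", 'и' -> "i", 'ј' -> "j", 'к' -> "k",
    'л' -> "l", 'љ' -> "lj", 'м' -> "m", 'н' -> "n", 'њ' -> "nj", 'о' -> "o",
    'п' -> "p", 'р' -> "r", 'с' -> "s", 'т' -> "t", 'ћ' -> "ć", 'у' -> "u",
    'ф' -> "f", 'х' -> "h", 'ц' -> "c", 'ч' -> "č", 'џ' -> "dž", 'ш' -> "š"
  )

  // Matches a protected span and captures its content (assumed marker syntax: -{...}-).
  private val Protected = """-\{(.*?)\}-""".r

  // Converts one character; uppercase letters are looked up via their lowercase
  // form and the result is capitalized. Anything not in the table is kept as-is.
  private def romanizeChar(c: Char): String =
    cyrToLat.get(c)
      .orElse(cyrToLat.get(c.toLower).map(_.capitalize))
      .getOrElse(c.toString)

  private def romanize(s: String): String = {
    val sb = new StringBuilder
    s.foreach(c => sb.append(romanizeChar(c)))
    sb.toString
  }

  // Romanizes everything outside -{...}- and copies protected content verbatim,
  // dropping the markers themselves.
  def toLatin(wikiText: String): String = {
    val out = new StringBuilder
    var last = 0
    for (m <- Protected.findAllMatchIn(wikiText)) {
      out.append(romanize(wikiText.substring(last, m.start)))
      out.append(m.group(1))
      last = m.end
    }
    out.append(romanize(wikiText.substring(last)))
    out.toString
  }
}
```

Applied to the four test lines from the user page, this should reproduce
the sr-el rendering quoted above: unprotected "Парсер" becomes "Parser",
while "-{Парсер}-" stays "Парсер".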

>
>>
>>
>> >
>> > Uros is wondering what would happen if a Serbian user searches using,
>> > for example, the Latin transliteration of a Cyrillic label (e.g. via
>> > SPARQL on Virtuoso).
>> > Their search would probably fail (unless Virtuoso implements
>> > transliteration on-the-fly).
>> >
>> > Romanization or Cyrillization are transliteration methods which are also
>> > available through ICU4J
>> >
>> > [http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Transliterator.html]
>>
>> Looks good, but is there an implementation for Serbian? If there
>> isn't, this probably won't help us much. Not enough to justify adding
>> ICU4J as a new dependency, I think.
>>
>
> Yes, there is a Transliterator with ID "Serbian-Latin/BGN" (there is a list at
> http://www.avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html,
> though I don't know if it is still valid).
> I have made some quick tests and it seems to work OK.

Cool!

>
>>
>> >
>> > I think it makes sense to transliterate only string-typed values,
>> > not URIs.
>>
>> I don't know. Wikipedia seems to have some elaborate rules in place as
>> far as Latin/Cyrillic URLs are concerned. Maybe we should follow these
>> rules too?
>>
>
> Are the "preserve" rules also applied to wikilinks? If they are not then I
> think we should not apply transliteration to URIs.

According to a few tests on my user page, the text (title) displayed
for a Wiki link is transliterated unless it's "protected" by -{...}-.
The actual link target is *always* the Cyrillic version, even if the
wiki text contains the Latin article name. Example: [[Johan Volfgang
Gete]] always results in a link to
http://sr.wikipedia.org/wiki/Јохан_Волфганг_Гете .

If we want DBpedia to use the same policy, then we *should*
transliterate URIs. Currently, we always use the link target as it
appears in the wiki source text. Example: for [[Johan Volfgang Gete]],
we generate a link to
http://sr.dbpedia.org/resource/Johan_Volfgang_Gete . To be consistent
with Wikipedia, the link should point to
http://sr.dbpedia.org/resource/Јохан_Волфганг_Гете instead.
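
To make the link-target case concrete, here is a rough Scala sketch of
mapping a Latin wiki link title to the Cyrillic resource URI. It is only an
illustration under stated assumptions: the object and method names are
hypothetical, the letter table is hand-written, and it ignores whatever
exceptions Wikipedia's real conversion rules have (for instance words where
"nj", "lj" or "dž" are two separate letters, or foreign names that must
stay Latin).

```scala
import java.util.Locale

object LinkTargetCyrillization {
  // Hand-written Gaj's Latin -> Serbian Cyrillic table, including the three digraphs.
  private val latToCyr: Map[String, String] = Map(
    "a" -> "а", "b" -> "б", "v" -> "в", "g" -> "г", "d" -> "д", "đ" -> "ђ",
    "e" -> "е", "ž" -> "ж", "z" -> "з", "i" -> "и", "j" -> "ј", "k" -> "к",
    "l" -> "л", "lj" -> "љ", "m" -> "м", "n" -> "н", "nj" -> "њ", "o" -> "о",
    "p" -> "п", "r" -> "р", "s" -> "с", "t" -> "т", "ć" -> "ћ", "u" -> "у",
    "f" -> "ф", "h" -> "х", "c" -> "ц", "č" -> "ч", "dž" -> "џ", "š" -> "ш"
  )

  def toCyrillic(latin: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < latin.length) {
      // Prefer a two-letter match (lj, nj, dž) over a single-letter match.
      val step = Seq(2, 1).find { n =>
        i + n <= latin.length &&
          latToCyr.contains(latin.substring(i, i + n).toLowerCase(Locale.ROOT))
      }
      step match {
        case Some(n) =>
          val cyr = latToCyr(latin.substring(i, i + n).toLowerCase(Locale.ROOT))
          // Keep the case of the first Latin letter of the matched unit.
          out.append(if (latin.charAt(i).isUpper) cyr.toUpperCase(Locale.ROOT) else cyr)
          i += n
        case None =>
          // Spaces, digits, punctuation and unknown letters pass through unchanged.
          out.append(latin.charAt(i))
          i += 1
      }
    }
    out.toString
  }

  // Hypothetical helper: builds the sr.dbpedia.org resource URI for a link title.
  def resourceUri(linkTitle: String): String =
    "http://sr.dbpedia.org/resource/" + toCyrillic(linkTitle).replace(' ', '_')
}
```

With this table, resourceUri("Johan Volfgang Gete") should yield the
Cyrillic URI given in the example above.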

The main problem I see with transliterating URIs is configuration.
That's one of the main problems of DBpedia in general: we're putting
too much effort into parsing configuration files. To allow
transliteration of URIs, we have to extend the UriPolicy syntax and
parser, which are already pretty convoluted. If we used something like
Spring instead of self-made configuration stuff, we would simply add a
class and reference the class in the configuration. Additionally, we
should use different configuration objects for each language. That
doesn't have to mean that we need a separate configuration file for
each language, just that we have to initialize the extraction
framework differently for each language. This would also make UriPolicy
configuration easier.
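
As a sketch of what per-language configuration objects could look like,
the following Scala snippet parses the "script.sr=..." property syntax that
Andrea proposed (quoted further down) into one list of script variants per
language. The names ScriptConfig, ScriptVariant, parseVariants and
perLanguage are made up for this example and are not part of the extraction
framework; the input is a plain Map rather than the framework's real
configuration classes.

```scala
object ScriptConfig {
  // One output variant: the language tag to emit and the transliterator to apply
  // ("identity" or a transliterator ID such as "Serbian-Latin/BGN").
  final case class ScriptVariant(languageTag: String, transliterator: String)

  // Parses a property value such as "sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN".
  def parseVariants(value: String): List[ScriptVariant] =
    value.split(';').toList.map(_.trim).filter(_.nonEmpty).map { entry =>
      // Split on the first ':' only, so the transliterator ID is kept intact.
      entry.split(":", 2) match {
        case Array(tag, translit) => ScriptVariant(tag.trim, translit.trim)
        case _ =>
          throw new IllegalArgumentException(
            "expected languageTag:transliterator, got '" + entry + "'")
      }
    }

  // Collects every "script.<lang>" key into a map from language code to its variants;
  // all other keys are ignored.
  def perLanguage(props: Map[String, String]): Map[String, List[ScriptVariant]] =
    props.collect {
      case (key, value) if key.startsWith("script.") =>
        key.stripPrefix("script.") -> parseVariants(value)
    }
}
```

Each language then gets its own small, typed configuration value, and the
code that builds destinations or formatters for "sr" only has to look at
the entry for "sr".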

JC

>
> Cheers!
> Andrea
>
>>
>> Cheers,
>> JC
>>
>> >
>> > Cheers
>> > Andrea
>> >
>> >
>> > 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>> >>
>> >> Hi all,
>> >>
>> >> I don't think UriPolicy is a good place to do this...
>> >>
>> >> But anyway, I don't understand the problem yet. :-)
>> >>
>> >> Uros, you wrote about ISO 8859-2 and ISO 15924.
>> >>
>> >> ISO 8859-2 is a character encoding, but I'm pretty sure that Wikipedia
>> >> is not using it, and I know that DBpedia is not using it. I think
>> >> Wikipedia uses UTF-8 all over the place. I know that the Wikipedia XML
>> >> dumps are UTF-8 encoded, and so are the DBpedia dumps.
>> >>
>> >> ISO 15924 is not a character encoding, but a way to specify the names
>> >> of scripts. See https://en.wikipedia.org/wiki/ISO_15924
>> >>
>> >> What would romanization or cyrillization do exactly? Is there a
>> >> one-to-one mapping between letters? Or letter sequences?
>> >>
>> >> Cheers,
>> >> JC
>> >>
>> >> On 3 December 2013 16:02, Dimitris Kontokostas <[email protected]>
>> >> wrote:
>> >> > Hi Uros,
>> >> >
>> >> > Don't worry, as we said, we are here to help if you get stuck ;) We
>> >> > all started like this.
>> >> >
>> >> > If you look at the formatters package you will understand what's
>> >> > going on. We have formatters that write a triple based on some
>> >> > policies we define. We parse the policies at runtime, create
>> >> > formatters based on these policies and feed them to destinations.
>> >> >
>> >> > I think we should generalize UriPolicy to TriplePolicy and create a
>> >> > "transliterate" action.
>> >> > I made a change in the UriPolicy code to make it more descriptive [1].
>> >> > Right now we only support URIs, but changing the following should be
>> >> > a good start for your changes:
>> >> >
>> >> >   // String: URI or literal, Boolean: is URI or not,
>> >> >   // result String: output (new URI or transliterated string)
>> >> >   type Policy = (String, Boolean) => String
>> >> >
>> >> >   type PolicyApplicable = (String, Boolean) => Boolean
>> >> >
>> >> > I also submitted a feature request [2]; you can add a proper
>> >> > description and continue the discussion there.
>> >> >
>> >> > Cheers,
>> >> > Dimitris
>> >> >
>> >> >
>> >> > [1] https://github.com/dbpedia/extraction-framework/pull/131
>> >> > [2] https://github.com/dbpedia/extraction-framework/issues/130
>> >> >
>> >> >
>> >> > On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic
>> >> > <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Hi Andrea/Dimitris,
>> >> >>
>> >> >> Thanks for the tips. Actually, when I said I was no core expert, I
>> >> >> meant I was an absolute beginner. :) I wanted to go with an
>> >> >> extractor because that seemed simpler (and safer) than meddling
>> >> >> with the core. Most of the stuff in there still seems rather
>> >> >> confusing, but I'll look into it.
>> >> >>
>> >> >> So, is the UriPolicy code where the triples get written (pointer to
>> >> >> the exact line, anyone?), or is this simply where you'd like to
>> >> >> place the new code? Also, would "UriPolicy" remain an adequate name
>> >> >> for the class, then?
>> >> >>
>> >> >> Best,
>> >> >> Uros
>> >> >>
>> >> >>
>> >> >> > Maybe something like:
>> >> >> >
>> >> >> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>> >> >> >
>> >> >> > where you specify a list of (languageTag:transliterator) pairs
>> >> >> > separated by ';' for one language?
>> >> >> > The transliterator could be either "identity" (no transformation)
>> >> >> > or an ICU4J transliterator ID.
>> >> >> >
>> >> >> > As Dimitris said, Uros please feel free to ask if you need help!
>> >> >> >
>> >> >> > Cheers
>> >> >> > Andrea
>> >> >> >
>> >> >> >
>> >> >> > 2013/11/30 Dimitris Kontokostas <[email protected]>
>> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna
>> >> >> >> <[email protected]> wrote:
>> >> >> >>
>> >> >> >>> Hello Uros,
>> >> >> >>>
>> >> >> >>> that's a really interesting problem :)
>> >> >> >>> I am no expert either but probably the best approach would be to
>> >> >> >>> "duplicate" triples when they are going to be written (e.g. in
>> >> >> >>> the
>> >> >> >>> destinations package), instead of modifying the extractors.
>> >> >> >>>
>> >> >> >>
>> >> >> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to
>> >> >> >> do string object transformations (now it only applies to URIs /
>> >> >> >> IRIs) and use the configuration files to select the desired
>> >> >> >> output [2].
>> >> >> >> Uros, do you want to give it a shot? You can always ask for help
>> >> >> >> here ;)
>> >> >> >>
>> >> >> >> [1]
>> >> >> >> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>> >> >> >> [2]
>> >> >> >> https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>> >> >> >>
>> >> >> >>
>> >> >> >>> As for which tools to use, it looks like the ICU4J Transliterator
>> >> >> >>> suits your needs, e.g.
>> >> >> >>>
>> >> >> >>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate(
>> >> >> >>>   "Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>> >> >> >>>
>> >> >> >>> results in
>> >> >> >>>
>> >> >> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu
>> >> >> >>> Ratova zvezda.
>> >> >> >>>
>> >> >> >>> What do you think?
>> >> >> >>>
>> >> >> >>> Cheers
>> >> >> >>>  Andrea
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> 2013/11/29 Uros Milosevic <[email protected]>
>> >> >> >>>
>> >> >> >>>> Hi all,
>> >> >> >>>>
>> >> >> >>>> As some of you may know, a Serbian version of DBpedia is
>> >> >> >>>> currently in the works. Now, Serbian, unlike any other language
>> >> >> >>>> in Europe, is digraphic in nature, officially supporting both
>> >> >> >>>> (Serbian) Cyrillic and (Gaj's) Latin alphabet. This is
>> >> >> >>>> absolutely fine for storing information in any modern knowledge
>> >> >> >>>> base, but can often be a major obstacle for information
>> >> >> >>>> retrieval.
>> >> >> >>>>
>> >> >> >>>> For instance, most Serbs rely on the Latin alphabet for
>> >> >> >>>> communication/interaction on the Web. That means a large
>> >> >> >>>> portion of the information is (and, often, expected to be)
>> >> >> >>>> encoded in ISO 8859-2 (i.e. Latin-2). And, yet, 99% of the
>> >> >> >>>> information in the Serbian Wikipedia dumps is encoded in ISO
>> >> >> >>>> 15924 (i.e. Cyrillic). So, unless your software performs
>> >> >> >>>> romanization (i.e. converts Cyrillic to Latin) or cyrillization
>> >> >> >>>> (i.e. vice versa) on-the-fly, at retrieval time (Wikipedia
>> >> >> >>>> appears to be doing this), many attempts at information
>> >> >> >>>> extraction will be doomed to fail. This directly affects common
>> >> >> >>>> tasks such as keyword search, label-based SPARQL querying,
>> >> >> >>>> named entity recognition, etc.
>> >> >> >>>>
>> >> >> >>>> What I would like to do is improve some of the existing DBpedia
>> >> >> >>>> extractors, or develop new ones, that would take this problem
>> >> >> >>>> into consideration and perform romanization of Wikipedia dumps
>> >> >> >>>> so as to output information encoded in *both* scripts. Now, I
>> >> >> >>>> know storing the same information twice might not be the most
>> >> >> >>>> elegant solution, but unless someone is to include
>> >> >> >>>> romanization/cyrillization features in the next version of
>> >> >> >>>> SPARQL, I don't see a better solution at the moment. Of course,
>> >> >> >>>> there is also the matter of perspective - one could argue that
>> >> >> >>>> although the information is the same, the very fact that
>> >> >> >>>> different character sequences are needed to describe the same
>> >> >> >>>> piece of knowledge makes this problem fall into the domain of
>> >> >> >>>> multilinguality.
>> >> >> >>>>
>> >> >> >>>> So, the general idea is to use a single IRI per resource, but
>> >> >> >>>> have two separate triples for any literal originally encoded in
>> >> >> >>>> Cyrillic. For example:
>> >> >> >>>>
>> >> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
>> >> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>> >> >> >>>>
>> >> >> >>>> The above language tags are as per IANA Language Subtag
>> >> >> >>>> Registry [1], which lists them as redundant, though, so a "sr"
>> >> >> >>>> tag, instead, could be enough for both.
>> >> >> >>>>
>> >> >> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or
>> >> >> >>>> any other information that would help me get started would be
>> >> >> >>>> much appreciated!
>> >> >> >>>>
>> >> >> >>>> Best,
>> >> >> >>>> Uros
>> >> >> >>>>
>> >> >> >>>> [1]
>> >> >> >>>> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>> >> >> >>>>
>> >> >> >>>> _______________________________________________
>> >> >> >>>> Dbpedia-developers mailing list
>> >> >> >>>> [email protected]
>> >> >> >>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>> >> >> >>>>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Kontokostas Dimitris
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Kontokostas Dimitris
>> >> >
>> >>
>> >
>> >
>
>
