Re: [Dbpedia-developers] Support for digraphia

Andrea Di Menna Tue, 03 Dec 2013 07:55:42 -0800

Hi,

I agree with JC that UriPolicy is probably not the best place.
As I understand Uros' use case, what he would like to obtain is a
duplication of quads.
This should probably be done in the formatters, or maybe as a
post-processing operation?


The problem is the following:
- some languages are officially digraphic, that is, they can be written in
two different scripts (e.g. Latin and Cyrillic)
- Serbian (sr) is a digraphic language (Latin and Cyrillic)
- Serbian Wikipedia lets users view articles in either script, e.g.
Cyrillic:
https://sr.wikipedia.org/sr-ec/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
Latin:
https://sr.wikipedia.org/sr-el/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
- Wikipedia dumps do not contain both versions, only Cyrillic in 99% of
the cases
- if you were to extract string objects from the sr dump you would get
Cyrillic almost everywhere, for labels and for template property values

Uros is wondering what would happen if a Serbian user searched with, for
example, the Latin transliteration of a Cyrillic label (e.g. through
SPARQL on Virtuoso).
Their search would probably fail (unless Virtuoso implements
transliteration on-the-fly).
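
To make the failure concrete, here is a tiny sketch in Scala. The index is a
hypothetical in-memory stand-in for the triple store (LabelLookup,
cyrillicOnly and bothScripts are names made up for illustration), and the
lookup is an exact string match, which is roughly what comparing a literal
in a query amounts to:

```scala
object LabelLookup {
  // Hypothetical stand-in for a label index: label -> resource IRI.
  val cyrillicOnly: Map[String, String] = Map(
    "Парсер" -> "http://sr.dbpedia.org/resource/Парсер"
  )

  // The same index with the Latin form of the label stored as well.
  val bothScripts: Map[String, String] =
    cyrillicOnly + ("Parser" -> "http://sr.dbpedia.org/resource/Парсер")

  // Exact string match: no script folding of any kind happens here.
  def find(index: Map[String, String], label: String): Option[String] =
    index.get(label)
}
```

With only the Cyrillic label stored, looking up "Parser" finds nothing;
once the Latin label is stored too, the same lookup returns the resource.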

Romanization and cyrillization are transliteration methods which are also
available through ICU4J [
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Transliterator.html]
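
Just to illustrate what romanization amounts to for Serbian, below is a
hand-written character table in plain Scala. It is not ICU4J, and I am not
claiming it reproduces the output of any ICU transliterator ID; the names
SrLatin and romanize are made up for this sketch, and the mapping is the
usual Serbian Cyrillic to Gaj's Latin correspondence:

```scala
object SrLatin {
  // Lowercase Serbian Cyrillic -> Gaj's Latin; uppercase is derived below.
  private val lower: Map[Char, String] = Map(
    'а' -> "a", 'б' -> "b", 'в' -> "v", 'г' -> "g", 'д' -> "d", 'ђ' -> "đ",
    'е' -> "e", 'ж' -> "ž", 'з' -> "z", 'и' -> "i", 'ј' -> "j", 'к' -> "k",
    'л' -> "l", 'љ' -> "lj", 'м' -> "m", 'н' -> "n", 'њ' -> "nj", 'о' -> "o",
    'п' -> "p", 'р' -> "r", 'с' -> "s", 'т' -> "t", 'ћ' -> "ć", 'у' -> "u",
    'ф' -> "f", 'х' -> "h", 'ц' -> "c", 'ч' -> "č", 'џ' -> "dž", 'ш' -> "š"
  )

  // Uppercase letters map to a capitalised Latin form, e.g. 'Љ' -> "Lj".
  private val upper: Map[Char, String] =
    lower.map { case (c, s) => c.toUpper -> s.capitalize }

  private val table: Map[Char, String] = lower ++ upper

  // Characters outside the table (digits, punctuation, Latin text) pass
  // through unchanged.
  def romanize(text: String): String =
    text.map(c => table.getOrElse(c, c.toString)).mkString
}
```

With this table, SrLatin.romanize("Парсер") gives "Parser". Words written
entirely in capitals are not handled (Љ always becomes "Lj"), so a real
implementation would need more than a per-character lookup.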

I think it only makes sense to transliterate string-typed values, not
URIs.
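
A rough sketch of that idea, assuming a much simplified quad (the Quad case
class below is hypothetical, not the framework's class) and assuming every
literal tagged "sr" is in Cyrillic, which holds for most but not all of the
dump. The transliterator is passed in as a plain function, so it could be
backed by ICU4J or by anything else:

```scala
object ScriptDuplication {
  // Hypothetical, simplified quad; the real framework class carries more.
  final case class Quad(
    subject: String,
    predicate: String,
    value: String,
    language: Option[String] // None for URI objects, Some(tag) for literals
  )

  // For a literal tagged "sr", emit the original as sr-Cyrl plus a
  // transliterated copy as sr-Latn. Everything else, including all quads
  // whose object is a URI, passes through unchanged. Subjects are never
  // touched, so both copies keep the same IRI.
  def duplicate(quad: Quad, romanize: String => String): Seq[Quad] =
    quad.language match {
      case Some("sr") =>
        Seq(
          quad.copy(language = Some("sr-Cyrl")),
          quad.copy(value = romanize(quad.value), language = Some("sr-Latn"))
        )
      case _ => Seq(quad)
    }
}
```

Whether something like this lives in a formatter or in a separate
post-processing step over the dump files is a separate decision; the
function itself does not care.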

Cheers
Andrea


2013/12/3 Jona Christopher Sahnwaldt <[email protected]>

> Hi all,
>
> I don't think UriPolicy is a good place to do this...
>
> But anyway, I don't understand the problem yet. :-)
>
> Uros, you wrote about ISO 8859-2 and ISO 15924.
>
> ISO 8859-2 is a character encoding, but I'm pretty sure that Wikipedia
> is not using it, and I know that DBpedia is not using it. I think
> Wikipedia uses UTF-8 all over the place. I know that the Wikipedia XML
> dumps are UTF-8 encoded, and so are the DBpedia dumps.
>
> ISO 15924 is not a character encoding, but a way to specify the names
> of scripts. See https://en.wikipedia.org/wiki/ISO_15924
>
> What would romanization or cyrillization do exactly? Is there a
> one-to-one mapping between letters? Or letter sequences?
>
> Cheers,
> JC
>
> On 3 December 2013 16:02, Dimitris Kontokostas <[email protected]> wrote:
> > Hi Uros,
> >
> > Don't worry, as we said we are here to help if you get stuck;) we all
> > started like this.
> >
> > If you look at the formatters package you will understand what's going on.
> > We have formatters that write a triple based on some policies we define.
> > We parse the policies at runtime, create formatters based on these policies
> > and feed them to destinations.
> >
> > I think we should generalize URIPolicy to TriplePolicy and create a
> > "transliterate" action.
> > I made a change in the URIPolicy code to make it more descriptive [1]
> > Right now we have support only for URIs but if you change the following it
> > should be a good start to make your changes
> >
> >   // String: URI or literal; Boolean: is URI or not; result String: output
> >   // (new URI or transliterated string)
> >   type Policy = (String, Boolean) => String
> >
> >   type PolicyApplicable = (String, Boolean) => Boolean
> >
> > I also submitted a feature request [2], you can make a proper description
> > and continue the discussion there
> >
> > Cheers,
> > Dimitris
> >
> >
> > [1] https://github.com/dbpedia/extraction-framework/pull/131
> > [2] https://github.com/dbpedia/extraction-framework/issues/130
> >
> >
> > On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic <[email protected]>
> > wrote:
> >>
> >> Hi Andrea/Dimitris,
> >>
> >> Thanks for the tips. Actually, when I said I was no core expert, I meant I
> >> was an absolute beginner. :) I wanted to go with an extractor because that
> >> seemed simpler (and safer) than meddling with the core. Most of the stuff
> >> in there still seems rather confusing, but I'll look into it.
> >>
> >> So, the UriPolicy code is where the triples get written (pointer to the
> >> exact line, anyone?), or is this simply where you'd like to place the new
> >> code? Also, would "UriPolicy" remain an adequate name for the class, then?
> >>
> >> Best,
> >> Uros
> >>
> >>
> >> > Maybe something like:
> >> >
> >> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
> >> >
> >> > where you specify a list of (languageTag:transliterator) pairs separated
> >> > by ';' for one language?
> >> > The transliterator could be either "identity" (no transformation) or an
> >> > icu4j transliterator ID.
> >> >
> >> > As Dimitris said, Uros please feel free to ask if you need help!
> >> >
> >> > Cheers
> >> > Andrea
> >> >
> >> >
> >> > 2013/11/30 Dimitris Kontokostas <[email protected]>
> >> >
> >> >>
> >> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna
> >> >> <[email protected]> wrote:
> >> >>
> >> >>> Hello Uros,
> >> >>>
> >> >>> that's a really interesting problem :)
> >> >>> I am no expert either but probably the best approach would be to
> >> >>> "duplicate" triples when they are going to be written (e.g. in the
> >> >>> destinations package), instead of modifying the extractors.
> >> >>>
> >> >>
> >> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to do
> >> >> string object transformations (now it only applies to URIs / IRIs)
> >> >> and use the configuration files to select the desired output [2].
> >> >> Uros, do you want to give it a shot? You can always ask for help here ;)
> >> >>
> >> >> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
> >> >> [2] https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
> >> >>
> >> >>
> >> >>> As for which tools to use, it looks like the icu4j Transliterator
> >> >>> suits your needs, e.g.
> >> >>>
> >> >>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate("Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
> >> >>>
> >> >>> results in
> >> >>>
> >> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova zvezda.
> >> >>>
> >> >>> What do you think?
> >> >>>
> >> >>> Cheers
> >> >>>  Andrea
> >> >>>
> >> >>>
> >> >>> 2013/11/29 Uros Milosevic <[email protected]>
> >> >>>
> >> >>>> Hi all,
> >> >>>>
> >> >>>> As some of you may know, a Serbian version of DBpedia is currently in
> >> >>>> the works. Now, Serbian, unlike any other language in Europe, is
> >> >>>> digraphic in nature, officially supporting both (Serbian) Cyrillic and
> >> >>>> (Gaj's) Latin alphabet. This is absolutely fine for storing information
> >> >>>> in any modern knowledge base, but can often be a major obstacle for
> >> >>>> information retrieval.
> >> >>>>
> >> >>>> For instance, most Serbs rely on the Latin alphabet for
> >> >>>> communication/interaction on the Web. That means a large portion of the
> >> >>>> information is (and, often, expected to be) encoded in ISO 8859-2 (i.e.
> >> >>>> Latin-2). And, yet, 99% of the information in the Serbian Wikipedia
> >> >>>> dumps is encoded in ISO 15924 (i.e. Cyrillic). So, unless your software
> >> >>>> performs romanization (i.e. converts Cyrillic to Latin) or
> >> >>>> cyrillization (i.e. vice versa) on-the-fly, at retrieval time
> >> >>>> (Wikipedia appears to be doing this), many attempts at information
> >> >>>> extraction will be doomed to fail. This directly affects common tasks
> >> >>>> such as keyword search, label-based SPARQL querying, named entity
> >> >>>> recognition, etc.
> >> >>>>
> >> >>>> What I would like to do is improve some of the existing DBpedia
> >> >>>> extractors, or develop new ones, that would take this problem into
> >> >>>> consideration and perform romanization of Wikipedia dumps so as to
> >> >>>> output information encoded in *both* scripts. Now, I know storing the
> >> >>>> same information twice might not be the most elegant solution, but
> >> >>>> unless someone is to include romanization/cyrillization features in the
> >> >>>> next version of SPARQL, I don't see a better solution at the moment. Of
> >> >>>> course, there is also the matter of perspective - one could argue that
> >> >>>> although the information is the same, the very fact that different
> >> >>>> character sequences are needed to describe the same piece of knowledge
> >> >>>> makes this problem fall into the domain of multilinguality.
> >> >>>>
> >> >>>> So, the general idea is to use a single IRI per resource, but have two
> >> >>>> separate triples for any literal originally encoded in cyrillic. For
> >> >>>> example:
> >> >>>>
> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
> >> >>>>
> >> >>>> The above language tags are as per IANA Language Subtag Registry [1],
> >> >>>> which lists them as redundant, though, so a "sr" tag, instead, could be
> >> >>>> enough for both.
> >> >>>>
> >> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or any other
> >> >>>> information that would help me get started would be much appreciated!
> >> >>>>
> >> >>>> Best,
> >> >>>> Uros
> >> >>>>
> >> >>>> [1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
> >> >>>>
> >> >>>> _______________________________________________
> >> >>>> Dbpedia-developers mailing list
> >> >>>> [email protected]
> >> >>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
> >> >>>>
> >> >>
> >> >>
> >> >> --
> >> >> Kontokostas Dimitris
> >> >>
> >> >
> >>
> >>
> >
> >
> >
> > --
> > Kontokostas Dimitris
> >
