Re: [Dbpedia-developers] Support for digraphia

Uros Milosevic Wed, 04 Dec 2013 23:42:31 -0800

> On Wed, Dec 4, 2013 at 12:17 PM, Uros Milosevic
> <[email protected]>wrote:
>
>> And here I was, thinking this would be simple. :)
>>
>> I really enjoyed myself reading about all the little details. JC, please,
>> don't give up! :)
>>
>> > I don't know why this is a problem. For Greek we have many pages with
>> > English names too
>> > i.e. http://el.wikipedia.org/wiki/ASCII
>> > http://el.wikipedia.org/wiki/World_Wide_Web
>> >
>> > I see the following options here
>> > A) For URIs:
>> > 1) leave title as we get it from the Wikipedia dumps (suggested option),
>> >    since we might get some links to the other script so we can create
>> >    sameAs links with a new extractor (easy)
>> > 2) give the option to transliterate *all* URIs to a preferred script (we
>> >    might miss some semantics when Latin was intended and we choose a
>> >    non-latin script)
>>
>> The first option definitely makes more sense.
>>
>> >
>> > B) for literals:
>> > Make an option to transliterate to a preferred transliteration as
>> > discussed in the beginning
>> > We don't need to handle "preserve" in the parser since the only place we
>> > might need it is the parser and this is already handled by the mw engine
>> >
>> > The general outcome so far (if I understood correctly) would be to
>> > create a general class i.e. TriplePolicy that would handle policy parsing
>> > UriPolicy will extend TriplePolicy and
>> > create a LiteralPolicy class that will handle literal values
>> >
>> > and maybe create a TransliterateSameAs extractor
>> >
>> > @Uros, you are the language expert here ;) can you suggest anything
>> > different?
>>
>> Finally, I get to feel like an expert on something. :) I think you summed
>> it up nicely. The suggested solution sounds reasonable, although I'm a
>> little scared now and not sure I'd be of much help. Please do let me know
>> if there's anything I can do, though.
>>
>
> For us this is a (very) low priority feature request and we have some major
> stuff to work on for the next months.
> If you are willing to try we will of course help you and peer review your code
> but other than that we cannot promise to implement this soon
>


I understand that, and don't expect anyone to break their neck over this.
As I said, there's much that's still unclear to me, but I'll look into it
and report back should I find it just too difficult to handle. I certainly
appreciate all your time, effort, tips and comments.

Best,
Uros

> Best,
> Dimitris
>
>
>>
>> Best,
>> Uros
>>
>> >
>> > Cheers,
>> > Dimitris
>> >
>> >
>> >
>> >
>> > On Tue, Dec 3, 2013 at 11:01 PM, Jona Christopher Sahnwaldt
>> > <[email protected]
>> >> wrote:
>> >
>> >> On 3 December 2013 21:34, Jona Christopher Sahnwaldt
>> <[email protected]>
>> >> wrote:
>> >> > On 3 December 2013 20:49, Andrea Di Menna <[email protected]>
>> wrote:
>> >> >> 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>> >> >>>
>> >> >>> On 3 December 2013 18:19, Andrea Di Menna <[email protected]>
>> wrote:
>> >> >>> > 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>> >> >>> >>
>> >> >>> >> On 3 December 2013 16:54, Andrea Di Menna <[email protected]>
>> >> wrote:
>> >> >>> >> > Hi,
>> >> >>> >> >
>> >> >>> >> > I agree with JC that probably UriPolicy is not the best place.
>> >> >>> >>
>> >> >>> >> I guess extending UriPolicy looks attractive because modifying
>> >> >>> >> literals has some common needs with modifying URIs. But we should
>> >> >>> >> rather introduce a new class StringLiteralPolicy or so and move some
>> >> >>> >> code from UriPolicy to a common base class. Maybe we can share the
>> >> >>> >> policy parsing code etc. But literals and URIs are quite different and
>> >> >>> >> should probably be handled by different classes.
>> >> >>> >>
>> >> >>> >> Maybe we need a new Destination subclass too (or instead). Actually,
>> >> >>> >> if we follow YAGNI and KISS principles we should simply use a
>> >> >>> >> SerbianTransliterationDestination...
>> >> >>> >>
>> >> >>> >> > As per Uros use case I understand that what he would like to
>> >> obtain
>> >> >>> >> > is a
>> >> >>> >> > duplication of quads.
>> >> >>> >> > Probably this should be done in the Formatters or maybe as a
>> >> >>> >> > post-processing
>> >> >>> >> > operation?
>> >> >>> >> >
>> >> >>> >> > The problem is the following:
>> >> >>> >> > - some languages are officially digraphic, that is they can use two
>> >> >>> >> > different scripts (e.g. latin and cyrillic scripts)
>> >> >>> >> > - Serbian (sr) is a digraphic language (latin and cyrillic)
>> >> >>> >> > - Serbian wikipedia allows users to see articles in latin and
>> >> >>> >> > cyrillic, e.g.
>> >> >>> >> > cyrillic:
>> >> >>> >> > https://sr.wikipedia.org/sr-ec/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
>> >> >>> >> > latin:
>> >> >>> >> > https://sr.wikipedia.org/sr-el/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
>> >> >>> >> > - wikipedia dumps do not contain both versions but only cyrillic in
>> >> >>> >> > 99% of the cases
>> >> >>> >> > - if you were to extract string objects from the sr dump you would
>> >> >>> >> > get cyrillic almost everywhere, for labels or for template property
>> >> >>> >> > values
>> >> >>> >>
>> >> >>> >> I just looked at a few pages in the Serbian Wikipedia.
>> >> >>> >>
>> >> >>> >> There is a piece of MediaWiki syntax that I hadn't seen before:
>> >> >>> >> wrapping text in -{...}- keeps it from being transliterated. In an
>> >> >>> >> ideal world, we would extend the DBpedia parser to handle this...
>> >> >>> >>
>> >> >>> >> There are actually three ways a Serbian Wikipedia page can be
>> >> >>> >> displayed: unchanged, transliterated to Cyrillic, transliterated to
>> >> >>> >> Latin. For example, I put this wiki text on my Serbian Wikipedia user
>> >> >>> >> page:
>> >> >>> >>
>> >> >>> >> Unprotected: Test
>> >> >>> >> Protected: -{Test}-
>> >> >>> >> Unprotected: Парсер
>> >> >>> >> Protected: -{Парсер}-
>> >> >>> >>
>> >> >>> >> Depending on the URL, it is displayed in different ways:
>> >> >>> >>
>> >> >>> >> http://sr.wikipedia.org/wiki/Корисник:Chrisahn or
>> >> >>> >> http://sr.wikipedia.org/sr/Корисник:Chrisahn - unmodified
>> >> >>> >>
>> >> >>> >> Unprotected: Test
>> >> >>> >> Protected: Test
>> >> >>> >> Unprotected: Парсер
>> >> >>> >> Protected: Парсер
>> >> >>> >>
>> >> >>> >> http://sr.wikipedia.org/sr-ec/Корисник:Chrisahn - transliterated to
>> >> >>> >> Cyrillic unless protected
>> >> >>> >>
>> >> >>> >> Унпротецтед: Тест
>> >> >>> >> Протецтед: Test
>> >> >>> >> Унпротецтед: Парсер
>> >> >>> >> Протецтед: Парсер
>> >> >>> >>
>> >> >>> >> http://sr.wikipedia.org/sr-el/Корисник:Chrisahn - transliterated to
>> >> >>> >> Latin unless protected
>> >> >>> >>
>> >> >>> >> Unprotected: Test
>> >> >>> >> Protected: Test
>> >> >>> >> Unprotected: Parser
>> >> >>> >> Protected: Парсер
>> >> >>> >>
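The behaviour JC observed can be approximated outside MediaWiki. Below is a minimal sketch, assuming only the plain -{...}- form seen in the test above (any richer variants of the markup are ignored). The class name ProtectedSpans and the transliterate parameter are hypothetical; the caller supplies whatever transliterator it likes.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper, not part of DBpedia or MediaWiki. */
final class ProtectedSpans {
    // Matches the simple -{ ... }- form only; non-greedy, may span lines.
    private static final Pattern PROTECTED =
            Pattern.compile("-\\{(.*?)\\}-", Pattern.DOTALL);

    /** Transliterates everything outside -{...}- and copies protected text verbatim. */
    static String convert(String wikiText, UnaryOperator<String> transliterate) {
        StringBuilder out = new StringBuilder();
        Matcher m = PROTECTED.matcher(wikiText);
        int last = 0;
        while (m.find()) {
            // Text before the protected span goes through the transliterator.
            out.append(transliterate.apply(wikiText.substring(last, m.start())));
            // Protected text is kept as written; the markers themselves are dropped.
            out.append(m.group(1));
            last = m.end();
        }
        out.append(transliterate.apply(wikiText.substring(last)));
        return out.toString();
    }
}
```

With an upper-casing function standing in for a real transliterator, convert("a -{b}- c", s -> s.toUpperCase()) leaves the protected "b" untouched and changes the rest.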
>> >> >>> >
>> >> >>> > But still the content in the dumps will be the same, i.e. the
>> >> wikitext
>> >> >>> > you
>> >> >>> > have saved in your page.
>> >> >>> > No matter how you render it on the Mediawiki instance which
>> hosts
>> >> it.
>> >> >>> > Correct?
>> >> >>>
>> >> >>> Correct.
>> >> >>>
>> >> >>> >
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> >
>> >> >>> >> > Uros is wondering what would happen if a serbian user
>> searches
>> >> using
>> >> >>> >> > for
>> >> >>> >> > example the latin transliterated version of a cyrillic label
>> >> (e.g.
>> >> >>> >> > using
>> >> >>> >> > SPARQL on Virtuoso for example).
>> >> >>> >> > Their search would probably fail (unless Virtuoso implements
>> >> >>> >> > transliteration
>> >> >>> >> > on-the-fly).
>> >> >>> >> >
>> >> >>> >> > Romanization or Cyrillization are transliteration methods
>> which
>> >> are
>> >> >>> >> > also
>> >> >>> >> > available through ICU4J
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > [
>> >>
>> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Transliterator.html]
>> >> >>> >>
>> >> >>> >> Looks good, but is there an implementation for Serbian? If
>> there
>> >> >>> >> isn't, this probably won't help us much. Not enough to justify
>> >> adding
>> >> >>> >> ICU4J as a new dependency, I think.
>> >> >>> >>
>> >> >>> >
>> >> >>> > Yes there is a Transliterator with ID "Serbian-Latin/BGN" (a
>> list
>> >> here
>> >> >>> >
>> >> >>> >
>> >>
>> http://www.avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
>> >> ,
>> >> >>> > don't know if this is still valid)
>> >> >>> > I have made some quick tests and it seems to work OK.
>> >> >>>
>> >> >>> Cool!
>> >> >>>
>> >> >>> >
>> >> >>> >>
>> >> >>> >> >
>> >> >>> >> > I think it does not make sense to transliterate URIs but
>> only
>> >> string
>> >> >>> >> > typed
>> >> >>> >> > values.
>> >> >>> >>
>> >> >>> >> I don't know. Wikipedia seems to have some elaborate rules in
>> >> place
>> >> as
>> >> >>> >> far as Latin/Cyrillic URLs are concerned. Maybe we should
>> follow
>> >> these
>> >> >>> >> rules too?
>> >> >>> >>
>> >> >>> >
>> >> >>> > Are the "preserve" rules also applied to wikilinks? If they are
>> >> not
>> >> then
>> >> >>> > I
>> >> >>> > think we should not apply transliteration to URIs.
>> >> >>>
>> >> >>> According to a few tests on my user page, the text (title)
>> displayed
>> >> >>> for a Wiki link is transliterated unless it's "protected" by
>> >> -{...}-.
>> >> >>> The actual link target is *always* the Cyrillic version, even if
>> the
>> >> >>> wiki text contains the Latin article name. Example: [[Johan
>> Volfgang
>> >> >>> Gete]] always results in a link to
>> >> >>> http://sr.wikipedia.org/wiki/Јохан_Волфганг_Гете .
>> >> >>
>> >> >>
>> >> >> You're right (as usual ;))
>> >> >> I suppose the mediawiki instance transliterates the text in the
>> >> wikilink and
>> >> >> connects to the
>> >> >> cyrillic page on-the-fly, if it exists.
>> >> >> I think maybe Uros can help us understand what happens when you
>> >> create a
>> >> >> page, whether
>> >> >> you have to use a cyrillic title or you can also insert a latin
>> >> title.
>> >> >> Also, would be interesting to understand if the mediawiki instance
>> >> >> transliterates latin titles
>> >> >> on page creation.
>> >> >
>> >> > That's controlled by the __NOTITLECONVERT__ magic word. See
>> >> > https://www.mediawiki.org/wiki/Help:Magic_words . The Serbian
>> variants
>> >> > of the magic word are __БЕЗКН__ and __BEZKN__ . See
>> >> >
>> >>
>> https://git.wikimedia.org/blob/mediawiki%2Fcore.git/master/languages%2Fmessages%2FMessagesSr_ec.php
>> >> >
>> >> > Example: http://sr.wikipedia.org/wiki/ASCII isn't transliterated to
>> >> > http://sr.wikipedia.org/wiki/АСЦИИ . On the contrary: [[АСЦИИ]] is
>> >> > rendered as a link to http://sr.wikipedia.org/wiki/ASCII .
>> >> >
>> >> > As usual with MediaWiki, the devil is very much in the details.
>> >>
>> >> ...and the deeper you dig, the more evil you find... There are pages
>> >> that *don't* contain __NOTITLECONVERT__ or its synonyms, and whose
>> >> titles still aren't transliterated, e.g.
>> >> http://sr.wikipedia.org/wiki/Little_endian or
>> >> http://sr.wikipedia.org/wiki/Acetil ... I'm giving up.
>> >>
>> >>
>> >> >
>> >> >> One approach could be to create owl:sameAs triples linking
>> cyrillic
>> >> >> resources to latin resources,
>> >> >> and then ignoring transliteration for URIs...
>> >> >>
>> >> >>>
>> >> >>> If we want DBpedia to use the same policy, then we *should*
>> >> >>> transliterate URIs. Currently, we always use the link target as
>> it's
>> >> >>> in the wiki source text. Example: for [[Johan Volfgang Gete]], we
>> >> >>> generate a link to
>> >> http://sr.dbpedia.org/resource/Johan_Volfgang_Gete
>> >> >>> . To be consistent with Wikipedia, the link should point to
>> >> >>> http://sr.dbpedia.org/resource/Јохан_Волфганг_Гете instead.
>> >> >>>
>> >> >>
>> >> >> See above.
>> >> >>
>> >> >>>
>> >> >>> The main problem I see with transliterating URIs is
>> configuration.
>> >> >>> That's one of the main problems of DBpedia anyway. We're putting
>> too
>> >> >>> much effort into parsing configuration files. To allow
>> >> transliteration
>> >> >>> of URIs, we have to extend the UriPolicy syntax and parser, which
>> is
>> >> >>> already pretty convoluted anyway. If we used something like
>> Spring
>> >> >>> instead of self-made configuration stuff, we would simply add a
>> >> class
>> >> >>> and reference the class in the configuration. Additionally, we
>> >> should
>> >> >>> use different configuration objects for each language. That
>> doesn't
>> >> >>> have to mean that we need a separate configuration file for each
>> >> >>> language, just that we have to initialize the extraction
>> framework
>> >> >>> differently for each language. This would also make UriPolicy
>> >> >>> configuration easier.
>> >> >>>
>> >> >>> JC
>> >> >>
>> >> >>
>> >> >> I am with you :)
>> >> >> What about Typesafe Config? [1]
>> >> >>
>> >> >> [1] https://github.com/typesafehub/config
>> >> >>
>> >> >> Andrea
>> >> >>
>> >> >>>
>> >> >>>
>> >> >>> >
>> >> >>> > Cheers!
>> >> >>> > Andrea
>> >> >>> >
>> >> >>> >>
>> >> >>> >> Cheers,
>> >> >>> >> JC
>> >> >>> >>
>> >> >>> >> >
>> >> >>> >> > Cheers
>> >> >>> >> > Andrea
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>> >> >>> >> >>
>> >> >>> >> >> Hi all,
>> >> >>> >> >>
>> >> >>> >> >> I don't think UriPolicy is a good place to do this...
>> >> >>> >> >>
>> >> >>> >> >> But anyway, I don't understand the problem yet. :-)
>> >> >>> >> >>
>> >> >>> >> >> Uros, you wrote about ISO 8859-2 and ISO 15924.
>> >> >>> >> >>
>> >> >>> >> >> ISO 8859-2 is a character encoding, but I'm pretty sure
>> that
>> >> >>> >> >> Wikipedia
>> >> >>> >> >> is not using it, and I know that DBpedia is not using it. I
>> >> think
>> >> >>> >> >> Wikipedia uses UTF-8 all over the place. I know that the
>> >> Wikipedia
>> >> >>> >> >> XML
>> >> >>> >> >> dumps are UTF-8 encoded, and so are the DBpedia dumps.
>> >> >>> >> >>
>> >> >>> >> >> ISO 15924 is not a character encoding, but a way to specify
>> >> the
>> >> >>> >> >> names
>> >> >>> >> >> of scripts. See https://en.wikipedia.org/wiki/ISO_15924
>> >> >>> >> >>
>> >> >>> >> >> What would romanization or cyrillization do exactly? Is
>> there
>> >> a
>> >> >>> >> >> one-to-one mapping between letters? Or letter sequences?
>> >> >>> >> >>
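On the one-to-one question: Serbian Cyrillic to Gaj's Latin is letter-by-letter, except that љ, њ and џ become the Latin digraphs lj, nj and dž. The reverse direction is therefore not always unambiguous (for instance, "nj" in "injekcija" is н + ј, not њ). A minimal sketch with a hand-written table follows; it is a stand-in for illustration, not ICU4J's "Serbian-Latin/BGN" rules, and the class name SerbianRomanizer is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hand-written Serbian Cyrillic -> Gaj's Latin table (illustrative only). */
final class SerbianRomanizer {
    private static final Map<Character, String> TABLE = new HashMap<>();

    static {
        // The 30 letters of the Serbian Cyrillic alphabet, in order.
        String cyr = "абвгдђежзијклљмнњопрстћуфхцчџш";
        String[] lat = {"a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i",
                        "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r",
                        "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š"};
        for (int i = 0; i < lat.length; i++) {
            char lower = cyr.charAt(i);
            TABLE.put(lower, lat[i]);
            // Simplification: capitals map to a title-cased form (Љ -> Lj),
            // even inside all-caps words.
            String title = Character.toUpperCase(lat[i].charAt(0)) + lat[i].substring(1);
            TABLE.put(Character.toUpperCase(lower), title);
        }
    }

    /** Replaces each Serbian Cyrillic letter; everything else passes through. */
    static String romanize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String mapped = TABLE.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }
}
```

Latin text, digits and punctuation are left alone, so mixed strings such as "(енгл. Malachor V)" keep their Latin parts.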
>> >> >>> >> >> Cheers,
>> >> >>> >> >> JC
>> >> >>> >> >>
>> >> >>> >> >> On 3 December 2013 16:02, Dimitris Kontokostas <
>> >> [email protected]>
>> >> >>> >> >> wrote:
>> >> >>> >> >> > Hi Uros,
>> >> >>> >> >> >
>> >> >>> >> >> > Don't worry, as we said we are here to help if you get stuck;) we
>> >> >>> >> >> > all started like this.
>> >> >>> >> >> >
>> >> >>> >> >> > If you look at the formatters package you will understand what's
>> >> >>> >> >> > going on.
>> >> >>> >> >> > We have formatters that write a triple based on some policies we
>> >> >>> >> >> > define.
>> >> >>> >> >> > We parse the policies at runtime, create formatters based on these
>> >> >>> >> >> > policies and feed them to destinations.
>> >> >>> >> >> >
>> >> >>> >> >> > I think we should generalize URIPolicy to TriplePolicy and create a
>> >> >>> >> >> > "transliterate" action.
>> >> >>> >> >> > I made a change in the URIPolicy code to make it more descriptive [1]
>> >> >>> >> >> > Right now we have support only for URIs but if you change the
>> >> >>> >> >> > following it should be a good start to make your changes
>> >> >>> >> >> >
>> >> >>> >> >> >   // String: URI or literal, Boolean: is URI or not,
>> >> >>> >> >> >   // String: output (new URI or transliterated string)
>> >> >>> >> >> >   type Policy = (String, Boolean) => String
>> >> >>> >> >> >
>> >> >>> >> >> >   type PolicyApplicable = (String, Boolean) => Boolean
>> >> >>> >> >> >
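To make the shape of a "transliterate" action concrete, here is a minimal sketch. It is written in Java with java.util.function types standing in for the two Scala aliases above; the class TransliteratePolicy and its methods are hypothetical and are not code from the extraction framework. It assumes the simplest rule discussed in this thread: literals are transliterated, URIs are left alone.

```java
import java.util.function.BiFunction;
import java.util.function.BiPredicate;
import java.util.function.UnaryOperator;

/** Illustrative only; mirrors Policy and PolicyApplicable from the quoted Scala. */
final class TransliteratePolicy {
    /** PolicyApplicable: (value, isUri) -> Boolean. Applies to literals only. */
    static BiPredicate<String, Boolean> applicable() {
        return (value, isUri) -> !isUri;
    }

    /** Policy: (value, isUri) -> output string. */
    static BiFunction<String, Boolean, String> policy(UnaryOperator<String> transliterate) {
        BiPredicate<String, Boolean> applies = applicable();
        // Values the policy does not apply to are returned unchanged.
        return (value, isUri) ->
                applies.test(value, isUri) ? transliterate.apply(value) : value;
    }
}
```

Keeping the applicability check separate from the transformation follows the two-alias split in the quoted code, so the same check could be reused by other actions.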
>> >> >>> >> >> > I also submitted a feature request [2], you can make a
>> >> proper
>> >> >>> >> >> > description
>> >> >>> >> >> > and continue the discussion there
>> >> >>> >> >> >
>> >> >>> >> >> > Cheers,
>> >> >>> >> >> > Dimitris
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> > [1]
>> https://github.com/dbpedia/extraction-framework/pull/131
>> >> >>> >> >> > [2]
>> >> https://github.com/dbpedia/extraction-framework/issues/130
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> > On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic
>> >> >>> >> >> > <[email protected]>
>> >> >>> >> >> > wrote:
>> >> >>> >> >> >>
>> >> >>> >> >> >> Hi Andrea/Dimitris,
>> >> >>> >> >> >>
>> >> >>> >> >> >> Thanks for the tips. Actually, when I said I was no core expert, I
>> >> >>> >> >> >> meant I was an absolute beginner. :) I wanted to go with an
>> >> >>> >> >> >> extractor because that seemed simpler (and safer) than meddling
>> >> >>> >> >> >> with the core. Most of the stuff in there still seems rather
>> >> >>> >> >> >> confusing, but I'll look into it.
>> >> >>> >> >> >>
>> >> >>> >> >> >> So, the UriPolicy code is where the triples get written (pointer to
>> >> >>> >> >> >> the exact line, anyone?), or is this simply where you'd like to
>> >> >>> >> >> >> place the new code? Also, would "UriPolicy" remain an adequate name
>> >> >>> >> >> >> for the class, then?
>> >> >>> >> >> >>
>> >> >>> >> >> >> Best,
>> >> >>> >> >> >> Uros
>> >> >>> >> >> >>
>> >> >>> >> >> >>
>> >> >>> >> >> >> > Maybe something like:
>> >> >>> >> >> >> >
>> >> >>> >> >> >> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>> >> >>> >> >> >> >
>> >> >>> >> >> >> > where you specify a list of
>> (languageTag:transliterator)
>> >> >>> >> >> >> > separated
>> >> >>> >> >> >> > by
>> >> >>> >> >> >> > ';'
>> >> >>> >> >> >> > for one language?
>> >> >>> >> >> >> > The transliterator could be either "identity" (no
>> >> >>> >> >> >> > transformation)
>> >> >>> >> >> >> > or
>> >> >>> >> >> >> > a
>> >> >>> >> >> >> > icu4j transliterator-ID.
>> >> >>> >> >> >> >
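The proposed property value is easy to parse. Below is a minimal sketch, assuming exactly the syntax suggested above: entries separated by ';', each of the form languageTag:transliterator, split on the first ':'. The class name ScriptConfig is hypothetical, and nothing here checks that a transliterator ID actually exists in ICU4J.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parser for the value of a script.<lang> property. */
final class ScriptConfig {
    /** Parses "tag:transliterator;tag:transliterator" into an ordered map. */
    static Map<String, String> parse(String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : value.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing ';'
            }
            int colon = trimmed.indexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                        "expected languageTag:transliterator, got: " + entry);
            }
            // Split on the first ':' only, so IDs may contain '/' or '-'.
            result.put(trimmed.substring(0, colon).trim(),
                       trimmed.substring(colon + 1).trim());
        }
        return result;
    }
}
```

For the example value, the map would hold sr-Cyrl -> identity and sr-Latn -> Serbian-Latin/BGN, in that order.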
>> >> >>> >> >> >> > As Dimitris said, Uros please feel free to ask if you
>> >> need
>> >> >>> >> >> >> > help!
>> >> >>> >> >> >> >
>> >> >>> >> >> >> > Cheers
>> >> >>> >> >> >> > Andrea
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >
>> >> >>> >> >> >> > 2013/11/30 Dimitris Kontokostas <[email protected]>
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna
>> >> >>> >> >> >> >> <[email protected]>wrote:
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>> Hello Uros,
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>> that's a really interesting problem :)
>> >> >>> >> >> >> >>> I am no expert either but probably the best approach
>> >> would be
>> >> >>> >> >> >> >>> to
>> >> >>> >> >> >> >>> "duplicate" triples when they are going to be
>> written
>> >> (e.g.
>> >> >>> >> >> >> >>> in
>> >> >>> >> >> >> >>> the
>> >> >>> >> >> >> >>> destinations package), instead of modifying the
>> >> extractors.
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> I agree, I'd suggest we extend the UriPolicy [1]
>> >> functionality
>> >> >>> >> >> >> >> to
>> >> >>> >> >> >> >> do
>> >> >>> >> >> >> >> string object transformations (now it only applies to
>> >> URIs
>> >> /
>> >> >>> >> >> >> >> IRIs)
>> >> >>> >> >> >> >> and use the configuration files to select the desired
>> >> output
>> >> >>> >> >> >> >> [2].
>> >> >>> >> >> >> >> Uros, do you want to give it a shot? You can always
>> ask
>> >> for
>> >> >>> >> >> >> >> help
>> >> >>> >> >> >> >> here
>> >> >>> >> >> >> >> ;)
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> [1]
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >>
>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>> >> >>> >> >> >> >> [2]
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >>
>> https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>> As for which tools to use, it looks like the icu4j
>> >> >>> >> >> >> >>> Transliterator suits your needs, e.g.
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate("Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>> results in
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u
>> >> >>> >> >> >> >>> univerzumu Ratova zvezda.
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>> What do you think?
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>> Cheers
>> >> >>> >> >> >> >>>  Andrea
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>> 2013/11/29 Uros Milosevic <[email protected]>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>> Hi all,
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>> As some of you may know, a Serbian version of DBpedia is
>> >> >>> >> >> >> >>>> currently in the works. Now, Serbian, unlike any other language
>> >> >>> >> >> >> >>>> in Europe, is digraphic in nature, officially supporting both
>> >> >>> >> >> >> >>>> (Serbian) Cyrillic and (Gaj's) Latin alphabet. This is
>> >> >>> >> >> >> >>>> absolutely fine for storing information in any modern knowledge
>> >> >>> >> >> >> >>>> base, but can often be a major obstacle for information
>> >> >>> >> >> >> >>>> retrieval.
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>> For instance, most Serbs rely on the Latin alphabet for
>> >> >>> >> >> >> >>>> communication/interaction on the Web. That means a large portion
>> >> >>> >> >> >> >>>> of the information is (and, often, expected to be) encoded in
>> >> >>> >> >> >> >>>> ISO 8859-2 (i.e. Latin-2). And, yet, 99% of the information in
>> >> >>> >> >> >> >>>> the Serbian Wikipedia dumps is encoded in ISO 15924 (i.e.
>> >> >>> >> >> >> >>>> Cyrillic). So, unless your software performs romanization (i.e.
>> >> >>> >> >> >> >>>> converts Cyrillic to Latin) or cyrillization (i.e. vice versa)
>> >> >>> >> >> >> >>>> on-the-fly, at retrieval time (Wikipedia appears to be doing
>> >> >>> >> >> >> >>>> this), many attempts at information extraction will be doomed to
>> >> >>> >> >> >> >>>> fail. This directly affects common tasks such as keyword search,
>> >> >>> >> >> >> >>>> label-based SPARQL querying, named entity recognition, etc.
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>> What I would like to do is improve some of the existing DBpedia
>> >> >>> >> >> >> >>>> extractors, or develop new ones, that would take this problem
>> >> >>> >> >> >> >>>> into consideration and perform romanization of Wikipedia dumps
>> >> >>> >> >> >> >>>> so as to output information encoded in *both* scripts. Now, I
>> >> >>> >> >> >> >>>> know storing the same information twice might not be the most
>> >> >>> >> >> >> >>>> elegant solution, but unless someone is to include
>> >> >>> >> >> >> >>>> romanization/cyrillization features in the next version of
>> >> >>> >> >> >> >>>> SPARQL, I don't see a better solution at the moment. Of course,
>> >> >>> >> >> >> >>>> there is also the matter of perspective - one could argue that
>> >> >>> >> >> >> >>>> although the information is the same, the very fact that
>> >> >>> >> >> >> >>>> different character sequences are needed to describe the same
>> >> >>> >> >> >> >>>> piece of knowledge makes this problem fall into the domain of
>> >> >>> >> >> >> >>>> multilinguality.
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>> So, the general idea is to use a single IRI per resource, but
>> >> >>> >> >> >> >>>> have two separate triples for any literal originally encoded in
>> >> >>> >> >> >> >>>> cyrillic. For example:
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >> >>> >> >> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label>
>> >> >>> >> >> >> >>>> "Парсер"@sr-Cyrl .
>> >> >>> >> >> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >> >>> >> >> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label>
>> >> >>> >> >> >> >>>> "Parser"@sr-Latn .
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>> The above language tags are as per IANA Language Subtag Registry
>> >> >>> >> >> >> >>>> [1], which lists them as redundant, though, so a "sr" tag,
>> >> >>> >> >> >> >>>> instead, could be enough for both.
>> >> >>> >> >> >> >>>>
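The "one IRI, two label triples" idea can be sketched in a few lines. This is a minimal illustration, not framework code: the class DigraphicLabels is hypothetical, the romanize parameter stands for whichever transliterator ends up being used, and the escaping covers only the most common N-Triples cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Illustrative only: emits one rdfs:label per script for the same subject. */
final class DigraphicLabels {
    private static final String RDFS_LABEL =
            "<http://www.w3.org/2000/01/rdf-schema#label>";

    static List<String> labelTriples(String subjectIri, String cyrillicLabel,
                                     UnaryOperator<String> romanize) {
        List<String> triples = new ArrayList<>();
        // Original literal, tagged as Cyrillic.
        triples.add(triple(subjectIri, cyrillicLabel, "sr-Cyrl"));
        // Duplicated literal, transliterated and tagged as Latin.
        triples.add(triple(subjectIri, romanize.apply(cyrillicLabel), "sr-Latn"));
        return triples;
    }

    private static String triple(String subjectIri, String literal, String lang) {
        // Minimal escaping: backslash, double quote, line breaks.
        String escaped = literal.replace("\\", "\\\\").replace("\"", "\\\"")
                                .replace("\n", "\\n").replace("\r", "\\r");
        return "<" + subjectIri + "> " + RDFS_LABEL + " \"" + escaped + "\"@" + lang + " .";
    }
}
```

The subject IRI is deliberately passed through untouched, which matches the "single IRI per resource" part of the proposal.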
>> >> >>> >> >> >> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or
>> >> >>> >> >> >> >>>> any other information that would help me get started would be
>> >> >>> >> >> >> >>>> much appreciated!
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>> Best,
>> >> >>> >> >> >> >>>> Uros
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>> [1]
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >>
>> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >>
>> ------------------------------------------------------------------------------
>> >> >>> >> >> >> >>>> Rapidly troubleshoot problems before they affect
>> your
>> >> >>> >> >> >> >>>> business.
>> >> >>> >> >> >> >>>> Most
>> >> >>> >> >> >> >>>> IT
>> >> >>> >> >> >> >>>> organizations don't have a clear picture of how
>> >> application
>> >> >>> >> >> >> >>>> performance
>> >> >>> >> >> >> >>>> affects their revenue. With AppDynamics, you get
>> 100%
>> >> >>> >> >> >> >>>> visibility
>> >> >>> >> >> >> >>>> into
>> >> >>> >> >> >> >>>> your
>> >> >>> >> >> >> >>>> Java,.NET, & PHP application. Start your 15-day
>> FREE
>> >> TRIAL
>> >> >>> >> >> >> >>>> of
>> >> >>> >> >> >> >>>> AppDynamics Pro!
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >>
>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>> >> >>> >> >> >> >>>> _______________________________________________
>> >> >>> >> >> >> >>>> Dbpedia-developers mailing list
>> >> >>> >> >> >> >>>> [email protected]
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>>
>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>> >> >>> >> >> >> >>>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> --
>> >> >>> >> >> >> >> Kontokostas Dimitris
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >
>> >> >>> >> >> >>
>> >> >>> >> >> >>
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> > --
>> >> >>> >> >> > Kontokostas Dimitris
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >
>> >> >>> >
>> >> >>
>> >> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Kontokostas Dimitris
>> >
>> ------------------------------------------------------------------------------
>> > Sponsored by Intel(R) XDK
>> > Develop, test and display web and hybrid apps with a single code base.
>> > Download it for free now!
>> >
>> http://pubads.g.doubleclick.net/gampad/clk?id=111408631&iu=/4140/ostg.clktrk
>> > _______________________________________________
>> > Dbpedia-developers mailing list
>> > [email protected]
>> > https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>> >
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>


