On 3 December 2013 16:54, Andrea Di Menna <[email protected]> wrote:
> Hi,
>
> I agree with JC that probably UriPolicy is not the best place.

I guess extending UriPolicy looks attractive because modifying literals has some common needs with modifying URIs. But we should rather introduce a new class, StringLiteralPolicy or so, and move some code from UriPolicy to a common base class. Maybe we can share the policy parsing code etc. But literals and URIs are quite different and should probably be handled by different classes.

Maybe we need a new Destination subclass too (or instead). Actually, if we follow YAGNI and KISS principles we should simply use a SerbianTransliterationDestination...

> As for Uros's use case, I understand that what he would like to obtain is a
> duplication of quads.
> Probably this should be done in the Formatters or maybe as a post-processing
> operation?
>
> The problem is the following:
> - some languages are officially digraphic, that is, they can use two
>   different scripts (e.g. Latin and Cyrillic scripts)
> - Serbian (sr) is a digraphic language (Latin and Cyrillic)
> - Serbian Wikipedia allows users to see articles in Latin and Cyrillic, e.g.
>   Cyrillic:
>   https://sr.wikipedia.org/sr-ec/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
>   Latin:
>   https://sr.wikipedia.org/sr-el/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
> - Wikipedia dumps do not contain both versions but only Cyrillic in 99% of
>   the cases
> - if you were to extract string objects from the sr dump you would get
>   Cyrillic almost everywhere, for labels or for template property values

I just looked at a few pages in the Serbian Wikipedia. There is a piece of MediaWiki syntax that I hadn't seen before: wrapping text in -{...}- keeps it from being transliterated. In an ideal world, we would extend the DBpedia parser to handle this...

There are actually three ways a Serbian Wikipedia page can be displayed: unchanged, transliterated to Cyrillic, transliterated to Latin.
For example, I put this wiki text on my Serbian Wikipedia user page:

  Unprotected: Test
  Protected: -{Test}-
  Unprotected: Парсер
  Protected: -{Парсер}-

Depending on the URL, it is displayed in different ways:

http://sr.wikipedia.org/wiki/Корисник:Chrisahn or
http://sr.wikipedia.org/sr/Корисник:Chrisahn - unmodified

  Unprotected: Test
  Protected: Test
  Unprotected: Парсер
  Protected: Парсер

http://sr.wikipedia.org/sr-ec/Корисник:Chrisahn - transliterated to Cyrillic unless protected

  Унпротецтед: Тест
  Протецтед: Test
  Унпротецтед: Парсер
  Протецтед: Парсер

http://sr.wikipedia.org/sr-el/Корисник:Chrisahn - transliterated to Latin unless protected

  Unprotected: Test
  Protected: Test
  Unprotected: Parser
  Protected: Парсер

>
> Uros is wondering what would happen if a Serbian user searches using, for
> example, the Latin-transliterated version of a Cyrillic label (e.g. with
> SPARQL on Virtuoso).
> Their search would probably fail (unless Virtuoso implements transliteration
> on-the-fly).
>
> Romanization or Cyrillization are transliteration methods which are also
> available through ICU4J
> [http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Transliterator.html]

Looks good, but is there an implementation for Serbian? If there isn't, this probably won't help us much. Not enough to justify adding ICU4J as a new dependency, I think.

>
> I think it does not make sense to transliterate URIs but only string typed
> values.

I don't know. Wikipedia seems to have some elaborate rules in place as far as Latin/Cyrillic URLs are concerned. Maybe we should follow these rules too?

Cheers,
JC

>
> Cheers
> Andrea
>
>
> 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>>
>> Hi all,
>>
>> I don't think UriPolicy is a good place to do this...
>>
>> But anyway, I don't understand the problem yet. :-)
>>
>> Uros, you wrote about ISO 8859-2 and ISO 15924.
>>
>> ISO 8859-2 is a character encoding, but I'm pretty sure that Wikipedia
>> is not using it, and I know that DBpedia is not using it. I think
>> Wikipedia uses UTF-8 all over the place. I know that the Wikipedia XML
>> dumps are UTF-8 encoded, and so are the DBpedia dumps.
>>
>> ISO 15924 is not a character encoding, but a way to specify the names
>> of scripts. See https://en.wikipedia.org/wiki/ISO_15924
>>
>> What would romanization or cyrillization do exactly? Is there a
>> one-to-one mapping between letters? Or letter sequences?
>>
>> Cheers,
>> JC
>>
>> On 3 December 2013 16:02, Dimitris Kontokostas <[email protected]> wrote:
>> > Hi Uros,
>> >
>> > Don't worry, as we said we are here to help if you get stuck ;) we all
>> > started like this.
>> >
>> > If you look at the formatters package you will understand what's going
>> > on. We have formatters that write a triple based on some policies we
>> > define. We parse the policies at runtime, create formatters based on
>> > these policies and feed them to destinations.
>> >
>> > I think we should generalize UriPolicy to TriplePolicy and create a
>> > "transliterate" action.
>> > I made a change in the UriPolicy code to make it more descriptive [1].
>> > Right now we have support only for URIs, but if you change the following
>> > it should be a good start for your changes:
>> >
>> > // String: URI or literal, Boolean: is URI or not,
>> > // String: output (new URI or transliterated string)
>> > type Policy = (String, Boolean) => String
>> >
>> > type PolicyApplicable = (String, Boolean) => Boolean
>> >
>> > I also submitted a feature request [2]; you can write a proper
>> > description and continue the discussion there.
>> >
>> > Cheers,
>> > Dimitris
>> >
>> >
>> > [1] https://github.com/dbpedia/extraction-framework/pull/131
>> > [2] https://github.com/dbpedia/extraction-framework/issues/130
>> >
>> >
>> > On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic <[email protected]>
>> > wrote:
>> >>
>> >> Hi Andrea/Dimitris,
>> >>
>> >> Thanks for the tips. Actually, when I said I was no core expert, I
>> >> meant I was an absolute beginner. :) I wanted to go with an extractor
>> >> because that seemed simpler (and safer) than meddling with the core.
>> >> Most of the stuff in there still seems rather confusing, but I'll look
>> >> into it.
>> >>
>> >> So, the UriPolicy code is where the triples get written (pointer to the
>> >> exact line, anyone?), or is this simply where you'd like to place the
>> >> new code? Also, would "UriPolicy" remain an adequate name for the
>> >> class, then?
>> >>
>> >> Best,
>> >> Uros
>> >>
>> >>
>> >> > Maybe something like:
>> >> >
>> >> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>> >> >
>> >> > where you specify a list of (languageTag:transliterator) pairs
>> >> > separated by ';' for one language?
>> >> > The transliterator could be either "identity" (no transformation) or
>> >> > an icu4j transliterator ID.
>> >> >
>> >> > As Dimitris said, Uros please feel free to ask if you need help!
>> >> >
>> >> > Cheers
>> >> > Andrea
>> >> >
>> >> >
>> >> > 2013/11/30 Dimitris Kontokostas <[email protected]>
>> >> >
>> >> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna
>> >> >> <[email protected]> wrote:
>> >> >>
>> >> >>> Hello Uros,
>> >> >>>
>> >> >>> that's a really interesting problem :)
>> >> >>> I am no expert either but probably the best approach would be to
>> >> >>> "duplicate" triples when they are going to be written (e.g. in the
>> >> >>> destinations package), instead of modifying the extractors.
>> >> >>>
>> >> >>
>> >> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to do
>> >> >> string object transformations (now it only applies to URIs / IRIs)
>> >> >> and use the configuration files to select the desired output [2].
>> >> >> Uros, do you want to give it a shot? You can always ask for help
>> >> >> here ;)
>> >> >>
>> >> >> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>> >> >> [2] https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>> >> >>
>> >> >>
>> >> >>> As for which tools to use, it looks like the icu4j Transliterator
>> >> >>> suits your needs, e.g.
>> >> >>>
>> >> >>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate("Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>> >> >>>
>> >> >>> results in
>> >> >>>
>> >> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu
>> >> >>> Ratova zvezda.
>> >> >>>
>> >> >>> What do you think?
>> >> >>>
>> >> >>> Cheers
>> >> >>> Andrea
>> >> >>>
>> >> >>>
>> >> >>> 2013/11/29 Uros Milosevic <[email protected]>
>> >> >>>
>> >> >>>> Hi all,
>> >> >>>>
>> >> >>>> As some of you may know, a Serbian version of DBpedia is currently
>> >> >>>> in the works. Now, Serbian, unlike any other language in Europe, is
>> >> >>>> digraphic in nature, officially supporting both (Serbian) Cyrillic
>> >> >>>> and (Gaj's) Latin alphabet. This is absolutely fine for storing
>> >> >>>> information in any modern knowledge base, but can often be a major
>> >> >>>> obstacle for information retrieval.
>> >> >>>>
>> >> >>>> For instance, most Serbs rely on the Latin alphabet for
>> >> >>>> communication/interaction on the Web. That means a large portion of
>> >> >>>> the information is (and, often, expected to be) encoded in ISO
>> >> >>>> 8859-2 (i.e. Latin-2). And, yet, 99% of the information in the
>> >> >>>> Serbian Wikipedia dumps is encoded in ISO 15924 (i.e. Cyrillic).
>> >> >>>> So, unless your software performs romanization (i.e. converts
>> >> >>>> Cyrillic to Latin) or cyrillization (i.e. vice versa) on-the-fly,
>> >> >>>> at retrieval time (Wikipedia appears to be doing this), many
>> >> >>>> attempts at information extraction will be doomed to fail. This
>> >> >>>> directly affects common tasks such as keyword search, label-based
>> >> >>>> SPARQL querying, named entity recognition, etc.
>> >> >>>>
>> >> >>>> What I would like to do is improve some of the existing DBpedia
>> >> >>>> extractors, or develop new ones, that would take this problem into
>> >> >>>> consideration and perform romanization of Wikipedia dumps so as to
>> >> >>>> output information encoded in *both* scripts.
>> >> >>>> Now, I know storing the same information twice might not be the
>> >> >>>> most elegant solution, but unless someone is to include
>> >> >>>> romanization/cyrillization features in the next version of SPARQL,
>> >> >>>> I don't see a better solution at the moment. Of course, there is
>> >> >>>> also the matter of perspective - one could argue that although the
>> >> >>>> information is the same, the very fact that different character
>> >> >>>> sequences are needed to describe the same piece of knowledge makes
>> >> >>>> this problem fall into the domain of multilinguality.
>> >> >>>>
>> >> >>>> So, the general idea is to use a single IRI per resource, but have
>> >> >>>> two separate triples for any literal originally encoded in
>> >> >>>> cyrillic. For example:
>> >> >>>>
>> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
>> >> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>> >> >>>>
>> >> >>>> The above language tags are as per IANA Language Subtag Registry
>> >> >>>> [1], which lists them as redundant, though, so a "sr" tag, instead,
>> >> >>>> could be enough for both.
>> >> >>>>
>> >> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or any
>> >> >>>> other information that would help me get started would be much
>> >> >>>> appreciated!
>> >> >>>>
>> >> >>>> Best,
>> >> >>>> Uros
>> >> >>>>
>> >> >>>> [1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>> >> >>>>
>> >> >>>> ------------------------------------------------------------------------------
>> >> >>>> Rapidly troubleshoot problems before they affect your business.
>> >> >>>> Most IT organizations don't have a clear picture of how application
>> >> >>>> performance affects their revenue. With AppDynamics, you get 100%
>> >> >>>> visibility into your Java,.NET, & PHP application. Start your
>> >> >>>> 15-day FREE TRIAL of AppDynamics Pro!
>> >> >>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>> >> >>>> _______________________________________________
>> >> >>>> Dbpedia-developers mailing list
>> >> >>>> [email protected]
>> >> >>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>> >> >>
>> >> >> --
>> >> >> Kontokostas Dimitris
>> >
>> > --
>> > Kontokostas Dimitris
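
Regarding the question raised in the thread ("Is there a one-to-one mapping between letters? Or letter sequences?") and the -{...}- protection syntax: the Scala sketch below may help make both concrete. It is only an illustration under stated assumptions. The mapping table is written by hand from the Serbian Cyrillic and Gaj's Latin alphabets; it is not ICU4J's "Serbian-Latin/BGN" transliterator and not MediaWiki's own converter. The names SerbianRomanizer, toLatin, renderLatin and labels are hypothetical and do not exist in the extraction framework. Only the plain -{...}- form shown in the thread is handled, and only the Cyrillic-to-Latin direction.

```scala
// Sketch only: a hand-written table, not ICU4J and not MediaWiki's converter.
// All names in this object are hypothetical.
object SerbianRomanizer {

  // Lowercase Serbian Cyrillic -> Gaj's Latin. Most letters map one-to-one;
  // љ, њ and џ become the digraphs lj, nj and dž.
  private val lower: Map[Char, String] = Map(
    'а' -> "a", 'б' -> "b", 'в' -> "v", 'г' -> "g", 'д' -> "d", 'ђ' -> "đ",
    'е' -> "e", 'ж' -> "ž", 'з' -> "z", 'и' -> "i", 'ј' -> "j", 'к' -> "k",
    'л' -> "l", 'љ' -> "lj", 'м' -> "m", 'н' -> "n", 'њ' -> "nj", 'о' -> "o",
    'п' -> "p", 'р' -> "r", 'с' -> "s", 'т' -> "t", 'ћ' -> "ć", 'у' -> "u",
    'ф' -> "f", 'х' -> "h", 'ц' -> "c", 'ч' -> "č", 'џ' -> "dž", 'ш' -> "š"
  )

  // Uppercase entries are derived from the lowercase ones. Digraphs are
  // title-cased (Љ -> Lj), which is a simplification for all-caps words.
  private val table: Map[Char, String] =
    lower ++ lower.map { case (c, s) => c.toUpper -> s.capitalize }

  /** Romanizes every Serbian Cyrillic letter; other characters pass through. */
  def toLatin(text: String): String =
    text.map(c => table.getOrElse(c, c.toString)).mkString

  // Matches the plain -{...}- form shown in the thread (no newlines inside).
  private val Protected = """-\{(.*?)\}-""".r

  /** Romanizes wiki text but copies -{...}- spans verbatim, without markers. */
  def renderLatin(wikiText: String): String = {
    val out = new StringBuilder
    var last = 0
    for (m <- Protected.findAllMatchIn(wikiText)) {
      out ++= toLatin(wikiText.substring(last, m.start)) // unprotected part
      out ++= m.group(1)                                 // protected part
      last = m.end
    }
    out ++= toLatin(wikiText.substring(last))
    out.toString
  }

  /** One literal per script, as (language tag, value) pairs. */
  def labels(cyrillic: String): Seq[(String, String)] =
    Seq("sr-Cyrl" -> cyrillic, "sr-Latn" -> toLatin(cyrillic))
}
```

With this table, SerbianRomanizer.toLatin("Парсер") gives "Parser", and labels("Парсер") yields the two (tag, value) pairs that correspond to the sr-Cyrl and sr-Latn triples proposed in the original message. Whether such a step belongs in a policy, a formatter or a dedicated destination is the open design question discussed above; the sketch takes no position on that.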
