Actually, that's what I had in mind, at least for starters. As for the
digram problem ('DŽ' = 'Џ' / 'ДЖ'), that would be an issue only if I were
to convert Latin to Cyrillic (there are no digrams in Cyrillic). As this
is going to be the other way round, it should be pretty much
straightforward.

Best,
Uros

> If you don't care about two or three letter combinations (like 'DŽ' => 'Џ', 'd!ž' => 'дж'), you could simply post-process the DBpedia files with the tr command line tool. If your version of tr can handle Unicode... See http://en.wikipedia.org/wiki/Tr_(Unix)
>
> For the record: here's the mapping used by MediaWiki:
> http://git.wikimedia.org/blob/mediawiki%2Fcore.git/master/languages%2Fclasses%2FLanguageSr.php
>
> On Dec 5, 2013 8:41 AM, "Uros Milosevic" <[email protected]> wrote:
>> > On Wed, Dec 4, 2013 at 12:17 PM, Uros Milosevic <[email protected]> wrote:
>> >> And here I was, thinking this would be simple. :)
>> >>
>> >> I really enjoyed myself reading about all the little details. JC, please, don't give up! :)
>> >> > I don't know why this is a problem. For Greek we have many pages with English names too
>> >> > i.e. http://el.wikipedia.org/wiki/ASCII
>> >> > http://el.wikipedia.org/wiki/World_Wide_Web
>> >> >
>> >> > I see the following options here
>> >> > A) For URIs:
>> >> > 1) leave title as we get it from the Wikipedia dumps (suggested option), since we might get some links to the other script so we can create sameAs links with a new extractor (easy)
>> >> > 2) give the option to transliterate *all* URIs to a preferred script (we might miss some semantics when Latin was intended and we choose a non-latin script)
>> >> The first option definitely makes more sense.
>> >> > B) for literals:
>> >> > Make an option to transliterate to a preferred transliteration as discussed in the beginning
>> >> > We don't need to handle "preserve" in the parser since the only place we might need it is the parser and this is already handled by the mw engine
>> >> >
>> >> > The general outcome so far (if I understood correctly) would be to create a general class i.e.
>> >> > TriplePolicy that would handle policy parsing
>> >> > UriPolicy will extend TriplePolicy and
>> >> > create a LiteralPolicy class that will handle literal values
>> >> >
>> >> > and maybe create a TransliterateSameAs extractor
>> >> >
>> >> > @Uros, you are the language expert here ;) can you suggest anything different?
>> >> Finally, I get to feel like an expert on something. :) I think you summed it up nicely. The suggested solution sounds reasonable, although I'm a little scared now and not sure I'd be of much help. Please do let me know if there's anything I can do, though.
>> > For us this is a (very) low priority feature request and we have some major stuff to work on for the next months.
>> > If you are willing to try we will of course help you and peer review your code
>> > but other than that we cannot promise to implement this soon
>> I understand that, and don't expect anyone to break their neck over this. As I said, there's much that's still unclear to me, but I'll look into it and report back should I find it just too difficult to handle. I certainly appreciate all your time, effort, tips and comments.
>>
>> Best,
>> Uros
>> > Best,
>> > Dimitris
>> >> Best,
>> >> Uros
>> >> > Cheers,
>> >> > Dimitris
>> >> >
>> >> > On Tue, Dec 3, 2013 at 11:01 PM, Jona Christopher Sahnwaldt <[email protected]> wrote:
>> >> >> On 3 December 2013 21:34, Jona Christopher Sahnwaldt <[email protected]> wrote:
>> >> >> > On 3 December 2013 20:49, Andrea Di Menna <[email protected]> wrote:
>> >> >> >> 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>> >> >> >>> On 3 December 2013 18:19, Andrea Di Menna <[email protected]> wrote:
>> >> >> >>> > 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>> >> >> >>> >> On 3 December 2013 16:54, Andrea Di Menna <[email protected]> wrote:
>> >> >> >>> >> > Hi,
>> >> >> >>> >> >
>> >> >> >>> >> > I agree with JC that probably UriPolicy is not the best place.
>> >> >> >>> >> I guess extending UriPolicy looks attractive because modifying literals has some common needs with modifying URIs. But we should rather introduce a new class StringLiteralPolicy or so and move some code from UriPolicy to a common base class. Maybe we can share the policy parsing code etc. But literals and URIs are quite different and should probably be handled by different classes.
>> >> >> >>> >>
>> >> >> >>> >> Maybe we need a new Destination subclass too (or instead). Actually, if we follow YAGNI and KISS principles we should simply use a SerbianTransliterationDestination...
>> >> >> >>> >> > As per Uros use case I understand that what he would like to obtain is a duplication of quads.
>> >> >> >>> >> > Probably this should be done in the Formatters or maybe as a post-processing operation?
>> >> >> >>> >> >
>> >> >> >>> >> > The problem is the following:
>> >> >> >>> >> > - some languages are officially digraphic, that is they can use two different scripts (e.g. latin and cyrillic scripts)
>> >> >> >>> >> > - Serbian (sr) is a digraphic language (latin and cyrillic)
>> >> >> >>> >> > - Serbian wikipedia allows users to see articles in latin and cyrillic, e.g.
>> >> >> >>> >> >   cyrillic: https://sr.wikipedia.org/sr-ec/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
>> >> >> >>> >> >   latin: https://sr.wikipedia.org/sr-el/%D0%93%D0%BE%D1%81%D0%BD%D0%B5%D0%BB_(%D0%90%D1%80%D0%BA%D0%B0%D0%BD%D0%B7%D0%B0%D1%81)
>> >> >> >>> >> > - wikipedia dumps do not contain both versions but only cyrillic in 99% of the cases
>> >> >> >>> >> > - if you were to extract string objects from the sr dump you would get cyrillic almost everywhere, for labels or for template property values
>> >> >> >>> >> I just looked at a few pages in the Serbian Wikipedia.
>> >> >> >>> >>
>> >> >> >>> >> There is a piece of MediaWiki syntax that I hadn't seen before: wrapping text in -{...}- keeps it from being transliterated. In an ideal world, we would extend the DBpedia parser to handle this...
>> >> >> >>> >>
>> >> >> >>> >> There are actually three ways a Serbian Wikipedia page can be displayed: unchanged, transliterated to Cyrillic, transliterated to Latin. For example, I put this wiki text on my Serbian Wikipedia user page:
>> >> >> >>> >>
>> >> >> >>> >> Unprotected: Test
>> >> >> >>> >> Protected: -{Test}-
>> >> >> >>> >> Unprotected: Парсер
>> >> >> >>> >> Protected: -{Парсер}-
>> >> >> >>> >>
>> >> >> >>> >> Depending on the URL, it is displayed in different ways:
>> >> >> >>> >>
>> >> >> >>> >> http://sr.wikipedia.org/wiki/Корисник:Chrisahn or http://sr.wikipedia.org/sr/Корисник:Chrisahn - unmodified
>> >> >> >>> >>
>> >> >> >>> >> Unprotected: Test
>> >> >> >>> >> Protected: Test
>> >> >> >>> >> Unprotected: Парсер
>> >> >> >>> >> Protected: Парсер
>> >> >> >>> >>
>> >> >> >>> >> http://sr.wikipedia.org/sr-ec/Корисник:Chrisahn - transliterated to Cyrillic unless protected
>> >> >> >>> >>
>> >> >> >>> >> Унпротецтед: Тест
>> >> >> >>> >> Протецтед: Test
>> >> >> >>> >> Унпротецтед: Парсер
>> >> >> >>> >> Протецтед: Парсер
>> >> >> >>> >>
>> >> >> >>> >> http://sr.wikipedia.org/sr-el/Корисник:Chrisahn - transliterated to Latin unless protected
>> >> >> >>> >>
>> >> >> >>> >> Unprotected: Test
>> >> >> >>> >> Protected: Test
>> >> >> >>> >> Unprotected: Parser
>> >> >> >>> >> Protected: Парсер
>> >> >> >>> > But still the content in the dumps will be the same, i.e. the wikitext you have saved in your page.
>> >> >> >>> > No matter how you render it on the Mediawiki instance which hosts it.
>> >> >> >>> > Correct?
>> >> >> >>> Correct.
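
The -{...}- behaviour shown on the test page above could be mimicked outside MediaWiki roughly as follows. This is only a sketch, not how MediaWiki or DBpedia implements it: `ProtectedSpans` and `convert` are made-up names, the transliteration function is supplied by the caller, and only the plain -{text}- form from the example is handled (nested markers or richer forms of the marker are ignored).

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of the DBpedia extraction framework.
object ProtectedSpans {
  // Matches -{ ... }- non-greedily; (?s) lets the span cross line breaks.
  private val Protected: Regex = """(?s)-\{(.*?)\}-""".r

  /** Transliterates everything outside -{...}- and copies the protected
    * text through unchanged, dropping the markers themselves. */
  def convert(wikiText: String, transliterate: String => String): String = {
    val out = new StringBuilder
    var last = 0
    for (m <- Protected.findAllMatchIn(wikiText)) {
      out ++= transliterate(wikiText.substring(last, m.start)) // unprotected run
      out ++= m.group(1)                                       // protected text as-is
      last = m.end
    }
    out ++= transliterate(wikiText.substring(last))            // trailing unprotected run
    out.toString
  }
}
```

With upper-casing as a stand-in for a real transliterator, `ProtectedSpans.convert("abc -{Test}- def", _.toUpperCase)` leaves "Test" alone and converts the rest.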
>> >> >> >>> >> > Uros is wondering what would happen if a serbian user searches using for example the latin transliterated version of a cyrillic label (e.g. using SPARQL on Virtuoso for example).
>> >> >> >>> >> > Their search would probably fail (unless Virtuoso implements transliteration on-the-fly).
>> >> >> >>> >> >
>> >> >> >>> >> > Romanization or Cyrillization are transliteration methods which are also available through ICU4J
>> >> >> >>> >> > [http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Transliterator.html]
>> >> >> >>> >> Looks good, but is there an implementation for Serbian? If there isn't, this probably won't help us much. Not enough to justify adding ICU4J as a new dependency, I think.
>> >> >> >>> > Yes there is a Transliterator with ID "Serbian-Latin/BGN" (a list here http://www.avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html , don't know if this is still valid)
>> >> >> >>> > I have made some quick tests and it seems to work OK.
>> >> >> >>> Cool!
>> >> >> >>> >> > I think it does not make sense to transliterate URIs but only string typed values.
>> >> >> >>> >> I don't know. Wikipedia seems to have some elaborate rules in place as far as Latin/Cyrillic URLs are concerned.
>> >> >> >>> >> Maybe we should follow these rules too?
>> >> >> >>> > Are the "preserve" rules also applied to wikilinks? If they are not then I think we should not apply transliteration to URIs.
>> >> >> >>> According to a few tests on my user page, the text (title) displayed for a Wiki link is transliterated unless it's "protected" by -{...}-. The actual link target is *always* the Cyrillic version, even if the wiki text contains the Latin article name. Example: [[Johan Volfgang Gete]] always results in a link to http://sr.wikipedia.org/wiki/Јохан_Волфганг_Гете .
>> >> >> >> You're right (as usual ;))
>> >> >> >> I suppose the mediawiki instance transliterates the text in the wikilink and connects to the cyrillic page on-the-fly, if it exists.
>> >> >> >> I think maybe Uros can help us understand what happens when you create a page, whether you have to use a cyrillic title or you can also insert a latin title.
>> >> >> >> Also, would be interesting to understand if the mediawiki instance transliterates latin titles on page creation.
>> >> >> > That's controlled by the __NOTITLECONVERT__ magic word. See https://www.mediawiki.org/wiki/Help:Magic_words . The Serbian variants of the magic word are __БЕЗКН__ and __BEZKN__ . See
>> >> >> > https://git.wikimedia.org/blob/mediawiki%2Fcore.git/master/languages%2Fmessages%2FMessagesSr_ec.php
>> >> >> >
>> >> >> > Example: http://sr.wikipedia.org/wiki/ASCII isn't transliterated to http://sr.wikipedia.org/wiki/АСЦИИ .
>> >> >> > On the contrary: [[АСЦИИ]] is rendered as a link to http://sr.wikipedia.org/wiki/ASCII .
>> >> >> >
>> >> >> > As usual with MediaWiki, the devil is very much in the details.
>> >> >> ...and the deeper you dig, the more evil you find... There are pages that *don't* contain __NOTITLECONVERT__ or its synonyms, and whose titles still aren't transliterated, e.g. http://sr.wikipedia.org/wiki/Little_endian or http://sr.wikipedia.org/wiki/Acetil ... I'm giving up.
>> >> >> >> One approach could be to create owl:sameAs triples linking cyrillic resources to latin resources, and then ignoring transliteration for URIs...
>> >> >> >>> If we want DBpedia to use the same policy, then we *should* transliterate URIs. Currently, we always use the link target as it's in the wiki source text. Example: for [[Johan Volfgang Gete]], we generate a link to http://sr.dbpedia.org/resource/Johan_Volfgang_Gete . To be consistent with Wikipedia, the link should point to http://sr.dbpedia.org/resource/Јохан_Волфганг_Гете instead.
>> >> >> >> See above.
>> >> >> >>> The main problem I see with transliterating URIs is configuration. That's one of the main problems of DBpedia anyway. We're putting too much effort into parsing configuration files. To allow transliteration of URIs, we have to extend the UriPolicy syntax and parser, which is already pretty convoluted anyway. If we used something like Spring instead of self-made configuration stuff, we would simply add a class and reference the class in the configuration.
>> >> >> >>> Additionally, we should use different configuration objects for each language. That doesn't have to mean that we need a separate configuration file for each language, just that we have to initialize the extraction framework differently for each language. This would also make UriPolicy configuration easier.
>> >> >> >>>
>> >> >> >>> JC
>> >> >> >> I am with you :)
>> >> >> >> What about Typesafe Config? [1]
>> >> >> >>
>> >> >> >> [1] https://github.com/typesafehub/config
>> >> >> >>
>> >> >> >> Andrea
>> >> >> >>> > Cheers!
>> >> >> >>> > Andrea
>> >> >> >>> >> Cheers,
>> >> >> >>> >> JC
>> >> >> >>> >> > Cheers
>> >> >> >>> >> > Andrea
>> >> >> >>> >> >
>> >> >> >>> >> > 2013/12/3 Jona Christopher Sahnwaldt <[email protected]>
>> >> >> >>> >> >> Hi all,
>> >> >> >>> >> >>
>> >> >> >>> >> >> I don't think UriPolicy is a good place to do this...
>> >> >> >>> >> >>
>> >> >> >>> >> >> But anyway, I don't understand the problem yet. :-)
>> >> >> >>> >> >>
>> >> >> >>> >> >> Uros, you wrote about ISO 8859-2 and ISO 15924.
>> >> >> >>> >> >>
>> >> >> >>> >> >> ISO 8859-2 is a character encoding, but I'm pretty sure that Wikipedia is not using it, and I know that DBpedia is not using it. I think Wikipedia uses UTF-8 all over the place. I know that the Wikipedia XML dumps are UTF-8 encoded, and so are the DBpedia dumps.
>> >> >> >>> >> >>
>> >> >> >>> >> >> ISO 15924 is not a character encoding, but a way to specify the names of scripts.
>> >> >> >>> >> >> See https://en.wikipedia.org/wiki/ISO_15924
>> >> >> >>> >> >>
>> >> >> >>> >> >> What would romanization or cyrillization do exactly? Is there a one-to-one mapping between letters? Or letter sequences?
>> >> >> >>> >> >>
>> >> >> >>> >> >> Cheers,
>> >> >> >>> >> >> JC
>> >> >> >>> >> >>
>> >> >> >>> >> >> On 3 December 2013 16:02, Dimitris Kontokostas <[email protected]> wrote:
>> >> >> >>> >> >> > Hi Uros,
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > Don't worry, as we said we are here to help if you get stuck ;) we all started like this.
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > If you look at the formatters package you will understand what's going on.
>> >> >> >>> >> >> > We have formatters that write a triple based on some policies we define.
>> >> >> >>> >> >> > We parse the policies at runtime, create formatters based on these policies and feed them to destinations.
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > I think we should generalize URIPolicy to TriplePolicy and create a "transliterate" action.
>> >> >> >>> >> >> > I made a change in the URIPolicy code to make it more descriptive [1]
>> >> >> >>> >> >> > Right now we have support only for URIs but if you change the following it should be a good start to make your changes
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >   // String: Uri or Literal, Boolean: is URI or not, String: output (new URI or transliterated string)
>> >> >> >>> >> >> >   type Policy = (String, Boolean) => String
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >   type PolicyApplicable = (String, Boolean) => Boolean
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > I also submitted a feature request [2], you can make a proper description and continue the discussion there
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > Cheers,
>> >> >> >>> >> >> > Dimitris
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > [1] https://github.com/dbpedia/extraction-framework/pull/131
>> >> >> >>> >> >> > [2] https://github.com/dbpedia/extraction-framework/issues/130
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic <[email protected]> wrote:
>> >> >> >>> >> >> >> Hi Andrea/Dimitris,
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> Thanks for the tips. Actually, when I said I was no core expert, I meant I was an absolute beginner. :) I wanted to go with an extractor because that seemed simpler (and safer) than meddling with the core.
>> >> >> >>> >> >> >> Most of the stuff in there still seems rather confusing, but I'll look into it.
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> So, the UriPolicy code is where the triples get written (pointer to the exact line, anyone?), or is this simply where you'd like to place the new code? Also, would "UriPolicy" remain an adequate name for the class, then?
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> Best,
>> >> >> >>> >> >> >> Uros
>> >> >> >>> >> >> >> > Maybe something like:
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >   script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > where you specify a list of (languageTag:transliterator) separated by ';' for one language?
>> >> >> >>> >> >> >> > The transliterator could be either "identity" (no transformation) or an icu4j transliterator-ID.
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > As Dimitris said, Uros please feel free to ask if you need help!
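
A value in the `script.sr=...` format proposed above could be parsed along these lines. This is a minimal sketch under assumptions: the property format is only a suggestion in this thread, and `ScriptConfig`, `ScriptTarget` and `parse` are hypothetical names that do not exist in the extraction framework.

```scala
// Hypothetical parser for the proposed "languageTag:transliterator;..." value.
object ScriptConfig {
  /** One (languageTag, transliteratorId) pair, e.g. ("sr-Latn", "Serbian-Latin/BGN"). */
  final case class ScriptTarget(languageTag: String, transliteratorId: String)

  /** Splits the value on ';', then each entry on the first ':'. */
  def parse(value: String): List[ScriptTarget] =
    value.split(';').toList.map(_.trim).filter(_.nonEmpty).map { entry =>
      entry.split(":", 2) match {
        case Array(tag, id) if tag.trim.nonEmpty && id.trim.nonEmpty =>
          ScriptTarget(tag.trim, id.trim)
        case _ =>
          throw new IllegalArgumentException(
            s"expected languageTag:transliterator, got '$entry'")
      }
    }
}
```

For the example value, `ScriptConfig.parse("sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN")` yields one target per script, and the caller can treat "identity" as "emit the literal unchanged".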
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > Cheers
>> >> >> >>> >> >> >> > Andrea
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > 2013/11/30 Dimitris Kontokostas <[email protected]>
>> >> >> >>> >> >> >> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna <[email protected]> wrote:
>> >> >> >>> >> >> >> >>> Hello Uros,
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>> that's a really interesting problem :)
>> >> >> >>> >> >> >> >>> I am no expert either but probably the best approach would be to "duplicate" triples when they are going to be written (e.g. in the destinations package), instead of modifying the extractors.
>> >> >> >>> >> >> >> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to do string object transformations (now it only applies to URIs / IRIs) and use the configuration files to select the desired output [2].
>> >> >> >>> >> >> >> >> Uros, do you want to give it a shot?
>> >> >> >>> >> >> >> >> You can always ask for help here ;)
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>> >> >> >>> >> >> >> >> [2] https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>> >> >> >>> >> >> >> >>> As for which tools to use, it looks like the icu4j Transliterator suits your needs, e.g.
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>>   Transliterator.getInstance("Serbian-Latin/BGN").transliterate("Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>> results in
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>>   Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova zvezda.
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>> What do you think?
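
Since every Serbian Cyrillic letter maps to a fixed Latin letter or letter pair, the Cyrillic-to-Latin direction can also be sketched without any library. The following is an illustration only, not the ICU4J implementation: `SrCyrlToLatn` is a hypothetical name, the table is the standard 30-letter Cyrillic to Gaj's Latin correspondence, and upper-case Љ, Њ, Џ are always rendered as Lj, Nj, Dž (so all-caps words would need extra handling).

```scala
// Hypothetical, dependency-free romanization sketch for Serbian.
object SrCyrlToLatn {
  // Lower-case Serbian Cyrillic letters and their Gaj's Latin equivalents.
  private val lower: Map[Char, String] = Map(
    'а' -> "a", 'б' -> "b", 'в' -> "v", 'г' -> "g", 'д' -> "d", 'ђ' -> "đ",
    'е' -> "e", 'ж' -> "ž", 'з' -> "z", 'и' -> "i", 'ј' -> "j", 'к' -> "k",
    'л' -> "l", 'љ' -> "lj", 'м' -> "m", 'н' -> "n", 'њ' -> "nj", 'о' -> "o",
    'п' -> "p", 'р' -> "r", 'с' -> "s", 'т' -> "t", 'ћ' -> "ć", 'у' -> "u",
    'ф' -> "f", 'х' -> "h", 'ц' -> "c", 'ч' -> "č", 'џ' -> "dž", 'ш' -> "š"
  )

  /** Replaces each Cyrillic letter; every other character passes through. */
  def transliterate(text: String): String = {
    val out = new StringBuilder
    for (c <- text) {
      lower.get(c.toLower) match {
        case Some(latin) if c.isUpper => out ++= latin.capitalize // Љ -> Lj
        case Some(latin)              => out ++= latin
        case None                     => out += c // Latin, digits, punctuation
      }
    }
    out.toString
  }
}
```

Calling `SrCyrlToLatn.transliterate("Парсер")` gives "Parser". The reverse direction would need look-ahead for the lj/nj/dž pairs, which is why it is the harder of the two.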
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>> Cheers
>> >> >> >>> >> >> >> >>> Andrea
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>> 2013/11/29 Uros Milosevic <[email protected]>
>> >> >> >>> >> >> >> >>>> Hi all,
>> >> >> >>> >> >> >> >>>>
>> >> >> >>> >> >> >> >>>> As some of you may know, a Serbian version of DBpedia is currently in the works. Now, Serbian, unlike any other language in Europe, is digraphic in nature, officially supporting both (Serbian) Cyrillic and (Gaj's) Latin alphabet. This is absolutely fine for storing information in any modern knowledge base, but can often be a major obstacle for information retrieval.
>> >> >> >>> >> >> >> >>>>
>> >> >> >>> >> >> >> >>>> For instance, most Serbs rely on the Latin alphabet for communication/interaction on the Web. That means a large portion of the information is (and, often, expected to be) encoded in ISO 8859-2 (i.e. Latin-2). And, yet, 99% of the information in the Serbian Wikipedia dumps is encoded in ISO 15924 (i.e. Cyrillic).
>> >> >> >>> >> >> >> >>>> So, unless your software performs romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice versa) on-the-fly, at retrieval time (Wikipedia appears to be doing this), many attempts at information extraction will be doomed to fail. This directly affects common tasks such as keyword search, label-based SPARQL querying, named entity recognition, etc.
>> >> >> >>> >> >> >> >>>>
>> >> >> >>> >> >> >> >>>> What I would like to do is improve some of the existing DBpedia extractors, or develop new ones, that would take this problem into consideration and perform romanization of Wikipedia dumps so as to output information encoded in *both* scripts. Now, I know storing the same information twice might not be the most elegant solution, but unless someone is to include romanization/cyrillization features in the next version of SPARQL, I don't see a better solution at the moment.
>> >> >> >>> >> >> >> >>>> Of course, there is also the matter of perspective - one could argue that although the information is the same, the very fact that different character sequences are needed to describe the same piece of knowledge makes this problem fall into the domain of multilinguality.
>> >> >> >>> >> >> >> >>>>
>> >> >> >>> >> >> >> >>>> So, the general idea is to use a single IRI per resource, but have two separate triples for any literal originally encoded in cyrillic. For example:
>> >> >> >>> >> >> >> >>>>
>> >> >> >>> >> >> >> >>>> <http://sr.dbpedia.org/resource/Парсер> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
>> >> >> >>> >> >> >> >>>> <http://sr.dbpedia.org/resource/Парсер> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>> >> >> >>> >> >> >> >>>>
>> >> >> >>> >> >> >> >>>> The above language tags are as per IANA Language Subtag Registry [1], which lists them as redundant, though, so a "sr" tag, instead, could be enough for both.
>> >> >> >>> >> >> >> >>>>
>> >> >> >>> >> >> >> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or any other information that would help me get started would be much appreciated!
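
The "one IRI, two label triples" idea from the example above could look roughly like this when a label is serialized. It is a sketch only: `DualScriptLabels` and `labelTriples` are hypothetical names, the romanization function is passed in rather than implemented here, and the IRI is written out as given, without any escaping or validation.

```scala
// Hypothetical helper that emits one rdfs:label triple per script.
object DualScriptLabels {
  private val RdfsLabel = "http://www.w3.org/2000/01/rdf-schema#label"

  // Escapes backslashes and double quotes inside a literal.
  private def escape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  /** Returns the Cyrillic triple followed by its romanized twin. */
  def labelTriples(iri: String, cyrillic: String, toLatin: String => String): Seq[String] = Seq(
    s"""<$iri> <$RdfsLabel> "${escape(cyrillic)}"@sr-Cyrl .""",
    s"""<$iri> <$RdfsLabel> "${escape(toLatin(cyrillic))}"@sr-Latn ."""
  )
}
```

Both lines share the same subject, so a query on either "Парсер"@sr-Cyrl or "Parser"@sr-Latn would reach the same resource.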
>
> Best,
> Uros
>
> [1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
