Maybe you could post-process the DBpedia dumps with this tool? http://www.huge-man-linux.net/man1/recode-sr-latin.html
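
If an external tool is not an option, the same post-processing step could be done in a few lines of Scala. Below is a minimal sketch, assuming a hand-written letter table for Serbian Cyrillic to Gaj's Latin. The names SrLatin, toLatin and labelTriples are made up for this example; none of them come from the extraction framework, recode-sr-latin or icu4j, and the table only stands in for what those tools do.

```scala
object SrLatin {
  // Lowercase Serbian Cyrillic -> Gaj's Latin (30 letters).
  // Three letters map to two Latin characters: lj, nj, dž.
  private val lower: Map[Char, String] = Map(
    'а' -> "a", 'б' -> "b", 'в' -> "v", 'г' -> "g", 'д' -> "d", 'ђ' -> "đ",
    'е' -> "e", 'ж' -> "ž", 'з' -> "z", 'и' -> "i", 'ј' -> "j", 'к' -> "k",
    'л' -> "l", 'љ' -> "lj", 'м' -> "m", 'н' -> "n", 'њ' -> "nj", 'о' -> "o",
    'п' -> "p", 'р' -> "r", 'с' -> "s", 'т' -> "t", 'ћ' -> "ć", 'у' -> "u",
    'ф' -> "f", 'х' -> "h", 'ц' -> "c", 'ч' -> "č", 'џ' -> "dž", 'ш' -> "š"
  )

  // Uppercase entries are derived from the lowercase ones; only the first
  // Latin character is capitalized (Љ -> Lj), so all-caps words are not handled.
  private val table: Map[Char, String] =
    lower ++ lower.map { case (c, s) => c.toUpper -> (s.head.toUpper.toString + s.tail) }

  // Characters outside the table (Latin text, digits, punctuation) pass through unchanged.
  def toLatin(text: String): String =
    text.flatMap(c => table.getOrElse(c, c.toString))

  // Hypothetical helper: one IRI, two rdfs:label triples (sr-Cyrl and sr-Latn).
  // No escaping of quotes or backslashes in the literal is attempted here.
  def labelTriples(iri: String, cyrl: String): Seq[String] = {
    val p = "<http://www.w3.org/2000/01/rdf-schema#label>"
    Seq(
      s"""<$iri> $p "$cyrl"@sr-Cyrl .""",
      s"""<$iri> $p "${toLatin(cyrl)}"@sr-Latn ."""
    )
  }
}
```

With this, SrLatin.toLatin("Парсер") gives "Parser", and labelTriples produces the sr-Cyrl / sr-Latn pair of label triples proposed further down in the thread. A real tool would be the safer choice for edge cases such as all-caps words containing Љ, Њ or Џ.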

On 3 December 2013 16:33, Jona Christopher Sahnwaldt <[email protected]> wrote:
> Hi all,
>
> I don't think UriPolicy is a good place to do this...
>
> But anyway, I don't understand the problem yet. :-)
>
> Uros, you wrote about ISO 8859-2 and ISO 15924.
>
> ISO 8859-2 is a character encoding, but I'm pretty sure that Wikipedia
> is not using it, and I know that DBpedia is not using it. I think
> Wikipedia uses UTF-8 all over the place. I know that the Wikipedia XML
> dumps are UTF-8 encoded, and so are the DBpedia dumps.
>
> ISO 15924 is not a character encoding, but a way to specify the names
> of scripts. See https://en.wikipedia.org/wiki/ISO_15924
>
> What would romanization or cyrillization do exactly? Is there a
> one-to-one mapping between letters? Or letter sequences?
>
> Cheers,
> JC
>
> On 3 December 2013 16:02, Dimitris Kontokostas <[email protected]> wrote:
>> Hi Uros,
>>
>> Don't worry, as we said we are here to help if you get stuck;) we all
>> started like this.
>>
>> If you look at the formatters package you will understand what's going on.
>> We have formatters that write a triple based on some policies we define.
>> We parse the policies at runtime, create formatters based on these policies
>> and feed them to destinations.
>>
>> I think we should generalize URIPolicy to TriplePolicy and create a
>> "transliterate" action.
>> I made a change in the URIPolicy code to make it more descriptive [1]
>> Right now we have support only for URIs but if you change the following it
>> should be a good start to make your changes
>>
>> // String: URI or literal, Boolean: is URI or not,
>> // result String: output (new URI or transliterated string)
>> type Policy = (String, Boolean) => String
>>
>> type PolicyApplicable = (String, Boolean) => Boolean
>>
>> I also submitted a feature request [2], you can make a proper description
>> and continue the discussion there
>>
>> Cheers,
>> Dimitris
>>
>>
>> [1] https://github.com/dbpedia/extraction-framework/pull/131
>> [2] https://github.com/dbpedia/extraction-framework/issues/130
>>
>>
>> On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic <[email protected]>
>> wrote:
>>>
>>> Hi Andrea/Dimitris,
>>>
>>> Thanks for the tips. Actually, when I said I was no core expert, I meant I
>>> was an absolute beginner. :) I wanted to go with an extractor because that
>>> seemed simpler (and safer) than meddling with the core. Most of the stuff
>>> in there still seems rather confusing, but I'll look into it.
>>>
>>> So, the UriPolicy code is where the triples get written (pointer to the
>>> exact line, anyone?), or is this simply where you'd like to place the new
>>> code? Also, would "UriPolicy" remain an adequate name for the class, then?
>>>
>>> Best,
>>> Uros
>>>
>>>
>>> > Maybe something like:
>>> >
>>> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>>> >
>>> > where you specify a list of (languageTag:transliterator) separated by
>>> > ';' for one language?
>>> > The transliterator could be either "identity" (no transformation) or an
>>> > icu4j transliterator-ID.
>>> >
>>> > As Dimitris said, Uros please feel free to ask if you need help!
>>> >
>>> > Cheers
>>> > Andrea
>>> >
>>> >
>>> > 2013/11/30 Dimitris Kontokostas <[email protected]>
>>> >
>>> >>
>>> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna
>>> >> <[email protected]> wrote:
>>> >>
>>> >>> Hello Uros,
>>> >>>
>>> >>> that's a really interesting problem :)
>>> >>> I am no expert either but probably the best approach would be to
>>> >>> "duplicate" triples when they are going to be written (e.g. in the
>>> >>> destinations package), instead of modifying the extractors.
>>> >>>
>>> >>
>>> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to do
>>> >> string object transformations (now it only applies to URIs / IRIs)
>>> >> and use the configuration files to select the desired output [2].
>>> >> Uros, do you want to give it a shot? You can always ask for help here
>>> >> ;)
>>> >>
>>> >> [1]
>>> >> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>>> >> [2]
>>> >> https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>>> >>
>>> >>
>>> >>> As for which tools to use, it looks like the icu4j Transliterator
>>> >>> suits your needs, e.g.
>>> >>>
>>> >>> Transliterator.getInstance("Serbian-Latin/BGN")
>>> >>>   .transliterate("Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>>> >>>
>>> >>> results in
>>> >>>
>>> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova
>>> >>> zvezda.
>>> >>>
>>> >>> What do you think?
>>> >>>
>>> >>> Cheers
>>> >>> Andrea
>>> >>>
>>> >>>
>>> >>> 2013/11/29 Uros Milosevic <[email protected]>
>>> >>>
>>> >>>> Hi all,
>>> >>>>
>>> >>>> As some of you may know, a Serbian version of DBpedia is currently in
>>> >>>> the works.
>>> >>>> Now, Serbian, unlike any other language in Europe, is digraphic in
>>> >>>> nature, officially supporting both (Serbian) Cyrillic and (Gaj's) Latin
>>> >>>> alphabet. This is absolutely fine for storing information in any modern
>>> >>>> knowledge base, but can often be a major obstacle for information
>>> >>>> retrieval.
>>> >>>>
>>> >>>> For instance, most Serbs rely on the Latin alphabet for
>>> >>>> communication/interaction on the Web. That means a large portion of the
>>> >>>> information is (and, often, expected to be) encoded in ISO 8859-2 (i.e.
>>> >>>> Latin-2). And, yet, 99% of the information in the Serbian Wikipedia dumps
>>> >>>> is encoded in ISO 15924 (i.e. Cyrillic). So, unless your software performs
>>> >>>> romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice
>>> >>>> versa) on-the-fly, at retrieval time (Wikipedia appears to be doing this),
>>> >>>> many attempts at information extraction will be doomed to fail. This
>>> >>>> directly affects common tasks such as keyword search, label-based SPARQL
>>> >>>> querying, named entity recognition, etc.
>>> >>>>
>>> >>>> What I would like to do is improve some of the existing DBpedia
>>> >>>> extractors, or develop new ones, that would take this problem into
>>> >>>> consideration and perform romanization of Wikipedia dumps so as to output
>>> >>>> information encoded in *both* scripts. Now, I know storing the same
>>> >>>> information twice might not be the most elegant solution, but unless
>>> >>>> someone is to include romanization/cyrillization features in the next
>>> >>>> version of SPARQL, I don't see a better solution at the moment.
>>> >>>> Of course, there is also the matter of perspective - one could argue that
>>> >>>> although the information is the same, the very fact that different
>>> >>>> character sequences are needed to describe the same piece of knowledge
>>> >>>> makes this problem fall into the domain of multilinguality.
>>> >>>>
>>> >>>> So, the general idea is to use a single IRI per resource, but have two
>>> >>>> separate triples for any literal originally encoded in Cyrillic. For
>>> >>>> example:
>>> >>>>
>>> >>>> <http://sr.dbpedia.org/resource/Парсер>
>>> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
>>> >>>> <http://sr.dbpedia.org/resource/Парсер>
>>> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>>> >>>>
>>> >>>> The above language tags are as per IANA Language Subtag Registry [1],
>>> >>>> which lists them as redundant, though, so a "sr" tag, instead, could be
>>> >>>> enough for both.
>>> >>>>
>>> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or any other
>>> >>>> information that would help me get started would be much appreciated!
>>> >>>>
>>> >>>> Best,
>>> >>>> Uros
>>> >>>>
>>> >>>> [1]
>>> >>>> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>>> >>>>
>>> >>>> ------------------------------------------------------------------------------
>>> >>>> Rapidly troubleshoot problems before they affect your business. Most IT
>>> >>>> organizations don't have a clear picture of how application performance
>>> >>>> affects their revenue.
>>> >>>> With AppDynamics, you get 100% visibility into your
>>> >>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>>> >>>> AppDynamics Pro!
>>> >>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>>> >>>> _______________________________________________
>>> >>>> Dbpedia-developers mailing list
>>> >>>> [email protected]
>>> >>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>> >>
>>> >> --
>>> >> Kontokostas Dimitris
>>
>> --
>> Kontokostas Dimitris
