Hi all,

I don't think UriPolicy is a good place to do this...

But anyway, I don't understand the problem yet. :-)

Uros, you wrote about ISO 8859-2 and ISO 15924. ISO 8859-2 is a character
encoding, but I'm pretty sure that Wikipedia is not using it, and I know
that DBpedia is not using it. I think Wikipedia uses UTF-8 all over the
place. I know that the Wikipedia XML dumps are UTF-8 encoded, and so are
the DBpedia dumps.

ISO 15924 is not a character encoding, but a way to specify the names of
scripts. See https://en.wikipedia.org/wiki/ISO_15924

What would romanization or cyrillization do exactly? Is there a
one-to-one mapping between letters? Or letter sequences?

Cheers,
JC

On 3 December 2013 16:02, Dimitris Kontokostas wrote:
> Hi Uros,
>
> Don't worry, as we said we are here to help if you get stuck ;) we all
> started like this.
>
> If you look at the formatters package you will understand what's going on.
> We have formatters that write a triple based on some policies we define.
> We parse the policies at runtime, create formatters based on these policies
> and feed them to destinations.
>
> I think we should generalize URIPolicy to TriplePolicy and create a
> "transliterate" action.
> I made a change in the URIPolicy code to make it more descriptive [1]
> Right now we have support only for URIs but if you change the following it
> should be a good start to make your changes
>
> // String: Uri or Literal, Boolean: is URI or not, String: output (new URI
> // or transliterated string)
> type Policy = (String, Boolean) => String
>
> type PolicyApplicable = (String, Boolean) => Boolean
>
> I also submitted a feature request [2], you can make a proper description
> and continue the discussion there
>
> Cheers,
> Dimitris
>
>
> [1] https://github.com/dbpedia/extraction-framework/pull/131
> [2] https://github.com/dbpedia/extraction-framework/issues/130
>
>
> On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic wrote:
>>
>> Hi Andrea/Dimitris,
>>
>> Thanks for the tips.
>> Actually, when I said I was no core expert, I meant I
>> was an absolute beginner. :) I wanted to go with an extractor because that
>> seemed simpler (and safer) than meddling with the core. Most of the stuff
>> in there still seems rather confusing, but I'll look into it.
>>
>> So, the UriPolicy code is where the triples get written (pointer to the
>> exact line, anyone?), or is this simply where you'd like to place the new
>> code? Also, would "UriPolicy" remain an adequate name for the class, then?
>>
>> Best,
>> Uros
>>
>>
>> > Maybe something like:
>> >
>> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
>> >
>> > where you specify a list of (languageTag:transliterator) separated by ';'
>> > for one language?
>> > The transliterator could be either "identity" (no transformation) or an
>> > icu4j transliterator-ID.
>> >
>> > As Dimitris said, Uros please feel free to ask if you need help!
>> >
>> > Cheers
>> > Andrea
>> >
>> >
>> > 2013/11/30 Dimitris Kontokostas
>> >
>> >>
>> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna wrote:
>> >>
>> >>> Hello Uros,
>> >>>
>> >>> that's a really interesting problem :)
>> >>> I am no expert either but probably the best approach would be to
>> >>> "duplicate" triples when they are going to be written (e.g. in the
>> >>> destinations package), instead of modifying the extractors.
>> >>>
>> >>
>> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to do
>> >> string object transformations (now it only applies to URIs / IRIs)
>> >> and use the configuration files to select the desired output [2].
>> >> Uros, do you want to give it a shot?
>> >> You can always ask for help here ;)
>> >>
>> >> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
>> >> [2] https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
>> >>
>> >>> As regards which tools to use, it looks like the icu4j Transliterator
>> >>> suits your needs, e.g.
>> >>>
>> >>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate("Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
>> >>>
>> >>> results in
>> >>>
>> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova
>> >>> zvezda.
>> >>>
>> >>> What do you think?
>> >>>
>> >>> Cheers
>> >>> Andrea
>> >>>
>> >>>
>> >>> 2013/11/29 Uros Milosevic
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> As some of you may know, a Serbian version of DBpedia is currently in
>> >>>> the works. Now, Serbian, unlike any other language in Europe, is
>> >>>> digraphic in nature, officially supporting both (Serbian) Cyrillic and
>> >>>> (Gaj's) Latin alphabet. This is absolutely fine for storing information
>> >>>> in any modern knowledge base, but can often be a major obstacle for
>> >>>> information retrieval.
>> >>>>
>> >>>> For instance, most Serbs rely on the Latin alphabet for
>> >>>> communication/interaction on the Web. That means a large portion of
>> >>>> the information is (and, often, expected to be) encoded in ISO 8859-2
>> >>>> (i.e. Latin-2). And, yet, 99% of the information in the Serbian
>> >>>> Wikipedia dumps is encoded in ISO 15924 (i.e. Cyrillic). So, unless
>> >>>> your software performs romanization (i.e. converts Cyrillic to Latin)
>> >>>> or cyrillization (i.e.
>> >>>> vice versa) on-the-fly, at retrieval time (Wikipedia appears to be
>> >>>> doing this), many attempts at information extraction will be doomed
>> >>>> to fail. This directly affects common tasks such as keyword search,
>> >>>> label-based SPARQL querying, named entity recognition, etc.
>> >>>>
>> >>>> What I would like to do is improve some of the existing DBpedia
>> >>>> extractors, or develop new ones, that would take this problem into
>> >>>> consideration and perform romanization of Wikipedia dumps so as to
>> >>>> output information encoded in *both* scripts. Now, I know storing the
>> >>>> same information twice might not be the most elegant solution, but
>> >>>> unless someone is to include romanization/cyrillization features in
>> >>>> the next version of SPARQL, I don't see a better solution at the
>> >>>> moment. Of course, there is also the matter of perspective - one could
>> >>>> argue that although the information is the same, the very fact that
>> >>>> different character sequences are needed to describe the same piece of
>> >>>> knowledge makes this problem fall into the domain of multilinguality.
>> >>>>
>> >>>> So, the general idea is to use a single IRI per resource, but have
>> >>>> two separate triples for any literal originally encoded in Cyrillic.
>> >>>> For example:
>> >>>>
>> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Парсер"@sr-Cyrl .
>> >>>> <http://sr.dbpedia.org/resource/Парсер>
>> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
>> >>>>
>> >>>> The above language tags are as per IANA Language Subtag Registry [1],
>> >>>> which lists them as redundant, though, so a "sr" tag, instead, could
>> >>>> be enough for both.
>> >>>>
>> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or any
>> >>>> other information that would help me get started would be much
>> >>>> appreciated!
>> >>>>
>> >>>> Best,
>> >>>> Uros
>> >>>>
>> >>>> [1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>> >>>>
>> >>>> _______________________________________________
>> >>>> Dbpedia-developers mailing list
>> >>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>> >>>>
>> >>>
>> >>
>> >> --
>> >> Kontokostas Dimitris
>> >>
>> >
>>
>
> --
> Kontokostas Dimitris
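
To make the question about letter mappings concrete, here is a minimal sketch of what a "transliterate" policy over the generalised Policy signature quoted above might look like. It is only an illustration under stated assumptions: the object name TransliterationSketch, the toLatin helper and the hand-written letter table are all hypothetical, and none of them is part of the extraction framework or of icu4j. The table covers Serbian Cyrillic to Gaj's Latin only; a real implementation would more likely delegate to the icu4j Transliterator mentioned in the thread. Values flagged as URIs are passed through untouched, in line with the "single IRI per resource" idea.

```scala
// Hypothetical sketch: not part of the DBpedia extraction framework.
object TransliterationSketch {

  // Signature quoted in the thread: (value, isUri) => output
  type Policy = (String, Boolean) => String

  // Serbian Cyrillic -> Gaj's Latin, lowercase letters only.
  // Most letters map to one Latin letter; љ, њ and џ map to digraphs.
  private val lower: Map[Char, String] = Map(
    'а' -> "a", 'б' -> "b", 'в' -> "v", 'г' -> "g", 'д' -> "d", 'ђ' -> "đ",
    'е' -> "e", 'ж' -> "ž", 'з' -> "z", 'и' -> "i", 'ј' -> "j", 'к' -> "k",
    'л' -> "l", 'љ' -> "lj", 'м' -> "m", 'н' -> "n", 'њ' -> "nj", 'о' -> "o",
    'п' -> "p", 'р' -> "r", 'с' -> "s", 'т' -> "t", 'ћ' -> "ć", 'у' -> "u",
    'ф' -> "f", 'х' -> "h", 'ц' -> "c", 'ч' -> "č", 'џ' -> "dž", 'ш' -> "š"
  )

  // Romanize a string character by character. Characters that are not in
  // the table (Latin letters, digits, punctuation) are copied unchanged.
  def toLatin(text: String): String =
    text.map { c =>
      lower.get(c.toLower) match {
        // Uppercase Cyrillic: capitalise only the first Latin letter (Љ -> Lj).
        case Some(latin) if c.isUpper => s"${latin.head.toUpper}${latin.tail}"
        case Some(latin)              => latin
        case None                     => c.toString
      }
    }.mkString

  // A "transliterate" action: literals are romanized, URIs are left alone.
  val transliterate: Policy =
    (value, isUri) => if (isUri) value else toLatin(value)
}
```

With this sketch, TransliterationSketch.transliterate("Парсер", false) gives "Parser", while a value flagged as a URI comes back unchanged. Most letters map one-to-one, but љ, њ and џ become the digraphs lj, nj and dž, so as far as I can tell the reverse direction (Latin to Cyrillic) cannot be handled by a plain per-letter table.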
