Hi Uros,
Don't worry, as we said, we are here to help if you get stuck ;) We all
started like this.
If you look at the formatters package you will understand what's going on.
We have formatters that write a triple based on some policies we define.
We parse the policies at runtime, create formatters based on these policies
and feed them to destinations.
I think we should generalize UriPolicy to TriplePolicy and create a
"transliterate" action.
I made a change in the UriPolicy code to make it more descriptive [1].
Right now we only support URIs, but changing the following type aliases
should be a good starting point for your changes:
// String: URI or literal, Boolean: is URI or not, String: output (new URI
// or transliterated string)
type Policy = (String, Boolean) => String
type PolicyApplicable = (String, Boolean) => Boolean
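
To make the two signatures a bit more concrete, here is a minimal sketch of how a "transliterate" action could plug into them. Apart from the two type aliases, everything here is an assumption for illustration: `PolicySketch`, `literalsOnly`, `toLatin` and `applyPolicy` are hypothetical names, not existing framework code, and the tiny character map is only a stand-in for a real transliterator such as icu4j's "Serbian-Latin/BGN".

```scala
object PolicySketch {
  // The two aliases from the message: (value, isUri) => result.
  type Policy = (String, Boolean) => String
  type PolicyApplicable = (String, Boolean) => Boolean

  // Hypothetical stand-in for a real transliterator (e.g. icu4j);
  // it only covers the letters needed for the example.
  private val cyrToLat: Map[Char, String] = Map(
    'П' -> "P", 'а' -> "a", 'р' -> "r", 'с' -> "s", 'е' -> "e"
  )

  // Applicability check: this policy touches literals only, never URIs.
  val literalsOnly: PolicyApplicable = (_, isUri) => !isUri

  // The "transliterate" action: map each character, keep unknown ones as-is.
  val toLatin: Policy = (value, _) =>
    value.map(c => cyrToLat.getOrElse(c, c.toString)).mkString

  // Apply a policy only where its applicability check allows it.
  def applyPolicy(applicable: PolicyApplicable, policy: Policy)
                 (value: String, isUri: Boolean): String =
    if (applicable(value, isUri)) policy(value, isUri) else value
}
```

With this, `PolicySketch.applyPolicy(PolicySketch.literalsOnly, PolicySketch.toLatin)` leaves URIs untouched and rewrites only literal values.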
I also submitted a feature request [2]; you can add a proper description
and continue the discussion there.
Cheers,
Dimitris
[1] https://github.com/dbpedia/extraction-framework/pull/131
[2] https://github.com/dbpedia/extraction-framework/issues/130
On Mon, Dec 2, 2013 at 5:50 PM, Uros Milosevic <[email protected]> wrote:
> Hi Andrea/Dimitris,
>
> Thanks for the tips. Actually, when I said I was no core expert, I meant I
> was an absolute beginner. :) I wanted to go with an extractor because that
> seemed simpler (and safer) than meddling with the core. Most of the stuff
> in there still seems rather confusing, but I'll look into it.
>
> So, the UriPolicy code is where the triples get written (pointer to the
> exact line, anyone?), or is this simply where you'd like to place the new
> code? Also, would "UriPolicy" remain an adequate name for the class, then?
>
> Best,
> Uros
>
>
> > Maybe something like:
> >
> > script.sr=sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN
> >
> > where you specify a list of (languageTag:transliterator) separated by ';'
> > for one language?
> > The transliterator could be either "identity" (no transformation) or an
> > icu4j transliterator ID.
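
As a rough illustration of the format proposed above, a value such as `sr-Cyrl:identity;sr-Latn:Serbian-Latin/BGN` could be split into (languageTag, transliterator) pairs like this. It is only a sketch: `ScriptConfigSketch` and `parse` are made-up names, and the `script.sr` key is the proposal from this thread, not an existing setting.

```scala
object ScriptConfigSketch {
  // Parses "tag:transliterator;tag:transliterator" into ordered pairs.
  // Each entry is split on its first ':' only, so a transliterator ID
  // such as "Serbian-Latin/BGN" is kept intact.
  def parse(value: String): List[(String, String)] =
    value.split(';').toList.map(_.trim).filter(_.nonEmpty).map { entry =>
      entry.split(":", 2) match {
        case Array(tag, id) => (tag.trim, id.trim)
        case _ =>
          throw new IllegalArgumentException(
            s"Expected languageTag:transliterator, got '$entry'")
      }
    }
}
```

Keeping the result as an ordered list (rather than a map) preserves the order in which the scripts were listed in the property value.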
> >
> > As Dimitris said, Uros please feel free to ask if you need help!
> >
> > Cheers
> > Andrea
> >
> >
> > 2013/11/30 Dimitris Kontokostas <[email protected]>
> >
> >>
> >>
> >>
> >> On Fri, Nov 29, 2013 at 5:02 PM, Andrea Di Menna
> >> <[email protected]> wrote:
> >>
> >>> Hello Uros,
> >>>
> >>> that's a really interesting problem :)
> >>> I am no expert either but probably the best approach would be to
> >>> "duplicate" triples when they are going to be written (e.g. in the
> >>> destinations package), instead of modifying the extractors.
> >>>
> >>
> >> I agree, I'd suggest we extend the UriPolicy [1] functionality to do
> >> string object transformations (now it only applies to URIs / IRIs)
> >> and use the configuration files to select the desired output [2].
> >> Uros, do you want to give it a shot? You can always ask for help here ;)
> >>
> >> [1]
> >> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/UriPolicy.scala
> >> [2]
> >> https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.default.properties#L130
> >>
> >>
> >>> As for which tool to use, it looks like the icu4j Transliterator
> >>> suits your needs, e.g.
> >>>
> >>> Transliterator.getInstance("Serbian-Latin/BGN").transliterate(
> >>>   "Малакор 5 (енгл. Malachor V) је измишљена планета у универзуму Ратова звезда.")
> >>>
> >>> results in
> >>>
> >>> Malakor 5 (engl. Malachor V) je izmišljena planeta u univerzumu Ratova
> >>> zvezda.
> >>>
> >>> What do you think?
> >>>
> >>> Cheers
> >>> Andrea
> >>>
> >>>
> >>> 2013/11/29 Uros Milosevic <[email protected]>
> >>>
> >>>> Hi all,
> >>>>
> >>>> As some of you may know, a Serbian version of DBpedia is currently in
> >>>> the works. Now, Serbian, unlike any other language in Europe, is
> >>>> digraphic in nature, officially supporting both (Serbian) Cyrillic and
> >>>> (Gaj's) Latin alphabet. This is absolutely fine for storing information
> >>>> in any modern knowledge base, but can often be a major obstacle for
> >>>> information retrieval.
> >>>>
> >>>> For instance, most Serbs rely on the Latin alphabet for
> >>>> communication/interaction on the Web. That means a large portion of the
> >>>> information is (and, often, is expected to be) written in Latin script,
> >>>> which fits in ISO 8859-2 (i.e. Latin-2). And yet, 99% of the information
> >>>> in the Serbian Wikipedia dumps is written in Cyrillic (ISO 15924 script
> >>>> code Cyrl). So, unless your software performs romanization (i.e.
> >>>> converts Cyrillic to Latin) or cyrillization (i.e. vice versa)
> >>>> on-the-fly, at retrieval time (Wikipedia appears to be doing this), many
> >>>> attempts at information extraction will be doomed to fail. This directly
> >>>> affects common tasks such as keyword search, label-based SPARQL
> >>>> querying, named entity recognition, etc.
> >>>>
> >>>> What I would like to do is improve some of the existing DBpedia
> >>>> extractors, or develop new ones, that would take this problem into
> >>>> consideration and perform romanization of Wikipedia dumps so as to
> >>>> output information encoded in *both* scripts. Now, I know storing the
> >>>> same information twice might not be the most elegant solution, but
> >>>> unless someone is to include romanization/cyrillization features in the
> >>>> next version of SPARQL, I don't see a better solution at the moment. Of
> >>>> course, there is also the matter of perspective - one could argue that
> >>>> although the information is the same, the very fact that different
> >>>> character sequences are needed to describe the same piece of knowledge
> >>>> makes this problem fall into the domain of multilinguality.
> >>>>
> >>>> So, the general idea is to use a single IRI per resource, but have two
> >>>> separate triples for any literal originally encoded in Cyrillic. For
> >>>> example:
> >>>>
> >>>> <http://sr.dbpedia.org/resource/Парсер>
> >>>> <http://www.w3.org/2000/01/rdf-schema#label>
> >>>> "Парсер"@sr-Cyrl .
> >>>> <http://sr.dbpedia.org/resource/Парсер>
> >>>> <http://www.w3.org/2000/01/rdf-schema#label> "Parser"@sr-Latn .
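
The "one IRI, one label triple per script" idea from the example above could be sketched roughly as follows. This is an illustration only: `DualScriptSketch` and `labelTriples` are hypothetical names, the transform functions are passed in by the caller (a real setup would presumably use an icu4j transliterator there), and the escaping covers just backslashes and double quotes.

```scala
object DualScriptSketch {
  val rdfsLabel = "http://www.w3.org/2000/01/rdf-schema#label"

  // Emits one N-Triples-style line per (languageTag, transform) pair,
  // all sharing the same subject IRI.
  def labelTriples(subject: String, label: String,
                   scripts: Seq[(String, String => String)]): Seq[String] =
    scripts.map { case (tag, transform) =>
      // Escape backslashes first, then double quotes, inside the literal.
      val text = transform(label).replace("\\", "\\\\").replace("\"", "\\\"")
      s"""<$subject> <$rdfsLabel> "$text"@$tag ."""
    }
}
```

Called with an identity function for `sr-Cyrl` and a romanizing function for `sr-Latn`, this yields the two label lines shown in the example, one per script.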
> >>>>
> >>>> The above language tags are as per IANA Language Subtag Registry [1],
> >>>> which lists them as redundant, though, so a "sr" tag, instead, could be
> >>>> enough for both.
> >>>>
> >>>> I'm no DBpedia core expert, so some tips, ideas, directions or any other
> >>>> information that would help me get started would be much appreciated!
> >>>>
> >>>> Best,
> >>>> Uros
> >>>>
> >>>> [1]
> >>>> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> ------------------------------------------------------------------------------
> >>>> Rapidly troubleshoot problems before they affect your business. Most
> >>>> IT
> >>>> organizations don't have a clear picture of how application
> >>>> performance
> >>>> affects their revenue. With AppDynamics, you get 100% visibility into
> >>>> your
> >>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
> >>>> AppDynamics Pro!
> >>>>
> >>>>
> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
> >>>> _______________________________________________
> >>>> Dbpedia-developers mailing list
> >>>> [email protected]
> >>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >> --
> >> Kontokostas Dimitris
> >>
> >
>
>
>
--
Kontokostas Dimitris