On Mon, Jul 22, 2013 at 6:14 PM, Jimmy O'Regan <[email protected]> wrote:
> I'm Boldizsár's mentor :) His work so far has been towards a more
> declarative way of writing rules, so outputting rules in the format I
> have in mind would be a trivial modification.
>
Great :) Do you have any design document describing what the rule format is
going to look like?
If the rules could be converted into a linked data format (as in Wikidata),
that would be fantastic. You could then have a Wikibase repository on the
Apertium wiki and use it as a collaboration platform for writing and
correcting rules. The rules could also be linked to texts as annotations, so
you could build domain-specific rule sets and know in which texts each rule
is being used.
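Just to make that more concrete, here is a rough sketch (Python with rdflib)
of how a single transfer rule might be expressed as linked data and annotated
with a text that uses it. The namespaces, property names and the rule itself
are placeholders I am inventing for illustration, not any agreed vocabulary:

from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical namespaces; a real setup would use whatever vocabulary the
# Wikibase repository exposes.
RULE = Namespace("http://wiki.apertium.org/rules/")
PROP = Namespace("http://wiki.apertium.org/prop/")

g = Graph()
g.bind("rule", RULE)
g.bind("prop", PROP)

# An invented noun-adjective reordering rule for an en->es pair.
rule = RULE["eng-spa/noun-adj-reorder"]
g.add((rule, PROP.sourceLanguage, Literal("en")))
g.add((rule, PROP.targetLanguage, Literal("es")))
g.add((rule, PROP.pattern, Literal("adj nom")))
g.add((rule, PROP.output, Literal("nom adj")))

# Annotation linking the rule to an article it is used in, so that
# domain-specific rule sets can be assembled per article or category.
article = URIRef("https://en.wikipedia.org/wiki/Chemistry")
g.add((article, PROP.usesRule, rule))

print(g.serialize(format="turtle"))

The same triples could of course live in a Wikibase repository rather than a
plain RDF graph; the point is only that rules, texts, and the links between
them become queryable data.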
> Sure. Also, and I think I haven't mentioned this before, I have been
> looking at this as a way of making customised translators. What I
> envisage is that, on top of a common base of rules and a common lexicon
> extracted from Wiktionary, domain-specific translators would be built:
> that, at the very least, individual WikiProjects could specify their own
> custom rules and lexica, as divergences from the baseline. I was thinking
> that these could be tied into categories, but I don't have a good
> solution for conflicts between categories, yet.
>
What about reusing Wikidata (or DBpedia) items as a way of determining
which rules and lexica to use? What I am thinking is:
1. There is a repository (it could be Wikipedia articles or specific
websites) of domain-specific tagged texts and manually selected rules, all
of it stored as semantic annotations. Each article has a set of identified
entities, plus the lexica and rules associated with it.
2. The entities in the text to be translated are identified (e.g. with
DBpedia Spotlight).
3. These entities are used to find related tagged texts, and the rules and
lexica used in the matching texts are marked as preferred (assuming they
exist as linked data).
It could be computationally expensive, but it would remove the need for
manual selection and for categories (see the sketch below).
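Roughly, steps 2 and 3 could look like this in Python. The repository, rule
and lexicon names are all invented for illustration; in practice the lookup
would be SPARQL queries against the annotation store rather than an
in-memory dict:

from collections import Counter

# Hypothetical annotation repository: article -> entities, rules, lexica.
# The entities would really be Wikidata/DBpedia URIs.
repository = {
    "Chemistry": {
        "entities": {"chemistry", "chemical_compound"},
        "rules": {"noun-adj-reorder", "acronym-expansion"},
        "lexica": {"chem-terms"},
    },
    "Football": {
        "entities": {"association_football"},
        "rules": {"noun-adj-reorder"},
        "lexica": {"sport-terms"},
    },
}

def preferred_resources(entities_in_text):
    """Rank rules and lexica by how many entity-sharing articles use them."""
    rules, lexica = Counter(), Counter()
    for annotations in repository.values():
        if annotations["entities"] & entities_in_text:
            rules.update(annotations["rules"])
            lexica.update(annotations["lexica"])
    return rules.most_common(), lexica.most_common()

# Entities identified (e.g. by DBpedia Spotlight) in the text to be translated.
print(preferred_resources({"chemical_compound", "enzyme"}))

That last call would rank the chemistry rules and lexicon first, because the
"Chemistry" annotations share an entity with the input text.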
On Mon, Jul 22, 2013 at 6:14 PM, Jimmy O'Regan <[email protected]> wrote:
> On 22 July 2013 20:03, David Cuenca <[email protected]> wrote:
> > Hi Jim, thanks a lot for taking a look at the proposal and for sharing
> > your thoughts.
> >
> > The Wiktionary+Wikidata future is currently in the making. There are two
> > main options (linked in the document) which will be discussed during
> > Wikimania in Hong Kong [1]. There is nothing decided yet, other than the
> > need for semantic dictionary capabilities to support Wiktionary.
> > Additionally, this summer there is a recent graduate in Computational
> > Linguistics working with the Wikidata team to evaluate the options.
> >
>
> I've followed some of the discussion, and this proposal
> (http://www.wikidata.org/wiki/Wikidata:Wiktionary) has definitely
> improved.
>
> > About multiword expressions, they can probably be decomposed into
> > sub-elements using the property "subdivides into". This could prove to be
> > quite powerful, since it would not only indicate what the sub-elements
> > are, but also which role and sense they have in the fragment. I agree
> > with you that fragments could also be included, either directly into
> > Wikidata when relevant, or in the (forecasted) database that should be
> > handling annotations.
>
> Yes, and the Lemon ontology could be a good example for how that could
> be represented in Wikidata, but there's also the issue of multiwords
> with inflection. From what I've seen of Scribunto's Wikidata handling,
> it seems like the perfect solution: individual forms can be generated
> based on what is known about each element, rather than writing out
> each form, or adding huge amounts of individual templates.
>
> > Prosody could also be handled as a mix of pre-processing and
> > annotations.
> >
>
> Sure, I was just thinking of it as an example of something that could
> be handled, that isn't :)
>
> > It shouldn't be that hard to integrate statistical information in
> > Wikidata once the semantic dictionary structure is in place. For
> > instance, I can well imagine that a statement indicating a translation
> > can have a sub-statement ('qualifier' in Wikidata jargon) indicating the
> > probability or other relevant information. Although completely unrelated,
> > in this example you can see how the property "depicts" has this kind of
> > sub-property [2].
> >
>
> Yes, but citing statistically derived probabilities, particularly for
> bilingual alignments, is a little involved, and getting a truly useful
> level of detail, such as source corpus, tuning parameters, even
> tokenisation method used, may constitute original research.
>
> > Obviously DBpedia Spotlight cannot be used as it is now; something
> > similar should be built (upon it or separately) to handle words,
> > inflections, and monolingual sense selection.
>
> On second thoughts... Spotlight is pretty flexible. It should be
> possible to get it to read and write Apertium's internal format, which
> would at least partially solve this.
>
> > Inflection forms should be contained in
> > Wikidata. Name inflections are tricky, but they might happen at a later
> > stage and (hopefully) in an automated or semi-automated way.
> >
>
> Sure; individually, they're not outside the scope of Wiktionary, and
> that should be enough for the majority of cases. (As I think more
> about a few of the things I mentioned in the last email, they don't seem
> like such a big deal :)
>
> > I'm glad that you are working on lexica and rule extraction from
> > Wikimedia templates, because your work could be reused whenever the new
> > Wiktionary is ready. Maybe a good thing to do as well would be to analyze
> > how to store the rules in a wiki website, either handling them as code
> > (the Scribunto extension [3] can give some insights), as linked data (see
> > the Wikibase Repository extension [4], which is the one powering
> > Wikidata), or something in between.
>
> Going forward, I could definitely see either Scribunto or a dedicated
> extension taking the place of the extraction component (i.e., that the
> rules could be generated directly from Wikipedia).
>
> I started from a linked data (specifically, Lemon ontology)
> perspective, and moved over to Wikimedia when I realised that it would
> be much more user-friendly to do it that way, rather than writing
> triples any other way.
>
> > Do you think this could also be related to Boldizsár's project, a
> > visual interface to edit the underlying structure?
>
> I'm Boldizsár's mentor :) His work so far has been towards a more
> declarative way of writing rules, so outputting rules in the format I
> have in mind would be a trivial modification.
>
> >
> > The pros & cons section still needs some expansion, and of course the
> > impact of MT on contributions is negligible. Still, it could be a
> > potential concern that should be mentioned. All in all, I think it should
> > also be said that eventually this way of pre-annotating the source text
> > could be applied to other websites and not just Wikipedia. That might
> > make the proposal even more attractive.
>
> Sure. Also, and I think I haven't mentioned this before, I have been
> looking at this as a way of making customised translators. What I
> envisage is that, on top of a common base of rules and a common lexicon
> extracted from Wiktionary, domain-specific translators would be built:
> that, at the very least, individual WikiProjects could specify their own
> custom rules and lexica, as divergences from the baseline. I was thinking
> that these could be tied into categories, but I don't have a good
> solution for conflicts between categories, yet.
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>
--
Etiamsi omnes, ego non