I like the example of a Wikimedia template; it looks clearer to me than
the current structure. Although I understand the concern about taking
too big a leap, I would still advise designing the structure with a
migration to linked data in mind. If it is done right, the migration
will be a trivial step using the pywikipedia bot framework
(harvest_template.py), for instance along the lines of the example below.
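
To give an idea, once the rules live in a template like the one you
sketch, harvesting them into a Wikibase repository could look roughly
like this. The property IDs are invented here (a local Wikibase would
define its own), and the -transcludes / parameter:property syntax is the
documented usage of harvest_template.py, so it is worth checking the
current version of the script before relying on it:

python pwb.py harvest_template -transcludes:"translation_rule" source:P101 target:P102 alignment:P103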
For contributors, working with the Wikibase interface has some advantages:
- it autocompletes from a controlled set of properties/elements (less
typing)
- it lets you add or remove properties as you need them (less work
maintaining templates)
- it makes it easier to see in the history which property was changed and
to which value (less patrolling effort)

It can also be noted that the template could be further simplified in some
cases if it is done in the context of Wik{tionary|idata}, by linking the
"source example" and "target example" elements to their corresponding
entries and extracting their properties from there (a small extraction
sketch follows the example):
<source element> the (determiner)
<source element> big (adjective)
<source element> dog (noun)
<target element> el
<target element> perro
<target element> grande
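
As a rough illustration of the "extract the properties from the entry"
part, something like the following could recover the part of speech from
the linked Wiktionary entry. It is only a sketch with pywikibot: the
heading list is incomplete and the regex ignores the per-language
sections of a Wiktionary page, so it is not a real parser.

import re
import pywikibot

POS_HEADINGS = {'Noun', 'Verb', 'Adjective', 'Adverb', 'Determiner',
                'Pronoun', 'Preposition', 'Conjunction', 'Article'}

def parts_of_speech(entry, lang='en'):
    # Fetch the Wiktionary page for the entry and list its
    # part-of-speech section headings.
    site = pywikibot.Site(lang, 'wiktionary')
    text = pywikibot.Page(site, entry).text
    return [h for h in re.findall(r'^===+\s*([^=]+?)\s*===+\s*$', text, re.M)
            if h in POS_HEADINGS]

# e.g. parts_of_speech('dog') should include 'Noun' (and 'Verb')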

It is possible to reach a higher level of granularity if there is a need
for it (in this case for prosody purposes):
<source element> the (determiner, stressed) Sense 5 of
https://en.wiktionary.org/wiki/the#Article
<source element> big (adjective)
<source element> dog (noun)

And maybe it would make it easier to infer rules (for instance, from the
known pairs en-es and fr-es, infer some simple rules for fr-en); a rough
sketch of that kind of composition follows.
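
This is only a toy sketch of "pivoting" two rules that share es as the
target into a candidate fr-en rule. It assumes the positional alignment
convention from your template (position i holds the source position that
target position i comes from), and it skips the hard part of checking
that the two rules really describe the same target pattern:

def invert(alignment):
    # [1, 3, 2] (source position per target position) -> target position
    # per source position.
    inverse = [0] * len(alignment)
    for target_pos, source_pos in enumerate(alignment, start=1):
        inverse[source_pos - 1] = target_pos
    return inverse

def pivot(en_es, fr_es):
    # Compose en->es and fr->es alignments into a candidate fr->en one.
    es_per_en = invert(en_es)   # which es position each en word maps to
    return [fr_es[es_pos - 1] for es_pos in es_per_en]

# en->es NP rule: det adj noun -> det noun adj, alignment 1 3 2
# fr->es NP rule: det noun adj -> det noun adj, alignment 1 2 3
print(pivot([1, 3, 2], [1, 2, 3]))  # [1, 3, 2]: "det noun adj" -> "det adj noun"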

With the rest of your message I couldn't agree more. A rule modification
interface would simplify corrections and improvements quite a lot :)


On Tue, Jul 23, 2013 at 11:57 AM, Jimmy O'Regan <[email protected]> wrote:

> On 23 July 2013 14:44, David Cuenca <[email protected]> wrote:
> > On Mon, Jul 22, 2013 at 6:14 PM, Jimmy O'Regan <[email protected]>
> > wrote:
> >>
> >> I'm Boldizsár's mentor :) His work so far has been towards a more
> >> declarative way of writing rules, so outputting rules in the format I
> >> have in mind would be a trivial modification.
> >
> >
> > Great :) Do you have any design document about what the rule format is
> > going to look like?
>
> I'm still weighing up the pros and cons of a couple of options, but
> they all revolve around expressing something like an SMT phrase table
> in wikimedia templates, so the basic elements are source, target, and
> alignment. There are still open questions: alignment is not always
> necessary, since it can be inferred or deduced; it may be better in
> human terms to allow an example; some word classes may be specific to
> a set of rules, such that it makes no sense to make them general;
> should the language pair be encoded explicitly, or inherited from a
> category?; phrases need attributes from different elements; and so on.
>
> All of that said, here's a basic sketch of what I have in mind:
>
> {{translation_rule
> | pair = en-es
> | phrase type = NP <!-- this could be determined from either
> source_head or target_head, but it's nicer to have it -->
> | source = determiner adjective noun
> | target = determiner noun adjective
> | alignment = 1 3 2 <!-- not necessary in this example, but would be
> if there were more than one of each PoS -->
> <!-- this could also be written as 1-1 3-2 2-3 -- it would be nice to
> be able to use that convention, to import statistically derived rules,
> but it's only necessary to know one set of positions when writing by
> hand -->
> | source head = 3
> | target head = 2 <!-- not necessary with alignment -->
> | source example = the big dog
> | target example = el perro grande
> | target attributes = {{attribs | definiteness = 1 | gender = 2, 1 |
> number = 2, 1}} <!-- the actual attributes would be those used in
> wiktionary -->
> }}
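
On the two alignment notations above: converting between them should be
mechanical, so supporting both would not cost much. A minimal sketch,
assuming the positional form lists, for each target position, the source
position it comes from, and that the pair form only contains one-to-one
alignments like the ones written by hand:

def positional_to_pairs(alignment):
    # '1 3 2' -> '1-1 3-2 2-3'
    sources = [int(n) for n in alignment.split()]
    return ' '.join('%d-%d' % (src, tgt)
                    for tgt, src in enumerate(sources, start=1))

def pairs_to_positional(alignment):
    # '1-1 3-2 2-3' -> '1 3 2'; only valid for one-to-one alignments, so
    # many-to-many statistical alignments would need extra handling.
    pairs = [tuple(int(n) for n in p.split('-')) for p in alignment.split()]
    return ' '.join(str(src) for src, tgt in sorted(pairs, key=lambda p: p[1]))

print(positional_to_pairs('1 3 2'))        # 1-1 3-2 2-3
print(pairs_to_positional('1-1 3-2 2-3'))  # 1 3 2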
>
> > If the rules could be converted into a linked data format (like in
> > Wikidata), that would be fantastic. That would mean that you could
> > have a Wikibase repository on your Apertium wiki, and that could be
> > used as a collaboration platform for writing and correcting rules.
> > These rules can be linked as annotations, so you can build
> > domain-specific rules and know in which texts they are being used.
> >
>
> That's an 'eye on the horizon' aim. My primary aim is to get something
> that's not too hard to figure out (at least, in the easy cases) --
> something that can be understood by the open source localisation
> contributors.
>
> >>
> >> Sure. Also, and I think I haven't mentioned this before, I have been
> >> looking at this as a way of making customised translators. What I
> >> envisage is that, on top of a common base of rules, and common lexicon
> >> extracted from Wiktionary, that domain-specific translators would be
> >> built: that at the very least, individual WikiProjects could specify
> >> their own custom rules and lexica, as divergences from the baseline. I
> >> was thinking that these could be tied into categories, but I don't
> >> have a good solution for conflicts between categories, yet.
> >
> >
> > What about reusing Wikidata (or DBpedia) items as a way of determining
> > which rules and lexica to use? What I am thinking is:
>
> Determining automatically always carries with it a margin of error,
> and that's why I want to start from a basis of explicit control. But
> that's not to say I would rule out document classification or other
> automatic methods. Also, it's simply better to acknowledge from the
> outset that translation is domain specific. But, yes, DBpedia
> Spotlight has document classification facilities that can be used for
> that, and (IIRC) there's a GSoC project improving them.
>
> > 1. There is a repository (it could be Wikipedia articles or specific
> > websites) of domain-specific tagged texts and manually selected rules
> > (all this stored as semantic annotations). Each article has a set of
> > identified entities, a set of identified lexica, and rules.
>
> That's a little too manual for my liking :) Consider this: machine
> translation for Wiki (or blog) editing does not need to be completely
> automatic - human-assisted MT may be preferable - and the work on
> parsoid suggests to me that something like the following would be
> not only possible, but relatively easy: say the user is translating
> from faux Spanish, and gets something like 'foobar de quux' that can
> be translated in a number of ways: 'quux's foobar', 'quux foobar',
> 'foobar of quux', etc. The text can be annotated at the chunk level,
> so when the user hovers over the phrase they see that a different
> number of possible rules could have applied. They click, and choose
> the one that was most appropriate. The editor, for reasons of
> transparency, should insert an indication that MT was used; and to cut
> down on RC patrolling of MT edits, it should also indicate whether or
> not the MT output was edited by the user. The latter can be used to
> extract such edits later, to create this kind of corpus.
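
Just to check that I understand the chunk-level annotation idea, this is
the kind of record I imagine each MT-produced chunk carrying. All the
field names here are invented for illustration; they are not part of
parsoid or of any existing schema:

chunk_annotation = {
    'source': 'foobar de quux',
    'output': "quux's foobar",      # what the MT engine chose
    'alternatives': [               # other rules that could have applied
        'quux foobar',
        'foobar of quux',
    ],
    'mt_used': True,                # transparency marker for RC patrolling
    'edited_by_user': False,        # whether the output was post-edited
}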
>
> > 2. The entities in the text-to-be-translated are identified (e.g.
> > Spotlight).
>
> I would make this a two-phase process, at least for Wikipedia: where
> the entity is already contained in wikilinks, wikidata should be the
> first source for the translation (i.e., the entity can be considered
> to be disambiguated). Spotlight can use multiple annotators, and if
> there isn't a 'null' wikilink-based annotator already, adding one
> would be the matter of a few lines of code.
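
That two-phase idea could start out as something as simple as this: first
resolve the entities that are already wikilinked through their Wikidata
sitelinks, and only hand the rest to Spotlight. A pywikibot-based sketch;
the exception name varies between pywikibot versions, and the Spotlight
fallback is left out entirely:

import re
import pywikibot

def entities_from_wikilinks(wikitext, source_lang='en', target_lang='es'):
    # Map wikilinked titles to their title in the target language, via the
    # sitelinks of the Wikidata item connected to each linked page.
    site = pywikibot.Site(source_lang, 'wikipedia')
    entities = {}
    for title in re.findall(r'\[\[([^|\]#]+)', wikitext):
        page = pywikibot.Page(site, title.strip())
        try:
            item = pywikibot.ItemPage.fromPage(page)  # already disambiguated
            item.get()
            entities[title] = item.sitelinks.get(target_lang + 'wiki')
        except pywikibot.NoPage:  # no item yet: leave it for the second phase
            entities[title] = None
    return entities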
>
> > 3. These entities are used to find related tagged texts, and the rules
> > and lexica used in the matching texts are marked as preferred (assuming
> > they exist as linked data).
> >
> > It can be computationally expensive, but it would remove the need for
> > manual selection and categories.
>
> I like it. I don't think it would be that expensive -- at least, as
> these things go -- because it's basically a search problem. It could
> be done with a modified translation memory, or with Lucene, etc. It
> could quite easily be realised as an extra set of attributes in the
> index that Spotlight already uses, making it part of the existing
> lookup.
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>



-- 
Etiamsi omnes, ego non
