Re: [Apertium-stuff] Machine translation for Wikipedia

David Cuenca Tue, 23 Jul 2013 18:24:27 -0700

The automatic rule generation sounds like a great plan and I hope it works
out as well as you expect.


Is there an example of a DBpedia mapping template that I could take a look at?

Interesting that there is already something done to select from several
translations :)

And LanguageTool looks like a possible plugin for the newly deployed
VisualEditor. It would be great if the improvements made by analyzing
Wikipedia edits could be fed back into it :)

On Tue, Jul 23, 2013 at 8:12 PM, Jimmy O'Regan <[email protected]> wrote:

> On 23 July 2013 21:49, David Cuenca <[email protected]> wrote:
> > I like the example of a Wikimedia template, it looks clearer to me than
> > the current structure. Although I understand the concerns about taking too
> > big a leap, I would still advise designing the structure with a migration
> > to linked data in mind.
>
> I think I mentioned that I'd been planning to use the DBpedia
> extraction framework (which generates RDF) -- the plan is to generate
> triples using Lemon and OLIA, and generate the rules from RDF.
> Eventually, when there's a stable Wikidata schema, I'd plan to use
> that, and Scribunto and/or a dedicated extension to generate the rules
> directly, but RDF is involved from the outset, so it will be linked
> data-ready.
>
> > If it is done right, it will be a trivial step
> > using the pywikipedia bot framework (harvest_template.py).
>
> Maybe, but I'm not touching Python with a bargepole.
>
> > For contributors it has some advantages to work with the wikibase interface:
> > - it autocompletes from a controlled set of properties/elements (less
> > work typing)
> > - it lets you add or remove properties as you need them (less work
> > maintaining templates)
> > - you can see more easily in the history which property was changed and
> > to which value (less patrolling effort)
> >
>
> DBpedia's extraction framework has a more robust parser, and the
> mapping templates can be written on the wiki itself, which makes it a
> better option IMO. Also, it's written in Scala, which is a language I
> like to write in :)
>
> > It can also be noted that the template could be further simplified in some
> > cases if it is done in the context of Wik{tionary|idata}, linking the
> > "source example" and "target example" elements to their corresponding entry
> > and extracting from them their properties.
> > <source element> the (determiner)
> > <source element> big (adjective)
> > <source element> dog (noun)
> > <target element> el
> > <target element> perro
> > <target element> grande
> >
>
> With the editing additions I have in mind for translation, most simple
> rules would only need to have the examples written: the same
> underlying mechanisms could be used to infer the underlying part of
> speech, and from that the alignment -- i.e., as it would be looking up
> translations, it can see that 'the' has the translation 'el', so it
> can present 'determiner' based on the intersection of both, as well as
> introducing an alignment. But the manual template editing process is
> needed to get to that point. Actually, I'm becoming more convinced
> that I'll need to (re)implement at least one whole translator to
> really iron out the initial details, but that's only a matter of a
> week or two for, say, English-Spanish (which I maintain, and know
> inside out).
>
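A minimal sketch of the lookup-based inference described above, using the "the big dog" / "el perro grande" example from this thread. The lexicon contents, the data layout and the name `infer_rule` are all hypothetical, made up for illustration; this is not how Apertium or any Wiktionary tooling actually stores or infers anything.

```python
# Hypothetical toy lexicons: every entry below is invented for illustration.
SOURCE_POS = {
    "the": {"determiner", "adverb"},
    "big": {"adjective", "adverb"},
    "dog": {"noun", "verb"},
}
TARGET_POS = {
    "el": {"determiner"},
    "perro": {"noun", "adjective"},
    "grande": {"adjective", "noun"},
}
# Hypothetical translation table (source word -> possible target words).
TRANSLATIONS = {
    "the": {"el", "la"},
    "big": {"grande"},
    "dog": {"perro"},
}


def infer_rule(source, target):
    """Align each source word to the first target word that is a known
    translation of it, and keep only the parts of speech shared by both.

    Returns a list of (source_index, target_index, shared_pos) tuples.
    """
    rule = []
    for i, s in enumerate(source):
        for j, t in enumerate(target):
            if t in TRANSLATIONS.get(s, set()):
                # The intersection disambiguates: 'the' may be an adverb,
                # but 'el' is not, so only 'determiner' survives.
                shared = SOURCE_POS.get(s, set()) & TARGET_POS.get(t, set())
                rule.append((i, j, sorted(shared)))
                break
    return rule
```

With these toy tables, `infer_rule(["the", "big", "dog"], ["el", "perro", "grande"])` gives `[(0, 0, ['determiner']), (1, 2, ['adjective']), (2, 1, ['noun'])]`: the part of speech for each word plus the crossed adjective/noun alignment, from the two examples alone.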
> > It is possible to reach a higher level of granularity if there is a need for
> > it (in this case for prosody purposes):
>
> Eh... I had mentioned prosody as something desirable for multiwords,
> but I had been thinking in monolingual terms, rather than in terms of
> translation. The translation case may be the concrete use case to
> convince the general population of Wiktionary users that adding, say,
> parse annotation to a multiword template is worthwhile; the side
> effect of having extra data to use to train statistical parsers could
> then be used to support prosody, and thus have extra data for
> text-to-speech... slowly migrating Wiktionary+Wikidata into a general
> repository for linguistic data that isn't _generally_ thought of as
> something that's part of a dictionary -- though I can certainly see
> how knowing where to place the stress in a phrase can be helpful in a
> dictionary.
>
> But, as you mention it, I think there may be cases where prosody can
> be translation specific, and it can definitely be useful when the user
> is using text-to-speech to read the translation, so... good call. But
> I'd leave it 'til later :)
>
> > <source element> the (determiner, stressed) Sense 5 of
> > https://en.wiktionary.org/wiki/the#Article
> > <source element> big (adjective)
> > <source element> dog (noun)
> >
> > And maybe it would be easier to infer rules (for instance from the known
> > pairs en-es/fr-es, infer some simple rules for fr-en).
>
> Sure. Triangulation was the initial motivation for a declarative rule
> syntax, but the ease of understanding is now my primary concern. Also,
> keeping it close to the output of SMT training means it'll be easier
> to import that data, and then triangulate that too.
>
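A rough sketch of the triangulation idea above: given declarative reordering rules for en-es and fr-es, pivot through Spanish to get candidate fr-en rules. Representing a rule as a pair of part-of-speech patterns is an assumption made here for brevity (real rules would also carry alignments and agreement information), and the rule tables and function names are hypothetical.

```python
# Hypothetical rule tables: source POS pattern -> target POS pattern.
EN_ES = {("det", "adj", "noun"): ("det", "noun", "adj")}
FR_ES = {("det", "noun", "adj"): ("det", "noun", "adj")}


def invert(rules):
    """Swap source and target patterns.

    Assumes the rules are one-to-one; if two source patterns share a
    target pattern, the later one silently wins.
    """
    return {tgt: src for src, tgt in rules.items()}


def triangulate(a_to_pivot, b_to_pivot):
    """Infer a->b rules by composing a->pivot with the inverse of b->pivot."""
    pivot_to_b = invert(b_to_pivot)
    return {
        src: pivot_to_b[pivot]
        for src, pivot in a_to_pivot.items()
        if pivot in pivot_to_b  # keep only rules both pairs know about
    }


fr_en = triangulate(FR_ES, EN_ES)
```

Here `fr_en` comes out as `{("det", "noun", "adj"): ("det", "adj", "noun")}`. Anything inferred this way is only a candidate rule and would still need checking against real fr-en examples.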
> >
> > With the rest of your message I couldn't agree more. A rule modification
> > interface would simplify corrections and improvements quite a lot :)
>
> I hadn't been thinking about translation editing as rule modification,
> per se; just that we can mark the edits as they happen for the purpose
> of extraction (two GSoC students I co-mentored in past years worked on
> something similar[1], but we don't have a proper webmaster so it's not
> available on the website). Insofar as the text would be annotated in a
> way that allows alternative translations to be presented, that's
> primarily motivated by wanting the translator to have as much machine
> assistance as possible. I just happen to know that it can be put to
> other uses :)
>
> [1] At my insistence: Marcin Miłkowski (of LanguageTool) had used
> Wikipedia edits to generate grammar correction rules; this seemed a
> natural application of more or less the same idea. (I also insisted
> that they integrate LanguageTool for grammar checking :)
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>



-- 
Etiamsi omnes, ego non
