Hi Jim, thanks a lot for taking a look at the proposal and for sharing your
thoughts.
The Wiktionary+Wikidata future is currently in the making. There are two
main options (linked in the document) which will be discussed during
Wikimania in Hong Kong [1]. Nothing has been decided yet, other than the
need for semantic dictionary capabilities to support Wiktionary.
Additionally, this summer a recent graduate in Computational Linguistics
is working with the Wikidata team to evaluate the options.
About multiword expressions, they can probably be decomposed into
sub-elements using the property "subdivides into". This could prove to be
quite powerful, since it would not only indicate what the sub-elements
are, but also which role and sense each of them has in the fragment. I
agree with you that fragments could also be included, either directly in
Wikidata when relevant, or in the (planned) database that should handle
annotations. Prosody could also be handled as a mix of pre-processing and
annotations.
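To make that concrete, here is a minimal sketch in Python of what such a
decomposition could look like, using your "the whole nine yards" example
from below. The structure and the "subdivides into" property are
illustrative assumptions, not an existing Wikidata schema:

    # Hypothetical representation of a multiword expression entry; the
    # "subdivides into" property and this whole structure are illustrative,
    # not part of any existing Wikidata schema.
    whole_nine_yards = {
        "expression": "the whole nine yards",
        "language": "en",
        "sense": "everything; the full extent of something",
        "subdivides_into": [
            {"fragment": "the",   "role": "determiner", "sense": "definite article"},
            {"fragment": "whole", "role": "adjective",  "sense": "entire"},
            {"fragment": "nine",  "role": "number",     "sense": "9"},
            {"fragment": "yards", "role": "noun",       "sense": "unit of length"},
        ],
    }

    # The role sequence is exactly what a translation rule would generalise
    # over: 'determiner adjective number noun', as in your example.
    print(" ".join(part["role"] for part in whole_nine_yards["subdivides_into"]))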
It shouldn't be that hard to integrate statistical information into
Wikidata once the semantic dictionary structure is in place. For instance,
I can well imagine that a statement indicating a translation could have a
sub-statement ('qualifier' in Wikidata jargon) indicating the probability
or other relevant information. Although the subject is completely
unrelated, in this example [2] you can see how the property "depicts"
carries qualifiers of this kind.
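As a rough sketch of how that could look (the property IDs below are
invented; a "translation" property and a "probability" qualifier don't
exist in Wikidata today), such a claim would follow the same JSON layout
that the "depicts" statements on Q12418 already use:

    import json

    # Sketch of a Wikidata-style claim carrying a probability qualifier.
    # The property IDs (P9001 "translation", P9002 "probability") are
    # invented for illustration; only the mainsnak/qualifiers layout
    # follows the real Wikidata JSON model.
    translation_claim = {
        "mainsnak": {
            "snaktype": "value",
            "property": "P9001",  # hypothetical "translation" property
            "datavalue": {"value": "maison", "type": "string"},
        },
        "qualifiers": {
            "P9002": [{           # hypothetical "probability" qualifier
                "snaktype": "value",
                "property": "P9002",
                "datavalue": {"value": "0.83", "type": "string"},
            }],
        },
    }

    print(json.dumps(translation_claim, indent=2))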
Obviously DBpedia Spotlight cannot be used as it is now; something similar
would have to be built (on top of it or separately) to handle words,
inflections, and monolingual sense selection. Inflected forms should be
stored in Wikidata. Name inflections are tricky, but they could be tackled
at a later stage and (hopefully) in an automated or semi-automated way.
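Just to illustrate what "semi-automated" might mean here, a toy sketch for
the Polish genitive you mention below; real Polish declension is of course
far messier, so the two rules are placeholders only:

    # Toy sketch of semi-automated name inflection: guess the Polish
    # genitive of a personal name and leave it to a human editor to
    # confirm. The two rules below cover only the simplest masculine
    # patterns and are placeholders, not a real morphology.
    def guess_polish_genitive(name: str) -> str:
        def inflect(word: str) -> str:
            if word.endswith("a"):
                return word[:-1] + "y"  # Obama -> Obamy
            return word + "a"           # Barack -> Baracka
        return " ".join(inflect(w) for w in name.split())

    print(guess_polish_genitive("Barack Obama"))  # Baracka Obamy, for an editor to confirm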
I'm glad that you are working on lexica and rule extraction from Wikimedia
templates, because your work could be reused whenever the new Wiktionary is
ready. Another worthwhile task would be to analyze how to store the rules
in a wiki website, either handling them as code (the Scribunto extension
[3] can give some insights), as linked data (see the Wikibase Repository
extension [4], the one powering Wikidata), or something in between. Do you
think this could also be related to Boldizsár's project, a visual interface
to edit the underlying structure?
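For the "something in between" option, I imagine rules stored as structured
statements that a module then turns into executable form. A rough Python
sketch, with an invented rule schema (a real implementation would compile
such data into Apertium's XML transfer rules):

    # Rough sketch of the "linked data" end of the spectrum: a transfer
    # rule stored as plain data rather than code. The schema is invented
    # purely for illustration; a real module would compile it into
    # Apertium's XML rule format.
    rule_as_data = {
        "pattern": ["det", "adj", "num", "n"],  # matches "the whole nine yards"
        "output":  [3, 1, 2, 0],                # reordering for a hypothetical target language
        "comment": "multiword-derived reordering rule",
    }

    def apply_rule(rule, tagged_words):
        """Apply a reordering rule when the tag sequence matches its pattern."""
        tags = [tag for _, tag in tagged_words]
        if tags == rule["pattern"]:
            return [tagged_words[i] for i in rule["output"]]
        return tagged_words

    sentence = [("the", "det"), ("whole", "adj"), ("nine", "num"), ("yards", "n")]
    print(apply_rule(rule_as_data, sentence))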
The pros & cons section still needs some expansion, and of course the
impact of MT on contributions is negligible. Still, it is a potential
concern that should be mentioned. All in all, I think it should also be
said that this way of pre-annotating the source text could eventually be
applied to other websites, not just Wikipedia. That might make the
proposal even more attractive.
[1] https://wikimania2013.wikimedia.org/wiki/Main_Page
[2] http://www.wikidata.org/wiki/Q12418
[3] https://www.mediawiki.org/wiki/Extension:Scribunto
[4] https://www.mediawiki.org/wiki/Extension:Wikibase_Repository
On Mon, Jul 22, 2013 at 11:55 AM, Jimmy O'Regan <[email protected]> wrote:
> On 22 July 2013 15:55, David Cuenca <[email protected]> wrote:
> > Hi there!
> >
> > I'm preparing an MT proposal for Wikipedia. I would like to get some
> > feedback from the Apertium group regarding its feasibility, and
> > potential problems.
> > This proposal relies on a semantic dictionary (future
> > Wikidata/Wiktionary), a semantic annotation system, a user interface
> > and, of course, Apertium.
>
> I had meant to write back to you the last time you wrote, but I got
> bogged down in trying to follow what the Wiktionary+Wikidata future
> might look like.
>
> I do recall specifically that you linked to some discussion (mailing
> list?) where the issue was whether the basic unit should be word or
> expression (or something to that effect). IMO, how multiword
> expressions are handled in the future Wiktionary is quite important,
> as this can be generalised to a wiki-specific means of writing
> translation rules: 'the whole nine yards' as an expression can be the
> basis for a translation rule 'determiner adjective number noun', etc.,
> though at this point I should possibly take a step back: it's not
> inconceivable that a wiki-based mechanism could be used to add
> phrase-level information, such as parse fragments, which could then be
> used as an additional source of information for parsers, generating
> translation rules, perhaps even prosody for text-to-speech.
>
> I also would like to see how statistical information can be integrated
> into Wikidata, as this can be useful for translation. (Despite the
> apparent divide of rule-based vs. statistical, all modern RBMT systems
> use statistics in some form (even if it's as simple as a lexicographer
> comparing word frequencies to decide which is the best translation),
> and, conversely, all modern statistical systems use any number of what
> are effectively rules, from the explicit parse-derived rules of SAMT,
> to even the 'phrase templates' generated by word aligners).
>
> In case I've buried my point under the jargon :) -- I think it's not
> only possible to build translation systems via wikis, I think that it
> can be done in a way that's better for linguistic knowledge in
> general. I hope to have a prototype ready by September, that extracts
> lexica and rules specified in Wikimedia templates using the DBpedia
> extraction system, and generates first Apertium dictionaries and
> rules, later SAMT-style statistical rules.
>
> >
> >
> > https://docs.google.com/document/d/1S-Ycqsyn9fMVqHQxiK7nfdKhuspW4l1OyIKh4a73kJg/edit?usp=sharing
> >
> > I know that currently Apertium is based on XML files, so I don't know
> > how much effort it would take to adapt it to interact with a semantic
> > dictionary plus annotations.
>
> The biggest precondition would be that entries coming from, say,
> DBpedia Spotlight would need to have equivalent grammatical
> annotations sufficient to be processed as 'native' dictionary entries.
> Without gender etc. information, we can't perform grammatical
> agreements, so there would be grammatical errors. If such information
> was available (e.g., from Wiktionary), then that's doable. The bigger
> problem is for the named entities in Wikipedia: for people, there are
> ways of determining gender, but other matters of inflection are not
> quite as easily determined: that the genitive of 'Barack Obama' in
> Polish is 'Baracka Obamy' is something that's not encoded in
> Wiktionary. (There may be possibilities based on mining wiki links,
> and extrapolating, but this is an untested area).
>
> 'Since content in other languages might be more readily available, the
> incentive to write content in someone’s own language might be less
> pressing'
>
> I would have thought that Wikipedia's experience with MT-based
> contributions to date would show this to be false.
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>
--
Etiamsi omnes, ego non