Hi,
I'm working on a new tool (or should I say "option"?) for Apertium DixTools.
I had this idea while adding some new words to the monolingual
dictionaries, and I think that most of the non-expert contributors to
language pairs will recognize the workflow: when they want to add a word,
they think of another word that has the same paradigm, look it up, copy the
entry, replace the lemma and the stem with the ones from the new word, and
save the new dix file.
While expert users may know that, for example, in Catalan the paradigm
for feminine nouns that form the plural by adding an "s" is "abell/a__n",
when I want to add half a dozen words after days (or weeks) without touching
the dictionaries, I have to look up what the paradigm was called. And
to add 6 words I will probably need to look up 3-4 different paradigms.
There's a paper [1] by Miquel Esplà-Gomis, Víctor M. Sánchez-Cartagena and
Juan Antonio Pérez Ortiz where, with the help of a non-expert user, and
using a corpus, a tool tries to guess the paradigm of an unknown word to
add it to the dictionary. Francis Tyers also proposes [2] a system to learn
those paradigms in an unsupervised environment. But what I want to achieve
is a much easier task: avoid having to manually look for a paradigm name
when adding a new word if you know another word that follows the same
paradigm. So you could think of it as a naive/dumb version of Miquel et
al.'s tool.
To do that, I'm planning two "versions" of the tool: a supervised mode and a
batch mode. Right now, I have implemented the supervised mode, and it works
as follows (examples from the Catalan en-ca dictionary):
* The user enters a pair of words, the first one being an unknown word and
the second one a word already in the dictionary. I'm using "|", surrounded
by whitespace, as a divider for clarity
** assignatura | barrera
* The tool proposes one candidate (or more, in case the word is ambiguous,
e.g. it's both a noun and an adjective)
** <e lm="barrera"><i>barrer</i><par n="abell/a__n"/></e>
* When the user accepts it, the tool "guesses" the most probable stem by
comparing the new word with the "existing" one and its stem, and "expands"
the new word according to it, asking for confirmation
** assignatura :assignatura<n><f><s>
** assignatures :assignatura<n><f><pl>
* If it's incorrect, the tool starts to show all possible stems, starting
with the full lemma and removing one character each time, until the user
accepts one of them
* When accepted, the tool generates the entry and adds it to the dictionary
** <e lm="assignatura"><i>assignatur</i><par n="abell/a__n"/></e>
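The stem-guessing steps above can be sketched in a few lines of Python.
This is only an illustration, not the tool's actual code: the function
names are made up, the paradigm is a hand-written table whose tags are
copied from the example output above (a real tool would read it from the
dix file), and details such as multiword stems are ignored.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical stand-in for the "abell/a__n" paradigm:
# (surface ending, lemma ending, tags), taken from the example above.
ABELLA_N = [
    ("a", "a", "<n><f><s>"),
    ("es", "a", "<n><f><pl>"),
]


def guess_stem(new_word, known_lemma, known_stem):
    """Guess the stem of new_word from the ending of the known word.

    The ending is whatever known_lemma has beyond known_stem
    ("barrera" minus "barrer" gives "a"). Returns None if the new
    word does not share that ending.
    """
    if not known_lemma.startswith(known_stem):
        return None
    ending = known_lemma[len(known_stem):]
    if not new_word.endswith(ending):
        return None
    return new_word[:len(new_word) - len(ending)]


def candidate_stems(new_word):
    """Fallback list: the full lemma first, then one character shorter each time."""
    return [new_word[:i] for i in range(len(new_word), 0, -1)]


def expand(stem, paradigm):
    """Return (surface form, analysis) pairs to show the user for confirmation."""
    return [(stem + surface, stem + lemma_end + tags)
            for surface, lemma_end, tags in paradigm]


def make_entry(lemma, stem, par_name):
    """Build the dictionary entry once the user has accepted the stem."""
    return "<e lm=%s><i>%s</i><par n=%s/></e>" % (
        quoteattr(lemma), escape(stem), quoteattr(par_name))


if __name__ == "__main__":
    stem = guess_stem("assignatura", "barrera", "barrer")
    for surface, analysis in expand(stem, ABELLA_N):
        print(surface, ":" + analysis)
    print(make_entry("assignatura", stem, "abell/a__n"))
```

If the user rejects the guessed stem, candidate_stems("assignatura")
gives the list to walk through, from "assignatura" down to "a".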
The batch mode may have some restrictions, e.g. the "existing" word can't
be ambiguous (have more than one PoS), but it will allow adding a list of
words from a text file (CSV, for example).
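As the batch mode is not implemented yet, the following Python sketch only
shows one way it could behave. It assumes a two-column CSV (new word, known
word) and a hypothetical in-memory index mapping each known lemma to its
(stem, paradigm name) entries; pairs whose known word is missing or
ambiguous are set aside instead of being guessed.

```python
import csv
import io


def plan_batch(rows, index):
    """Split (new word, known word) rows into accepted and rejected.

    index maps a known lemma to a list of (stem, paradigm name) tuples;
    more than one tuple means the known word is ambiguous.
    """
    accepted, rejected = [], []
    for row in rows:
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            rejected.append((",".join(row), "", "expected two columns"))
            continue
        new_word, known = (cell.strip() for cell in row)
        entries = index.get(known, [])
        if not entries:
            rejected.append((new_word, known, "not in dictionary"))
            continue
        if len(entries) > 1:
            rejected.append((new_word, known, "ambiguous"))
            continue
        stem, par_name = entries[0]
        ending = known[len(stem):]
        if not new_word.endswith(ending):
            rejected.append((new_word, known, "ending does not match"))
            continue
        new_stem = new_word[:len(new_word) - len(ending)]
        accepted.append((new_word, new_stem, par_name))
    return accepted, rejected


if __name__ == "__main__":
    # Hypothetical index; the two paradigm names for "blanc" are invented.
    index = {
        "barrera": [("barrer", "abell/a__n")],
        "blanc": [("blanc", "hypothetical__adj"), ("blanc", "hypothetical__n")],
    }
    data = io.StringIO("assignatura,barrera\nnegre,blanc\n")
    accepted, rejected = plan_batch(csv.reader(data), index)
    print(accepted)
    print(rejected)
```

Rejected pairs could then be handled one by one in the supervised mode.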
Now my questions are:
* Do you find this is a useful tool?
* Does something similar already exist (that I'm unaware of)?
* Do you have any suggestion for the name of this "tool"?
This tool may not be used by people who work on a daily basis
with the dictionaries, as I'm pretty sure they'll go faster by manually
editing the dictionaries; it won't be useful for total newbies
either, as (right now) the tool assumes some knowledge about the dictionary
formats. But I guess that with a tool like this, we could have a bigger
base of contributors to some language pairs.
[1] http://www.aclweb.org/anthology/R/R11/R11-1047.pdf
[2] http://wiki.apertium.org/wiki/Improved_corpus-based_paradigm_matching
Regards,
--
< Xavi Ivars >
< http://xavi.ivars.me >