Hi,
I attached an initial patch to OPENNLP-758.
However, we are currently modifying things a bit, since many approaches need to
be supported, and we would like your recommendations.
Here are some notes:
1- We used extJWNL.
2- [WSDisambiguator] is the main interface.
3- [Loader] loads the required resources.
4- Please check [FeaturesExtractor] for the methods Rodrigo mentioned.
5- [Lesk] has many variants; we have already implemented some, but we are
wondering about the preferred way to switch from one to another:
as of now we use one of them as the default, but we thought of either making a
parameter list to fill or making separate classes for each variant, or
otherwise following your preference.
6- The other classes are for convenience.
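To make the parameter-list option from point 5 concrete, here is a minimal sketch of switching Lesk variants through a single parameter object with a default, instead of one class per variant. All names (LeskVariant, LeskParameters, the variant set) are hypothetical illustrations, not the actual classes in the patch, and the method bodies are placeholders standing in for the real overlap logic.

```java
import java.util.List;

// Hypothetical variant set; the real patch may support a different list.
enum LeskVariant { ORIGINAL, SIMPLIFIED, EXTENDED }

class LeskParameters {
    LeskVariant variant = LeskVariant.SIMPLIFIED; // the current default
    int windowSize = 4;                           // context window size
}

class Lesk {
    private final LeskParameters params;

    Lesk(LeskParameters params) { this.params = params; }

    // Dispatches to the variant chosen in the parameter object;
    // the bodies are placeholders for the real gloss-overlap logic.
    String disambiguate(List<String> context, String target) {
        switch (params.variant) {
            case ORIGINAL:  return target + "#original";
            case EXTENDED:  return target + "#extended";
            default:        return target + "#simplified";
        }
    }
}
```

The advantage over separate classes is that callers keep one entry point and variants stay cheap to add; the drawback is that variant-specific options all live in one parameter object.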
We will try to submit patches frequently on the separate issues, following your feedback.
Best regards,
Anthony
Date: Wed, 10 Jun 2015 11:42:56 +0200
Subject: Re: GSoC 2015 - WSD Module
From: kottm...@gmail.com
To: dev@opennlp.apache.org
You can attach the patch to one of the issues, or you can create a new issue.
In the end it doesn't matter much; what is important is that we make progress
here and get the initial code into our repository. Subsequent changes can
then be done as a patch series.
Please try to submit the patch as quickly as possible.
Jörn
On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri rage...@apache.org wrote:
Hello,
On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
mondher.bouaz...@gmail.com wrote:
Dear Rodrigo,
As Anthony mentioned in his previous email, I have already started the
implementation of the IMS approach. The pre-processing and the feature
extraction are already finished. Regarding the approach itself, it shows some
potential according to the authors, though the proposed features are not many
and are fairly basic.
Hi, yes, the features are not that complex, but it is good to have a
working system; if needed, the feature set can be improved and enriched
later. As stated in the paper, the IMS approach leverages
parallel data to obtain state-of-the-art results in both the lexical
sample and all-words tasks on the Senseval-3 and SemEval-2007 datasets.
I think it will be nice to have a working system with this algorithm
as part of the WSD component in OpenNLP (following the API discussion
earlier in this thread) and to perform some evaluations to know where
the system stands with respect to state-of-the-art results on those
datasets. Once this is operative, I think it will be a good moment to
start discussing additional/better features.
I think the approach itself might be
enhanced if we add more context-specific features from some other
approaches (to do that, I need to run many experiments using different
combinations of features; however, that should not be a problem).
Speaking of the feature sets, in the API Google doc I have not seen
anything about the implementation of the feature extractors; could you
perhaps provide some extra info about that (in that same document, for
example)?
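For reference, the IMS paper describes three feature families around the target word: surrounding words, POS tags of neighbouring words, and local collocations. A minimal sketch of two of these, under the assumption of a plain token list; the class and method names (ImsFeatures, collocation, surroundingWords) are illustrative, not the patch's actual FeaturesExtractor API.

```java
import java.util.ArrayList;
import java.util.List;

class ImsFeatures {
    // Collocation C(i,j): the ordered words from relative position i to j
    // around the target, padded at sentence boundaries.
    static String collocation(List<String> tokens, int target, int i, int j) {
        StringBuilder sb = new StringBuilder("C").append(i).append(',').append(j).append('=');
        for (int k = target + i; k <= target + j; k++) {
            String w = (k >= 0 && k < tokens.size()) ? tokens.get(k) : "<pad>";
            sb.append(w).append('_');
        }
        return sb.toString();
    }

    // Surrounding-word features: lowercased words within +/- window of the
    // target, excluding the target itself.
    static List<String> surroundingWords(List<String> tokens, int target, int window) {
        List<String> feats = new ArrayList<>();
        for (int k = Math.max(0, target - window);
             k <= Math.min(tokens.size() - 1, target + window); k++) {
            if (k != target) feats.add("SW=" + tokens.get(k).toLowerCase());
        }
        return feats;
    }
}
```

The POS-tag features would follow the same windowed pattern over a parallel tag sequence.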
But the approach itself requires a linear SVM classifier, and as far as I
know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use libsvm?
I think you can try with MaxEnt to start with. In the meantime,
@Jörn has commented that there is a plugin component in
OpenNLP for using third-party ML libraries, and that he tested it with
Mallet. Perhaps he could comment on using that functionality for SVMs.
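One way to keep the classifier question open is to hide the learner behind a small interface, so a MaxEnt backend can be used now and a linear-SVM backend (via the plugin mechanism or libsvm) swapped in later. A minimal sketch under that assumption; every name here (SenseClassifier, OverlapBackend) is hypothetical, and the stand-in backend below is a trivial overlap scorer, not a real learner.

```java
import java.util.List;
import java.util.Map;

// Hypothetical abstraction: IMS would only depend on this interface.
interface SenseClassifier {
    // Returns the predicted sense label for one feature vector.
    String classify(List<String> features);
}

// Trivial stand-in backend for illustration: picks the sense whose seed
// features overlap most with the input. A MaxEnt or SVM backend would
// implement the same interface.
class OverlapBackend implements SenseClassifier {
    private final Map<String, List<String>> senseSeeds;

    OverlapBackend(Map<String, List<String>> senseSeeds) {
        this.senseSeeds = senseSeeds;
    }

    public String classify(List<String> features) {
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, List<String>> e : senseSeeds.entrySet()) {
            int score = 0;
            for (String f : features) {
                if (e.getValue().contains(f)) score++;
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With this shape, switching from MaxEnt to an SVM is a one-line change at construction time rather than a change to the IMS code itself.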
Regarding the training data, I started collecting some from different
sources. Most of the existing rich corpora are licensed (including the ones
mentioned in the paper). The free ones I have for now are from the Senseval
and SemEval websites. However, these are used just to evaluate the proposed
methods in the workshops; therefore, the words to disambiguate are few in
number, though the training data for each word are rich enough.
In any case, the first tests with the collected Senseval and SemEval data
should be finished soon. However, I am not sure there is a rich enough
dataset we can use to build our model for the WSD module in the OpenNLP
library. If you have any recommendations, I would be grateful for your help
on this point.
Well, as I said in my previous email, research around word senses is
moving from WSD towards supersense tagging, where there are recent
papers and freely available tweet datasets, for example. In any case,
we can look more into it, but in the meantime SemCor for training
and the Senseval/SemEval-2007 datasets for evaluation should be enough to
compare your system with the literature.
As Jörn mentioned sending an initial patch, should we separate our code
and upload two different patches to the two issues we created on Jira
(however, this means a lot of redundancy in the code), or shall we keep
it in one project and upload that? If we opt for the latter case,