RE: GSoC 2015 - WSD Module

2015-06-10 Thread Anthony Beylerian
Hi,

I attached an initial patch to OPENNLP-758.
However, we are currently revising things a bit, since many approaches need to 
be supported, and we would like your recommendations.
Here are some notes:

1. We used extJWNL.
2. [WSDisambiguator] is the main interface.
3. [Loader] loads the required resources.
4. Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
5. [Lesk] has many variants; we have already implemented some, but we are 
wondering about the preferred way to switch from one to another. For now we 
use one of them as the default, but we have considered either exposing a 
parameter list to fill in or making a separate class for each variant (see 
the sketch after this list); otherwise we will follow your preference.
6. The other classes are for convenience.
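
To make point 5 concrete, here is a minimal sketch of the parameter-object
option; the concrete names and signatures (LeskVariant, LeskParameters, the
disambiguate method) are hypothetical placeholders, not the actual code in
the patch:

    // Hypothetical sketch: a single Lesk class whose variant is selected
    // through a parameter object instead of one subclass per variant.
    interface WSDisambiguator {               // signature assumed for illustration
        String disambiguate(String[] tokens, int targetIndex);
    }

    enum LeskVariant { BASIC, EXTENDED, EXTENDED_EXPONENTIAL }

    class LeskParameters {
        LeskVariant variant = LeskVariant.BASIC; // the default variant
        int windowSize = 4;                      // context words on each side
    }

    class Lesk implements WSDisambiguator {
        private final LeskParameters params;

        Lesk() { this(new LeskParameters()); }   // default configuration
        Lesk(LeskParameters params) { this.params = params; }

        @Override
        public String disambiguate(String[] tokens, int targetIndex) {
            switch (params.variant) {
                case EXTENDED:
                    // also overlap glosses of related synsets
                    break;
                case EXTENDED_EXPONENTIAL:
                    // distance-weighted overlap scoring
                    break;
                default:
                    // original gloss-overlap Lesk
                    break;
            }
            return null; // would return the selected sense key
        }
    }

The separate-classes option would instead subclass [Lesk] once per variant;
the parameter object keeps the public API to a single class, which may be
easier to evolve as more variants are added.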

We will try to submit patches frequently to the separate issues, following the feedback.

Best regards,

Anthony

Re: GSoC 2015 - WSD Module

2015-06-10 Thread Joern Kottmann
You can attach the patch to one of the issues, or you can create a new issue.
In the end it does not matter much; what is important is that we make progress
here and get the initial code into our repository. Subsequent changes can
then be done in a patch series.

Please try to submit the patch as quickly as possible.

Jörn

On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri rage...@apache.org wrote:

 Hello,

 On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
 mondher.bouaz...@gmail.com wrote:
  Dear Rodrigo,
 
  As Anthony mentioned in his previous email, I have already started the
  implementation of the IMS approach. The pre-processing and the feature
  extraction are already finished. Regarding the approach itself, it shows
  some potential according to the authors, though the proposed features are
  few and basic.

 Hi, yes, the features are not that complex, but it is good to have a
 working system first; the feature set can then be improved/enriched if
 needed. As stated in the paper, the IMS approach leverages parallel data
 to obtain state-of-the-art results on both the lexical sample and
 all-words tasks of the Senseval-3 and SemEval-2007 datasets.

 I think it will be nice to have a working system with this algorithm
 as part of the WSD component in OpenNLP (following the API discussion
 earlier in this thread) and to perform some evaluations to see where
 the system stands with respect to state-of-the-art results on those
 datasets. Once this is operational, it will be a good moment to start
 discussing additional/better features.

  I think the approach itself might be enhanced if we add more
  context-specific features from some other approaches. (To do that, I
  need to run many experiments using different combinations of features;
  however, that should not be a problem.)

 Speaking of the feature sets, in the API Google doc I have not seen
 anything about the implementation of the feature extractors. Could you
 perhaps provide some extra information about that (in the same document,
 for example)?
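
 For reference, the IMS paper describes three feature types around the
 target word: part-of-speech tags of the neighbouring words, the
 surrounding words themselves, and local collocations. A minimal sketch of
 such an extractor, with illustrative names and window sizes (this is not
 the actual FeaturesExtractor in the patch), might look like this:

     import java.util.ArrayList;
     import java.util.List;

     // Illustrative IMS-style feature extraction: POS tags of neighbours,
     // surrounding words, and a local collocation around the target token.
     class ImsStyleFeatures {
         static List<String> extract(String[] tokens, String[] posTags, int target) {
             List<String> features = new ArrayList<String>();
             for (int i = -3; i <= 3; i++) {
                 int j = target + i;
                 if (i != 0 && j >= 0 && j < tokens.length) {
                     features.add("pos" + i + "=" + posTags[j]);      // neighbour POS
                     features.add("word=" + tokens[j].toLowerCase()); // surrounding word
                 }
             }
             if (target > 0 && target + 1 < tokens.length) {
                 // one local collocation: the words immediately left and right
                 features.add("colloc=" + tokens[target - 1] + "_" + tokens[target + 1]);
             }
             return features;
         }
     }

 Such string features feed directly into the classifier discussed below.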

  But the approach itself requires a linear SVM classifier, and as far as I
  know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
  libsvm?

 I think you can start with a MaxEnt classifier. In the meantime, @Jörn
 has sometimes commented that there is a plugin component in OpenNLP for
 using third-party ML libraries, and that he has tested it with Mallet.
 Perhaps he could comment on using that functionality to plug in SVMs.
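
 As a starting point, here is a minimal sketch of training one per-word
 MaxEnt sense classifier with OpenNLP's own ML API. The package layout is
 assumed from the 1.6-era opennlp-tools (it has moved between releases),
 and the sense ids and feature strings are invented for illustration:

     import opennlp.tools.ml.maxent.GIS;
     import opennlp.tools.ml.model.Event;
     import opennlp.tools.ml.model.MaxentModel;
     import opennlp.tools.util.ObjectStream;
     import opennlp.tools.util.ObjectStreamUtils;

     public class SenseClassifierSketch {
         public static void main(String[] args) throws Exception {
             // One Event per sense-annotated instance of the target word:
             // outcome = sense id, context = extracted feature strings.
             ObjectStream<Event> events = ObjectStreamUtils.createObjectStream(
                 new Event("bank%1", new String[] {"word=river", "pos-1=JJ"}),
                 new Event("bank%2", new String[] {"word=money", "pos-1=DT"}));

             // 100 GIS iterations, feature cutoff 0 (toy-sized data).
             MaxentModel model = GIS.trainModel(events, 100, 0);

             // Classify a new instance: take the most probable outcome.
             double[] probs = model.eval(new String[] {"word=money", "pos-1=DT"});
             System.out.println(model.getBestOutcome(probs));
         }
     }

 An SVM could later be swapped in through the ML plugin mechanism without
 changing the feature extraction side.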

 
  Regarding the training data, I have started collecting some from
  different sources. Most of the existing rich corpora are licensed
  (including the ones mentioned in the paper). The free ones I have
  obtained so far are from the Senseval and SemEval websites. However,
  these are used just to evaluate the methods proposed in the workshops.
  Therefore, the words to disambiguate are few in number, though the
  training data for each word are rich enough.
 
  In any case, the first tests with the collected Senseval and SemEval
  data should be finished soon. However, I am not sure there is a rich
  enough dataset we can use to build our model for the WSD module in the
  OpenNLP library. If you have any recommendation, I would be grateful
  for your help on this point.

 Well, as I said in my previous email, research around word senses is
 moving from WSD towards supersense tagging, for which there are recent
 papers and freely available tweet datasets, for example. In any case, we
 can look more into it, but in the meantime SemCor for training and the
 Senseval/SemEval-2007 datasets for evaluation should be enough to
 compare your system with the literature.

 
  As Jörn mentioned sending an initial patch: should we separate our code
  and upload two different patches to the two issues we created on the
  Jira (however, this means a lot of redundancy in the code), or shall we
  keep it in one project and upload that? If we opt for the latter, which
  issue should we upload the patch to?

 In my opinion, it should be a single patch and a single component, with
 the different algorithm implementations inside it. Any other opinions?

 Cheers,

 Rodrigo