Re: GSoC 2015 - WSD Module

2015-05-22 Thread Joern Kottmann
Hello,

one of the tasks we should start with is to define the interface for the WSD
component.

Please have a look at the other components in OpenNLP and try to propose an
interface in a similar style.
Can we use one interface for all the different implementations?
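
Just to give a rough idea of what I mean (the names below are made up and
open for discussion), something in the style of our other components could
look like this:

    import opennlp.tools.util.Span;

    // Rough illustration only, not a fixed proposal: one shared interface
    // for supervised and knowledge-based implementations.
    public interface WSDisambiguator {

      // Disambiguate one target token in a tokenized sentence and return
      // a sense identifier from the underlying sense inventory.
      String disambiguate(String[] tokens, int targetIndex);

      // Disambiguate several targets at once, e.g. for lexical sample data.
      String[] disambiguate(String[] tokens, Span[] targets);
    }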

Jörn


On Mon, May 18, 2015 at 3:27 PM, Mondher Bouazizi 
mondher.bouaz...@gmail.com wrote:

 Dear all,

 Sorry if you received multiple copies of this email (The links were
 embedded). Here are the actual links:

 *Figure:*

 https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
 *Semeval/senseval results summary:*

 https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
 *Literature survey of WSD techniques:*

 https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing

 Yours faithfully

 On Mon, May 18, 2015 at 10:17 PM, Anthony Beylerian 
 anthonybeyler...@hotmail.com wrote:

  Please excuse the duplicate email, we could not attach the mentioned
  figure.
  Kindly find it here.
  Thank you.
 
  From: anthonybeyler...@hotmail.com
  To: dev@opennlp.apache.org
  Subject: GSoC 2015 - WSD Module
  Date: Mon, 18 May 2015 22:14:43 +0900
 
 
 
 
  Dear all,
  In the context of building a Word Sense Disambiguation (WSD) module, after
  doing a survey of WSD techniques, we noted the following points:
  - WSD techniques can be split into three sets (supervised,
  unsupervised/knowledge-based, hybrid)
  - WSD is used for different directly related objectives such as all-words
  disambiguation, lexical sample disambiguation, multi/cross-lingual
  approaches, etc.
  - Senseval/Semeval seem to be good references for comparing different WSD
  techniques, since many of them were tested on the same data (but on
  different data at each event).
  - As a first solution, we propose to start by supporting the lexical sample
  type of disambiguation, i.e. disambiguating single/limited word(s) from an
  input text.
  Therefore, we have decided to collect information about the different
  techniques in the literature (such as references, performance, parameters,
  etc.) in this spreadsheet here.
  We have also collected the results of all the Senseval/Semeval exercises
  here.
  (Note that each document has many sheets.)
  The collected results could help decide which techniques to start with as
  the main models for each set of techniques (supervised/unsupervised).
  We also propose a general approach for the package in the attached figure.
  The main components are as follows:
  1- The different publicly available resources: WordNet, BabelNet,
  Wikipedia, etc.
  However, we would also like to allow users to use their own local
  resources, perhaps by defining a type of connector to the resource
  interface.
  2- The resource interface will provide both a sense inventory that the
  user can query and a knowledge base (semantic or syntactic information,
  etc.) that might be used depending on the technique.
  We might even later consider building a local cache for remote services.
  3- The WSD algorithms/techniques themselves, which will use the resource
  interface to access the required resources.
  These techniques will be split into two main packages, as on the left side
  of the figure: Supervised/Unsupervised.
  The utils package includes common tools used by both types of techniques.
  The details mentioned in each package should be common to all
  implementations of these abstract models.
  4- I/O could be processed in different formats (XML/JSON, etc.) or a
  simpler structure, following your recommendations.
  If you have any suggestions or recommendations, we would really appreciate
  discussing them and would like your guidance to iterate on this tool-set.
  Best regards,

  Anthony Beylerian, Mondher Bouazizi
 



Re: GSoC 2015 - WSD Module

2015-05-22 Thread Rodrigo Agerri
Hello Mondher (my response is about supervised WSD),

Thanks for the info, it is quite interesting. Apart from the comment
by Jörn, which I think is very important if we want to achieve
something given the time constraints of GSoC, I have a couple of
recommendations/comments on my part:

1. Rather than targeting the Lexical Sample task or all-words WSD, I
think it would be more practical to choose an approach/algorithm and
try to implement it in OpenNLP. One of the most (if not the most)
popular approaches is the It Makes Sense (IMS) system:

http://www.comp.nus.edu.sg/~nlp/sw/README.txt
https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf

I think that is achievable in the GSoC time frame.
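
To give an idea of why I think it fits well: IMS is essentially feature
extraction (surrounding words, their POS tags, local collocations) on top
of a standard classifier, so the training side could reuse our existing ML
package. A very rough sketch of that kind of feature extraction (just an
illustration, not the actual IMS code; the names are invented):

    import java.util.ArrayList;
    import java.util.List;

    // Illustration only: IMS-style features for one target word, i.e.
    // surrounding words, surrounding POS tags and a simple local collocation.
    public class IMSStyleFeatures {

      public static List<String> extract(String[] tokens, String[] posTags,
          int target) {
        List<String> features = new ArrayList<>();
        int window = 3;
        for (int i = Math.max(0, target - window);
            i <= Math.min(tokens.length - 1, target + window); i++) {
          if (i == target) {
            continue;
          }
          // surrounding word and POS features with their relative position
          features.add("w" + (i - target) + "=" + tokens[i].toLowerCase());
          features.add("p" + (i - target) + "=" + posTags[i]);
        }
        // a simple local collocation: previous word + next word
        String prev = target > 0 ? tokens[target - 1].toLowerCase() : "BOS";
        String next = target < tokens.length - 1
            ? tokens[target + 1].toLowerCase() : "EOS";
        features.add("c-1,1=" + prev + "_" + next);
        return features;
      }
    }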

2. As an aside, research has been moving towards supersense tagging
(SST), given the difficulty of WSD.

http://ttic.uchicago.edu/~altun/pubs/CiaAlt_EMNLP06.pdf

As you can see in the above paper, SST is approached as a sequence
labelling task rather than classification. This means that we could
reimplement the Ciaramita and Altun (2006) features by implementing
AdaptiveFeatureGenerators and creating a module structurally similar
to the NameFinder, but for SST.
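
For example (just a sketch, the class name is invented), one of the
Ciaramita and Altun style word-shape features could be added as a custom
feature generator, in the same way the name finder accepts them:

    import java.util.List;

    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

    // Illustration only: a word-shape feature in the style of
    // Ciaramita and Altun (2006), pluggable like any other name finder
    // feature generator.
    public class WordShapeFeatureGenerator implements AdaptiveFeatureGenerator {

      public void createFeatures(List<String> features, String[] tokens,
          int index, String[] previousOutcomes) {
        features.add("shape=" + shape(tokens[index]));
      }

      public void updateAdaptiveData(String[] tokens, String[] outcomes) {
      }

      public void clearAdaptiveData() {
      }

      // Collapse character runs into a coarse shape, e.g. "McDonald" -> "XxXx"
      private static String shape(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
          char s = Character.isUpperCase(c) ? 'X'
              : Character.isLowerCase(c) ? 'x'
              : Character.isDigit(c) ? 'd' : c;
          if (sb.length() == 0 || sb.charAt(sb.length() - 1) != s) {
            sb.append(s);
          }
        }
        return sb.toString();
      }
    }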

This also has the advantage of letting us move beyond the old SemCor
and Senseval datasets to current Tweet datasets and so on. See this
recent paper on SST on tweets:

http://aclweb.org/anthology/S14-1001

I think that for supervised WSD, we should pursue option 1 or 2 and
start defining the interface as Jörn has suggested.

Best,

Rodrigo

On Mon, May 18, 2015 at 2:14 PM, Anthony Beylerian
anthonybeyler...@hotmail.com wrote:
 Dear all,

 In the context of building a Word Sense Disambiguation (WSD) module, after
 doing a survey of WSD techniques, we noted the following points:

 - WSD techniques can be split into three sets (supervised,
 unsupervised/knowledge-based, hybrid)

 - WSD is used for different directly related objectives such as all-words
 disambiguation, lexical sample disambiguation, multi/cross-lingual
 approaches, etc.

 - Senseval/Semeval seem to be good references for comparing different WSD
 techniques, since many of them were tested on the same data (but on
 different data at each event).

 - As a first solution, we propose to start by supporting the lexical sample
 type of disambiguation, i.e. disambiguating single/limited word(s) from an
 input text.


 Therefore, we have decided to collect information about the different
 techniques in the literature (such as references, performance, parameters,
 etc.) in this spreadsheet here.
 We have also collected the results of all the Senseval/Semeval exercises
 here.
 (Note that each document has many sheets.)
 The collected results could help decide which techniques to start with as
 the main models for each set of techniques (supervised/unsupervised).

 We also propose a general approach for the package in the attached figure.
 The main components are as follows:

 1- The different publicly available resources: WordNet, BabelNet,
 Wikipedia, etc.
 However, we would also like to allow users to use their own local
 resources, perhaps by defining a type of connector to the resource
 interface.

 2- The resource interface will provide both a sense inventory that the user
 can query and a knowledge base (semantic or syntactic information, etc.)
 that might be used depending on the technique.
 We might even later consider building a local cache for remote services.

 3- The WSD algorithms/techniques themselves, which will use the resource
 interface to access the required resources.
 These techniques will be split into two main packages, as on the left side
 of the figure: Supervised/Unsupervised.
 The utils package includes common tools used by both types of techniques.
 The details mentioned in each package should be common to all
 implementations of these abstract models.

 4- I/O could be processed in different formats (XML/JSON, etc.) or a
 simpler structure, following your recommendations.

 If you have any suggestions or recommendations, we would really appreciate
 discussing them and would like your guidance to iterate on this tool-set.

 Best regards,

 Anthony Beylerian, Mondher Bouazizi


Re: W2VClassesDictionary class

2015-05-22 Thread Rodrigo Agerri
Hello,

You are right, I kept it there while I was doing the tests with the
WordClusterFeatureGenerator. I will remove it.

Best,

R

On Fri, May 22, 2015 at 1:51 PM, Joern Kottmann kottm...@gmail.com wrote:
 Hello,

 looks like this class was renamed to WordClusterDictionary.

 Can the class W2VClassesDictionary be removed?
 We shouldn't include it in RC4 if it is not necessary.

 Thanks,
 Jörn


W2VClassesDictionary class

2015-05-22 Thread Joern Kottmann
Hello,

looks like this class was renamed to WordClusterDictionary.

Can the class W2VClassesDictionary be removed?
We shouldn't include it in RC4 if it is not necessary.

Thanks,
Jörn


OpenNLP RC4

2015-05-22 Thread Joern Kottmann
Hello,

we should now be in a good state to do RC4. We finally solved
the performance problems with the parser, and a couple
of very minor things were fixed as well (e.g. the NOTICE file update).

A major addition since RC3 is the set of automated evaluation tests
to speed up our release process. I hope this will significantly reduce
the amount of time required to ensure RC4 is working properly.

Jörn


Re: GSoC 2015 - WSD Module

2015-05-22 Thread Mondher Bouazizi
Hi all,

Thanks Rodrigo for the feedback.
I don't mind starting with an IMS implementation as a first supervised
solution.
It seems to be a good first step.
As for SST, I will read more about it and will let you know.

On the other hand, how about the following interface that Anthony and I
prepared based on Jörn's recommendation?
We tried to be as close as possible to the other tools already implemented.

Link :
https://drive.google.com/file/d/0B7ON7bq1zRm3NTI1bGFfc3lZX0U/view?usp=sharing
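
For the resource interface from point 2 of our original proposal, the kind
of connector we have in mind would be roughly along these lines (a very
rough sketch, all names are placeholders and nothing is fixed yet):

    import java.util.List;

    // Placeholder sketch only: a sense inventory the disambiguators can
    // query, independent of the backing resource (WordNet, BabelNet, a
    // local dictionary, ...).
    public interface SenseInventoryConnector {

      // All candidate sense identifiers for a lemma with a given POS tag.
      List<String> getSenses(String lemma, String pos);

      // Gloss or definition text for a sense, if the resource provides one.
      String getGloss(String senseId);
    }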

Best regards,

Mondher, Anthony



On Fri, May 22, 2015 at 9:59 PM, Rodrigo Agerri rage...@apache.org wrote:

 Hello Mondher (my response is about supervised WSD),

 Thanks for the info, it is quite interesting. Apart from the comment
 by Jörn, which I think is very important if we want to achieve
 something given the time constraints of GSoC, I have a couple of
 recommendations/comments on my part:

 1. Rather than targeting the Lexical Sample task or all-words WSD, I
 think it would be more practical to choose an approach/algorithm and
 try to implement it in OpenNLP. One of the most (if not the most)
 popular approaches is the It Makes Sense (IMS) system:

 http://www.comp.nus.edu.sg/~nlp/sw/README.txt
 https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf

 I think that is achievable in the GSoC time frame.

 2. As an aside, research has been moving towards supersense tagging
 (SST), given the difficulty of WSD.

 http://ttic.uchicago.edu/~altun/pubs/CiaAlt_EMNLP06.pdf

 As you can see in the above paper, SST is approached as a sequence
 labelling task rather than classification. This means that we could
 reimplement the Ciaramita and Altun (2006) features by implementing
 AdaptiveFeatureGenerators and creating a module structurally similar
 to the NameFinder, but for SST.

 This also has the advantage of letting us move beyond the old SemCor
 and Senseval datasets to current Tweet datasets and so on. See this
 recent paper on SST on tweets:

 http://aclweb.org/anthology/S14-1001

 I think that for supervised WSD, we should pursue option 1 or 2 and
 start defining the interface as Jörn has suggested.

 Best,

 Rodrigo
