Re: GSoC 2015 - WSD Module

2015-07-09 Thread Joern Kottmann
Please open jira issues for this and for the other GSoC tasks.
I would like to use jira to plan the outstanding tasks.

Are you working on this currently?

Jörn

On Mon, 2015-06-22 at 00:55 +0900, Anthony Beylerian wrote:
> Dear Jörn,
> Thank you for that.
> 
> After further surveying, I was thinking of beginning the implementation of
> an approach based on context clustering as a next step, perhaps similar to
> the one in [1], which relies on a public (CC-A licensed) dataset [2].
> Since clustering is usually done using K-means, which could take some time
> with large data, this step was already done previously, and the results
> were made publicly available in [3] with up to 20 closest clusters per
> "phrase".
> The authors of [1] propose to subsequently apply a Naive Bayes classifier,
> as described in their paper. I believe this is straightforward enough to
> implement as another unsupervised approach within the proposed time frame.
> I would like your opinion.
> Regards,
> Anthony
> [1] http://nlp.cs.rpi.edu/paper/wsd.pdf
> [2] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
> [3] http://webdocs.cs.ualberta.ca/~bergsma/PhrasalClusters/
> 
> 
> > Date: Fri, 19 Jun 2015 16:41:20 +0200
> > Subject: Re: GSoC 2015 - WSD Module
> > From: kottm...@gmail.com
> > To: dev@opennlp.apache.org
> > 
> > Hello,
> > 
> > I will dedicate time tonight to get this pulled in the sandbox and will
> > then also provide some feedback.
> > We can then create new patches against the sandbox to fix further issues.
> > 
> > Jörn
> > 
> > On Fri, Jun 19, 2015 at 11:02 AM, Anthony Beylerian <
> > anthonybeyler...@hotmail.com> wrote:
> > 
> > > Thank you for the reply; I am guessing for now we will use the other
> > > sources.
> > >
> > > By the way, I have uploaded a newer patch on the same issue [1].
> > > I would like to know whether the approach to setting parameters is
> > > acceptable.
> > >
> > > Also, we reference some model files locally (tokenizer, tagger, etc.)
> > > because we need them for the preprocessing chain. For example:
> > >
> > > ++
> > > private static String modelsDir =
> > >     "src\\test\\resources\\opennlp\\tools\\disambiguator\\";
> > >
> > > TokenizerModel tokenizerModel = new TokenizerModel(
> > >     new FileInputStream(modelsDir + "en-token.bin"));
> > > tokenizer = new TokenizerME(tokenizerModel);
> > > ++
> > >
> > > I thought of adding these files (.bin) to the test folder, but could
> > > anyone recommend a more elegant way to do this?
> > > Thanks!
> > >
> > > Anthony
> > >
> > > [1] : https://issues.apache.org/jira/browse/OPENNLP-758
> > >
> > >
> > > > From: rage...@apache.org
> > > > Date: Fri, 19 Jun 2015 10:18:12 +0200
> > > > Subject: Re: GSoC 2015 - WSD Module
> > > > To: dev@opennlp.apache.org
> > > >
> > > > Thanks for the update and the updated patch.
> > > >
> > > > With respect to the licensing of BabelNet, I do not think we can
> > > > redistribute CC BY-NC-SA resources here, but others in this project
> > > > and Apache in general will probably know better than me.
> > > >
> > > > Best,
> > > >
> > > > Rodrigo
> 





Re: GSoC 2015 - WSD Module

2015-06-30 Thread Joern Kottmann
Can you please open some jira issues so we can better keep track of what
has to be done?

Jörn
On Jun 28, 2015 10:23 PM, "Joern Kottmann"  wrote:

> Yes, the performance testing has to be there, otherwise it is hard to
> tell if it works or not.
>
> Jörn
>
> On Mon, 2015-06-29 at 02:02 +0900, Anthony Beylerian wrote:
> > Dear Jörn,
> >
> > As a first milestone, we now have the main interface with two
> > implementations (one unsupervised, one supervised); maybe we can add an
> > evaluator for performance tests and comparison with the test data we
> > currently have (the SemEval and SensEval test sets).
> >
> > Best,
> >
> > Anthony
> >
> > > Subject: Re: GSoC 2015 - WSD Module
> > > From: kottm...@gmail.com
> > > To: dev@opennlp.apache.org
> > > Date: Thu, 25 Jun 2015 21:47:22 +0200
> > >
> > > On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
> > > > Hi,
> > > >
> > > > I attached an initial patch to OPENNLP-758.
> > > > However, we are currently modifying things a bit since many
> > > > approaches need to be supported, but we would like your
> > > > recommendations.
> > > > Here are some notes:
> > > >
> > > > 1- We used extJWNL.
> > > > 2- [WSDisambiguator] is the main interface.
> > > > 3- [Loader] loads the required resources.
> > > > 4- Please check [FeaturesExtractor] for the methods mentioned by
> > > > Rodrigo.
> > > > 5- [Lesk] has many variants; we have already implemented some, but
> > > > we are wondering about the preferred way to switch from one to the
> > > > other. As of now we use one of them as the default, but we thought
> > > > of either making a parameter list to fill or making separate classes
> > > > for each, or otherwise following your preference.
> > > > 6- The other classes are for convenience.
> > > >
> > > > We will try to patch frequently on the separate issues, following
> > > > the feedback.
> > >
> > >
> > > Sounds good, I reviewed it and think what we have is quite ok.
> > >
> > > Most important now is to fix the smaller issues (see the jira issue)
> and
> > > explain to us how it can be run.
> > >
> > > The midterm evaluation is coming up next week as well.
> > >
> > > How are we standing with the milestone we set?
> > >
> > > Jörn
> > >
> >
>
>


Re: GSoC 2015 - WSD Module

2015-06-28 Thread Joern Kottmann
Yes, the performance testing has to be there, otherwise it is hard to
tell if it works or not.

Jörn

On Mon, 2015-06-29 at 02:02 +0900, Anthony Beylerian wrote:
> Dear Jörn,
> 
> As a first milestone, we now have the main interface with two
> implementations (one unsupervised, one supervised); maybe we can add an
> evaluator for performance tests and comparison with the test data we
> currently have (the SemEval and SensEval test sets).
> 
> Best,
> 
> Anthony
> 
> > Subject: Re: GSoC 2015 - WSD Module
> > From: kottm...@gmail.com
> > To: dev@opennlp.apache.org
> > Date: Thu, 25 Jun 2015 21:47:22 +0200
> > 
> > On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
> > > Hi,
> > > 
> > > I attached an initial patch to OPENNLP-758.
> > > However, we are currently modifying things a bit since many approaches
> > > need to be supported, but we would like your recommendations.
> > > Here are some notes:
> > >
> > > 1- We used extJWNL.
> > > 2- [WSDisambiguator] is the main interface.
> > > 3- [Loader] loads the required resources.
> > > 4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
> > > 5- [Lesk] has many variants; we have already implemented some, but we
> > > are wondering about the preferred way to switch from one to the other.
> > > As of now we use one of them as the default, but we thought of either
> > > making a parameter list to fill or making separate classes for each,
> > > or otherwise following your preference.
> > > 6- The other classes are for convenience.
> > >
> > > We will try to patch frequently on the separate issues, following the
> > > feedback.
> > 
> > 
> > Sounds good, I reviewed it and think what we have is quite ok.
> > 
> > Most important now is to fix the smaller issues (see the jira issue) and
> > explain to us how it can be run.
> > 
> > The midterm evaluation is coming up next week as well.
> > 
> > How are we standing with the milestone we set?
> > 
> > Jörn
> > 
> 





RE: GSoC 2015 - WSD Module

2015-06-28 Thread Anthony Beylerian
Dear Jörn,

As a first milestone, we now have the main interface with two
implementations (one unsupervised, one supervised); maybe we can add an
evaluator for performance tests and comparison with the test data we
currently have (the SemEval and SensEval test sets).

Best,

Anthony

> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
> Date: Thu, 25 Jun 2015 21:47:22 +0200
> 
> On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
> > Hi,
> > 
> > I attached an initial patch to OPENNLP-758.
> > However, we are currently modifying things a bit since many approaches
> > need to be supported, but we would like your recommendations.
> > Here are some notes:
> >
> > 1- We used extJWNL.
> > 2- [WSDisambiguator] is the main interface.
> > 3- [Loader] loads the required resources.
> > 4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
> > 5- [Lesk] has many variants; we have already implemented some, but we
> > are wondering about the preferred way to switch from one to the other.
> > As of now we use one of them as the default, but we thought of either
> > making a parameter list to fill or making separate classes for each,
> > or otherwise following your preference.
> > 6- The other classes are for convenience.
> >
> > We will try to patch frequently on the separate issues, following the
> > feedback.
> 
> 
> Sounds good, I reviewed it and think what we have is quite ok.
> 
> Most important now is to fix the smaller issues (see the jira issue) and
> explain to us how it can be run.
> 
> The midterm evaluation is coming up next week as well.
> 
> How are we standing with the milestone we set?
> 
> Jörn
> 
  

Re: GSoC 2015 - WSD Module

2015-06-25 Thread Joern Kottmann
On Mon, 2015-06-22 at 00:55 +0900, Anthony Beylerian wrote:
> Dear Jörn,
> Thank you for that.
> 
> After further surveying, I was thinking of beginning the implementation of
> an approach based on context clustering as a next step, perhaps similar to
> the one in [1], which relies on a public (CC-A licensed) dataset [2].
> Since clustering is usually done using K-means, which could take some time
> with large data, this step was already done previously, and the results
> were made publicly available in [3] with up to 20 closest clusters per
> "phrase".
> The authors of [1] propose to subsequently apply a Naive Bayes classifier,
> as described in their paper. I believe this is straightforward enough to
> implement as another unsupervised approach within the proposed time frame.
> I would like your opinion.

Your users can just download the dataset and do the clustering themselves.
It should be possible to do that anyway. All the code necessary to do that
should be available as part of your contribution.

Jörn




Re: GSoC 2015 - WSD Module

2015-06-25 Thread Joern Kottmann
On Mon, 2015-06-22 at 00:55 +0900, Anthony Beylerian wrote:
> Dear Jörn,
> Thank you for that.
> 
> After further surveying, I was thinking of beginning the implementation of
> an approach based on context clustering as a next step, perhaps similar to
> the one in [1], which relies on a public (CC-A licensed) dataset [2].
> Since clustering is usually done using K-means, which could take some time
> with large data, this step was already done previously, and the results
> were made publicly available in [3] with up to 20 closest clusters per
> "phrase".
> The authors of [1] propose to subsequently apply a Naive Bayes classifier,
> as described in their paper. I believe this is straightforward enough to
> implement as another unsupervised approach within the proposed time frame.
> I would like your opinion.

Sounds good to me. I will read the paper now, and come back here if I
have any questions.

Jörn




Re: GSoC 2015 - WSD Module

2015-06-25 Thread Joern Kottmann
On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
> Hi,
> 
> I attached an initial patch to OPENNLP-758.
> However, we are currently modifying things a bit since many approaches
> need to be supported, but we would like your recommendations.
> Here are some notes:
> 
> 1- We used extJWNL.
> 2- [WSDisambiguator] is the main interface.
> 3- [Loader] loads the required resources.
> 4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
> 5- [Lesk] has many variants; we have already implemented some, but we are
> wondering about the preferred way to switch from one to the other.
> As of now we use one of them as the default, but we thought of either
> making a parameter list to fill or making separate classes for each, or
> otherwise following your preference.
> 6- The other classes are for convenience.
> 
> We will try to patch frequently on the separate issues, following the
> feedback.


Sounds good, I reviewed it and think what we have is quite ok.

Most important now is to fix the smaller issues (see the jira issue) and
explain to us how it can be run.

The midterm evaluation is coming up next week as well.

How are we standing with the milestone we set?

Jörn





RE: GSoC 2015 - WSD Module

2015-06-21 Thread Anthony Beylerian
Dear Jörn,
Thank you for that.

After further surveying, I was thinking of beginning the implementation of
an approach based on context clustering as a next step, perhaps similar to
the one in [1], which relies on a public (CC-A licensed) dataset [2].
Since clustering is usually done using K-means, which could take some time
with large data, this step was already done previously, and the results were
made publicly available in [3] with up to 20 closest clusters per "phrase".
The authors of [1] propose to subsequently apply a Naive Bayes classifier,
as described in their paper. I believe this is straightforward enough to
implement as another unsupervised approach within the proposed time frame.
I would like your opinion.
Regards,
Anthony

[1] http://nlp.cs.rpi.edu/paper/wsd.pdf
[2] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
[3] http://webdocs.cs.ualberta.ca/~bergsma/PhrasalClusters/
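
A toy sketch of the Naive Bayes step described above, assuming the K-means
clustering is already done and each context word has been mapped to a
cluster id: the classifier below just counts labelled bags of features with
add-one smoothing. All class names, senses, and feature strings here are
invented for illustration, not taken from [1] or the module's code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal multinomial Naive Bayes over string features (e.g. the
// phrasal-cluster ids of the words surrounding the ambiguous word).
public class NaiveBayesWsd {
  private final Map<String, Integer> senseCounts = new HashMap<>();
  private final Map<String, Map<String, Integer>> featCounts = new HashMap<>();
  private final Set<String> vocab = new HashSet<>();
  private int total = 0;

  // One training instance: a sense label plus its bag of context features.
  public void train(String sense, List<String> feats) {
    senseCounts.merge(sense, 1, Integer::sum);
    total++;
    Map<String, Integer> fc =
        featCounts.computeIfAbsent(sense, s -> new HashMap<>());
    for (String f : feats) {
      fc.merge(f, 1, Integer::sum);
      vocab.add(f);
    }
  }

  // argmax over senses of log P(sense) + sum_f log P(f | sense),
  // with add-one (Laplace) smoothing on the feature probabilities.
  public String predict(List<String> feats) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Integer> e : senseCounts.entrySet()) {
      Map<String, Integer> fc = featCounts.get(e.getKey());
      int senseTotal = fc.values().stream().mapToInt(Integer::intValue).sum();
      double score = Math.log((double) e.getValue() / total);
      for (String f : feats) {
        int c = fc.getOrDefault(f, 0);
        score += Math.log((c + 1.0) / (senseTotal + vocab.size()));
      }
      if (score > bestScore) {
        bestScore = score;
        best = e.getKey();
      }
    }
    return best;
  }
}
```

Trained on instances whose features are the cluster ids of the surrounding
words, predict() returns the most probable sense for a new context.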


> Date: Fri, 19 Jun 2015 16:41:20 +0200
> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
> 
> Hello,
> 
> I will dedicate time tonight to get this pulled in the sandbox and will
> then also provide some feedback.
> We can then create new patches against the sandbox to fix further issues.
> 
> Jörn
> 
> On Fri, Jun 19, 2015 at 11:02 AM, Anthony Beylerian <
> anthonybeyler...@hotmail.com> wrote:
> 
> > Thank you for the reply; I am guessing for now we will use the other
> > sources.
> >
> > By the way, I have uploaded a newer patch on the same issue [1].
> > I would like to know whether the approach to setting parameters is
> > acceptable.
> >
> > Also, we reference some model files locally (tokenizer, tagger, etc.)
> > because we need them for the preprocessing chain. For example:
> >
> > ++
> > private static String modelsDir =
> >     "src\\test\\resources\\opennlp\\tools\\disambiguator\\";
> >
> > TokenizerModel tokenizerModel = new TokenizerModel(
> >     new FileInputStream(modelsDir + "en-token.bin"));
> > tokenizer = new TokenizerME(tokenizerModel);
> > ++
> >
> > I thought of adding these files (.bin) to the test folder, but could
> > anyone recommend a more elegant way to do this?
> > Thanks!
> >
> > Anthony
> >
> > [1] : https://issues.apache.org/jira/browse/OPENNLP-758
> >
> >
> > > From: rage...@apache.org
> > > Date: Fri, 19 Jun 2015 10:18:12 +0200
> > > Subject: Re: GSoC 2015 - WSD Module
> > > To: dev@opennlp.apache.org
> > >
> > > Thanks for the update and the updated patch.
> > >
> > > With respect to the licensing of BabelNet, I do not think we can
> > > redistribute CC BY-NC-SA resources here, but others in this project
> > > and Apache in general will probably know better than me.
> > >
> > > Best,
> > >
> > > Rodrigo
  

Re: GSoC 2015 - WSD Module

2015-06-19 Thread Joern Kottmann
Hello,

I will dedicate time tonight to get this pulled in the sandbox and will
then also provide some feedback.
We can then create new patches against the sandbox to fix further issues.

Jörn

On Fri, Jun 19, 2015 at 11:02 AM, Anthony Beylerian <
anthonybeyler...@hotmail.com> wrote:

> Thank you for the reply; I am guessing for now we will use the other
> sources.
>
> By the way, I have uploaded a newer patch on the same issue [1].
> I would like to know whether the approach to setting parameters is
> acceptable.
>
> Also, we reference some model files locally (tokenizer, tagger, etc.)
> because we need them for the preprocessing chain. For example:
>
> ++
> private static String modelsDir =
>     "src\\test\\resources\\opennlp\\tools\\disambiguator\\";
>
> TokenizerModel tokenizerModel = new TokenizerModel(
>     new FileInputStream(modelsDir + "en-token.bin"));
> tokenizer = new TokenizerME(tokenizerModel);
> ++
>
> I thought of adding these files (.bin) to the test folder, but could
> anyone recommend a more elegant way to do this?
> Thanks!
>
> Anthony
>
> [1] : https://issues.apache.org/jira/browse/OPENNLP-758
>
>
> > From: rage...@apache.org
> > Date: Fri, 19 Jun 2015 10:18:12 +0200
> > Subject: Re: GSoC 2015 - WSD Module
> > To: dev@opennlp.apache.org
> >
> > Thanks for the update and the updated patch.
> >
> > With respect to the licensing of BabelNet, I do not think we can
> > redistribute CC BY-NC-SA resources here, but others in this project
> > and Apache in general will probably know better than me.
> >
> > Best,
> >
> > Rodrigo
> >
> > On Sun, Jun 14, 2015 at 2:47 PM, Anthony Beylerian
> >  wrote:
> > > Hi,
> > >
> > > Concerning this point, I would like to ask about BabelNet [1]. The
> > > advantage of [1] is that it integrates WordNet, Wikipedia, Wiktionary,
> > > OmegaWiki, Wikidata, and Open Multi-WordNet.
> > > Also, the newest SemEval task (whose results are just out [2]) relies
> > > on it.
> > >
> > > However, the 2.5.1 version, which can be used locally, follows a CC
> > > BY-NC-SA 3.0 license [3]. I read in [4] that CC-A (Attribution)
> > > licenses are acceptable, but I am not completely sure whether the
> > > NC-SA (Non-commercial/ShareAlike) terms would be prohibitive, since it
> > > was mentioned that:
> > > "Many of these licenses have specific attribution terms that need to
> > > be adhered to, for example CC-A, often by adding them to the NOTICE
> > > file. Ensure you are doing this when including these works. Note, this
> > > list is colloquially known as the Category A list."
> > > I would like your thoughts on the matter.
> > > Thanks!
> > > Anthony
> > >
> > > [1] : http://babelnet.org/download
> > > [2] : http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval049.pdf
> > > [3] : https://creativecommons.org/licenses/by-nc-sa/3.0/
> > > [4] : http://www.apache.org/legal/resolved.html#category-a
> > >
> > >> Date: Fri, 5 Jun 2015 15:09:24 +0200
> > >> Subject: Re: GSoC 2015 - WSD Module
> > >> From: kottm...@gmail.com
> > >> To: dev@opennlp.apache.org
> > >>
> > >> Hello,
> > >>
> > >> Yes, WordNet is fine; we already depend on it. I just think that
> > >> remote resources are particularly problematic.
> > >>
> > >> For local resources it boils down to their license.
> > >>
> > >> Here is the WordNet one:
> > >> http://wordnet.princeton.edu/wordnet/license/
> > >>
> > >> We might even be able to redistribute this here at Apache, which is
> > >> really nice. To do that we have to check with the legal list whether
> > >> they give a green light for it.
> > >>
> > >> You can get more information about licenses and dependencies for
> > >> Apache projects here:
> > >> http://www.apache.org/legal/resolved.html#category-a
> > >> http://www.apache.org/legal/resolved.html#category-b
> > >> http://www.apache.org/legal/resolved.html#category-x
> > >>
> > >> Are the things you have to clean up of a nature that you couldn't do
> > >> after you send in a patch?
> > >> This could be removal of code which can be released under ASL.
> > >>
> > >> We would like to get you integrated into the way we work here as
> > >> quickly as possible.
> > >>
> > >> That includes:
> > >> - Tasks are planned/tracked via jira (this allows other people to
> > >>   comment/follow)
> > >> - We would like to be able to review your code and maybe give some
> > >>   advice (commit often, break things down into tasks)
> > >> - Changes or new features are usually discussed on the dev list
> > >>   (e.g. a short write-up about the approaches you implemented or,
> > >>   better, plan to implement)
> > >>
> > >> Jörn
> > >
> > >
>
>


RE: GSoC 2015 - WSD Module

2015-06-19 Thread Anthony Beylerian
Thank you for the reply; I am guessing for now we will use the other
sources.

By the way, I have uploaded a newer patch on the same issue [1].
I would like to know whether the approach to setting parameters is
acceptable.

Also, we reference some model files locally (tokenizer, tagger, etc.)
because we need them for the preprocessing chain. For example:

++
private static String modelsDir =
    "src\\test\\resources\\opennlp\\tools\\disambiguator\\";

TokenizerModel tokenizerModel = new TokenizerModel(
    new FileInputStream(modelsDir + "en-token.bin"));
tokenizer = new TokenizerME(tokenizerModel);
++

I thought of adding these files (.bin) to the test folder, but could anyone
recommend a more elegant way to do this?
Thanks!

Anthony

[1] : https://issues.apache.org/jira/browse/OPENNLP-758
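
One common alternative to hard-coded, OS-specific paths is loading the
models from the test classpath: with Maven's standard layout, anything under
src/test/resources/ is reachable via Class.getResourceAsStream. A minimal
sketch, assuming the resource package mirrors the directory above; the
OpenNLP wrapping is only indicated in a comment to keep the example
self-contained:

```java
import java.io.IOException;
import java.io.InputStream;

public class ModelLoader {

  // With Maven's standard layout, files under
  // src/test/resources/opennlp/tools/disambiguator/ end up on the test
  // classpath under this name, so no file-system path (and no
  // Windows-only "\\" separators) is needed.
  static final String MODELS = "/opennlp/tools/disambiguator/";

  // Builds the absolute classpath name for a model file.
  static String resource(String name) {
    return MODELS + name;
  }

  // Opens a model file from the classpath, failing loudly if it is absent.
  static InputStream open(String name) throws IOException {
    InputStream in = ModelLoader.class.getResourceAsStream(resource(name));
    if (in == null) {
      throw new IOException(resource(name) + " not found on classpath");
    }
    // In the real test this stream would then be wrapped, e.g.:
    //   tokenizer = new TokenizerME(new TokenizerModel(open("en-token.bin")));
    return in;
  }
}
```

This keeps the tests independent of the working directory and portable
across operating systems.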


> From: rage...@apache.org
> Date: Fri, 19 Jun 2015 10:18:12 +0200
> Subject: Re: GSoC 2015 - WSD Module
> To: dev@opennlp.apache.org
> 
> Thanks for the update and the updated patch.
> 
> With respect to the licensing of BabelNet, I do not think we can
> redistribute CC BY-NC-SA resources here, but others in this project
> and Apache in general will probably know better than me.
> 
> Best,
> 
> Rodrigo
> 
> On Sun, Jun 14, 2015 at 2:47 PM, Anthony Beylerian
>  wrote:
> > Hi,
> >
> > Concerning this point, I would like to ask about BabelNet [1]. The
> > advantage of [1] is that it integrates WordNet, Wikipedia, Wiktionary,
> > OmegaWiki, Wikidata, and Open Multi-WordNet.
> > Also, the newest SemEval task (whose results are just out [2]) relies on it.
> >
> > However, the 2.5.1 version, which can be used locally, follows a CC
> > BY-NC-SA 3.0 license [3]. I read in [4] that CC-A (Attribution) licenses
> > are acceptable, but I am not completely sure whether the NC-SA
> > (Non-commercial/ShareAlike) terms would be prohibitive, since it was
> > mentioned that:
> > "Many of these licenses have specific attribution terms that need to be
> > adhered to, for example CC-A, often by adding them to the NOTICE file.
> > Ensure you are doing this when including these works. Note, this list is
> > colloquially known as the Category A list."
> > I would like your thoughts on the matter.
> > Thanks!
> > Anthony
> >
> > [1] : http://babelnet.org/download
> > [2] : http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval049.pdf
> > [3] : https://creativecommons.org/licenses/by-nc-sa/3.0/
> > [4] : http://www.apache.org/legal/resolved.html#category-a
> >
> >> Date: Fri, 5 Jun 2015 15:09:24 +0200
> >> Subject: Re: GSoC 2015 - WSD Module
> >> From: kottm...@gmail.com
> >> To: dev@opennlp.apache.org
> >>
> >> Hello,
> >>
> >> Yes, WordNet is fine; we already depend on it. I just think that remote
> >> resources are particularly problematic.
> >>
> >> For local resources it boils down to their license.
> >>
> >> Here is the wordnet one:
> >> http://wordnet.princeton.edu/wordnet/license/
> >>
> >> We might even be able to redistribute this here at Apache, which is really
> >> nice. To do that we have to check
> >> with the legal list if they give a green light for it.
> >>
> >> You can get more information about licenses and dependencies for Apache
> >> projects here:
> >> http://www.apache.org/legal/resolved.html#category-a
> >> http://www.apache.org/legal/resolved.html#category-b
> >> http://www.apache.org/legal/resolved.html#category-x
> >>
> >> Are the things you have to clean up of a nature that you couldn't do
> >> after you send in a patch?
> >> This could be removal of code which can be released under ASL.
> >>
> >> We would like to get you integrated into the way we work here as quickly as
> >> possible.
> >>
> >> That includes:
> >> - Tasks are planned/tracked via jira (this allows other people to
> >> comment/follow)
> >> - We would like to be able to review your code and maybe give some advice
> >> (commit often, break things down in tasks)
> >> - Changes or new features are usually discussed on the dev list (e.g. a
> >> short write up about the approaches you implemented
> >>   or better plan to implement)
> >>
> >> Jörn
> >
> >
  

Re: GSoC 2015 - WSD Module

2015-06-19 Thread Rodrigo Agerri
Thanks for the update and the updated patch.

With respect to the licensing of BabelNet, I do not think we can
redistribute CC BY-NC-SA resources here, but others in this project
and Apache in general will probably know better than me.

Best,

Rodrigo

On Sun, Jun 14, 2015 at 2:47 PM, Anthony Beylerian
 wrote:
> Hi,
>
> Concerning this point, I would like to ask about BabelNet [1]. The
> advantage of [1] is that it integrates WordNet, Wikipedia, Wiktionary,
> OmegaWiki, Wikidata, and Open Multi-WordNet.
> Also, the newest SemEval task (whose results are just out [2]) relies on it.
>
> However, the 2.5.1 version, which can be used locally, follows a CC
> BY-NC-SA 3.0 license [3]. I read in [4] that CC-A (Attribution) licenses
> are acceptable, but I am not completely sure whether the NC-SA
> (Non-commercial/ShareAlike) terms would be prohibitive, since it was
> mentioned that:
> "Many of these licenses have specific attribution terms that need to be
> adhered to, for example CC-A, often by adding them to the NOTICE file.
> Ensure you are doing this when including these works. Note, this list is
> colloquially known as the Category A list."
> I would like your thoughts on the matter.
> Thanks!
> Anthony
>
> [1] : http://babelnet.org/download
> [2] : http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval049.pdf
> [3] : https://creativecommons.org/licenses/by-nc-sa/3.0/
> [4] : http://www.apache.org/legal/resolved.html#category-a
>
>> Date: Fri, 5 Jun 2015 15:09:24 +0200
>> Subject: Re: GSoC 2015 - WSD Module
>> From: kottm...@gmail.com
>> To: dev@opennlp.apache.org
>>
>> Hello,
>>
>> Yes, WordNet is fine; we already depend on it. I just think that remote
>> resources are particularly problematic.
>>
>> For local resources it boils down to their license.
>>
>> Here is the wordnet one:
>> http://wordnet.princeton.edu/wordnet/license/
>>
>> We might even be able to redistribute this here at Apache, which is really
>> nice. To do that we have to check
>> with the legal list if they give a green light for it.
>>
>> You can get more information about licenses and dependencies for Apache
>> projects here:
>> http://www.apache.org/legal/resolved.html#category-a
>> http://www.apache.org/legal/resolved.html#category-b
>> http://www.apache.org/legal/resolved.html#category-x
>>
>> Are the things you have to clean up of a nature that you couldn't do
>> after you send in a patch?
>> This could be removal of code which can be released under ASL.
>>
>> We would like to get you integrated into the way we work here as quickly as
>> possible.
>>
>> That includes:
>> - Tasks are planned/tracked via jira (this allows other people to
>> comment/follow)
>> - We would like to be able to review your code and maybe give some advice
>> (commit often, break things down in tasks)
>> - Changes or new features are usually discussed on the dev list (e.g. a
>> short write up about the approaches you implemented
>>   or better plan to implement)
>>
>> Jörn
>
>


RE: GSoC 2015 - WSD Module

2015-06-14 Thread Anthony Beylerian
Hi,

Concerning this point, I would like to ask about BabelNet [1]. The advantage
of [1] is that it integrates WordNet, Wikipedia, Wiktionary, OmegaWiki,
Wikidata, and Open Multi-WordNet.
Also, the newest SemEval task (whose results are just out [2]) relies on it.

However, the 2.5.1 version, which can be used locally, follows a CC BY-NC-SA
3.0 license [3]. I read in [4] that CC-A (Attribution) licenses are
acceptable, but I am not completely sure whether the NC-SA
(Non-commercial/ShareAlike) terms would be prohibitive, since it was
mentioned that:
"Many of these licenses have specific attribution terms that need to be
adhered to, for example CC-A, often by adding them to the NOTICE file.
Ensure you are doing this when including these works. Note, this list is
colloquially known as the Category A list."
I would like your thoughts on the matter.
Thanks!
Anthony

[1] : http://babelnet.org/download
[2] : http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval049.pdf
[3] : https://creativecommons.org/licenses/by-nc-sa/3.0/
[4] : http://www.apache.org/legal/resolved.html#category-a

> Date: Fri, 5 Jun 2015 15:09:24 +0200
> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
> 
> Hello,
> 
> Yes, WordNet is fine; we already depend on it. I just think that remote
> resources are particularly problematic.
> 
> For local resources it boils down to their license.
> 
> Here is the wordnet one:
> http://wordnet.princeton.edu/wordnet/license/
> 
> We might even be able to redistribute this here at Apache, which is really
> nice. To do that we have to check
> with the legal list if they give a green light for it.
> 
> You can get more information about licenses and dependencies for Apache
> projects here:
> http://www.apache.org/legal/resolved.html#category-a
> http://www.apache.org/legal/resolved.html#category-b
> http://www.apache.org/legal/resolved.html#category-x
> 
> Are the things you have to clean up of a nature that you couldn't do
> after you send in a patch?
> This could be removal of code which can be released under ASL.
> 
> We would like to get you integrated into the way we work here as quickly as
> possible.
> 
> That includes:
> - Tasks are planned/tracked via jira (this allows other people to
> comment/follow)
> - We would like to be able to review your code and maybe give some advice
> (commit often, break things down in tasks)
> - Changes or new features are usually discussed on the dev list (e.g. a
> short write up about the approaches you implemented
>   or better plan to implement)
> 
> Jörn

  

RE: GSoC 2015 - WSD Module

2015-06-10 Thread Anthony Beylerian
Hi,

I attached an initial patch to OPENNLP-758.
However, we are currently modifying things a bit since many approaches need
to be supported, but we would like your recommendations.
Here are some notes:

1- We used extJWNL.
2- [WSDisambiguator] is the main interface.
3- [Loader] loads the required resources.
4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
5- [Lesk] has many variants; we have already implemented some, but we are
wondering about the preferred way to switch from one to the other.
As of now we use one of them as the default, but we thought of either making
a parameter list to fill or making separate classes for each, or otherwise
following your preference.
6- The other classes are for convenience.

We will try to patch frequently on the separate issues, following the
feedback.

Best regards,

Anthony
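
For point 5, the "parameter list" option could be sketched as a small
parameters object with defaults, so that a single Lesk class covers all
variants. The names below (LeskParameters, Variant, the individual knobs)
are invented for illustration, not the module's actual API:

```java
// Hypothetical parameters object for configuring a Lesk disambiguator:
// the variant and its tunables are set fluently, with sensible defaults,
// instead of one subclass per Lesk variant.
public class LeskParameters {

  public enum Variant { ORIGINAL, SIMPLIFIED, EXTENDED }

  private Variant variant = Variant.SIMPLIFIED; // default variant
  private int windowSize = 4;        // context words considered on each side
  private boolean useStemming = true; // stem glosses and context before overlap

  public LeskParameters variant(Variant v) { this.variant = v; return this; }

  public LeskParameters windowSize(int n) { this.windowSize = n; return this; }

  public LeskParameters useStemming(boolean b) { this.useStemming = b; return this; }

  public Variant getVariant() { return variant; }

  public int getWindowSize() { return windowSize; }

  public boolean getUseStemming() { return useStemming; }
}
```

A caller would then write something like
new LeskParameters().variant(LeskParameters.Variant.EXTENDED).windowSize(6)
and pass the object to the Lesk implementation, which keeps the public
surface to one class while still making every variant reachable.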

> Date: Wed, 10 Jun 2015 11:42:56 +0200
> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
> 
> You can attach the patch to one of the issues, or you can create a new
> issue. In the end it doesn't matter much; what is important is that we
> make progress here and get the initial code into our repository.
> Subsequent changes can then be done in a patch series.
> 
> Please try to submit the patch as quickly as possible.
> 
> Jörn
> 
> On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri  wrote:
> 
> > Hello,
> >
> > On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
> >  wrote:
> > > Dear Rodrigo,
> > >
> > > As Anthony mentioned in his previous email, I already started the
> > > implementation of the IMS approach. The pre-processing and the extraction
> > > of features are already finished. Regarding the approach itself, it
> > > shows some potential according to the author, though the proposed
> > > features are few and basic.
> >
> > Hi, yes, the features are not that complex, but it is good to have a
> > working system and then if needed the feature set can be
> > improved/enriched. As stated in the paper, the IMS approach leverages
> > parallel data to obtain state of the art results in both lexical
> > sample and all words for senseval 3 and semeval 2007 datasets.
> >
> > I think it will be nice to have a working system with this algorithm
> > as part of the WSD component in OpenNLP (following the API discussion
> > previous in this thread) and perform some evaluations to know where
> > the system is with respect to state of the art results in those
> > datasets. Once this is operative, I think it will be a good moment to
> > start discussing additional/better features.
> >
> > > I think the approach itself might be
> > > enhanced if we add more context specific features from some other
> > > approaches... (To do that, I need to run many experiments using different
> > > combinations of features, however, that should not be a problem).
> >
> > Speaking about the feature sets, in the API google doc I have not seen
> > anything about the implementation of the feature extractors, could you
> > perhaps provide some extra info (in that same document, for example)
> > about that?
> >
> > > But the approach itself requires a linear SVM classifier, and as far as I
> > > know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
> > libsvm
> > > ?
> >
> > I think you can try with a MaxEnt to start with and in the meantime,
> > @Jörn has commented sometimes that there is a plugin component in
> > OpenNLP to use third-party ML libraries and that he tested it with
> > Mallet. Perhaps he could comment on this to use that functionality to
> > use SVMs.
> >
> > >
> > > Regarding the training data, I started collecting some from different
> > > sources. Most of the existing rich corpora are licensed (Including the
> > ones
> > > mentioned in the paper). The free ones I got for now are from the
> > Senseval
> > > and Semeval websites. However, these are used just to evaluate the
> > proposed
> > > methods in the workshops. Therefore, the words to disambiguate are few in
> > > number though the training data for each word are rich enough.
> > >
> > > In any case, the first tests with Senseval and Semeval collected should
> > be
> > > finished soon. However, I am not sure if there is a rich enough Dataset
> > we
> > > can use to make our model for the WSD module in the OpenNLP library.
> > > If you have any recommendation, I would be grateful if you can help me on
> > > this point.
> >
> > Well, as I said in my previous email, research around "word senses" is
> > moving from WSD towards Supersense tagging where there are recent
> > papers and freely available tweet datasets, for example.

Re: GSoC 2015 - WSD Module

2015-06-10 Thread Joern Kottmann
You can attach the patch to one of the issues, or you can create a new issue.
In the end it doesn't matter much, but what is important is that we make progress
here and get the initial code into our repository. Subsequent changes can
then be done in a patch series.

Please try to submit the patch as quickly as possible.

Jörn

On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri  wrote:

> Hello,
>
> On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
>  wrote:
> > Dear Rodrigo,
> >
> > As Anthony mentioned in his previous email, I already started the
> > implementation of the IMS approach. The pre-processing and the extraction
> > of features have already been finished. Regarding the approach itself, it
> > shows some potential according to the author though the features proposed
> > are not so many, and are basic.
>
> Hi, yes, the features are not that complex, but it is good to have a
> working system and then if needed the feature set can be
> improved/enriched. As stated in the paper, the IMS approach leverages
> parallel data to obtain state of the art results in both lexical
> sample and all words for senseval 3 and semeval 2007 datasets.
>
> I think it will be nice to have a working system with this algorithm
> as part of the WSD component in OpenNLP (following the API discussion
> previous in this thread) and perform some evaluations to know where
> the system is with respect to state of the art results in those
> datasets. Once this is operative, I think it will be a good moment to
> start discussing additional/better features.
>
> > I think the approach itself might be
> > enhanced if we add more context specific features from some other
> > approaches... (To do that, I need to run many experiments using different
> > combinations of features, however, that should not be a problem).
>
> Speaking about the feature sets, in the API google doc I have not seen
> anything about the implementation of the feature extractors, could you
> perhaps provide some extra info (in that same document, for example)
> about that?
>
> > But the approach itself requires a linear SVM classifier, and as far as I
> > know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
> libsvm
> > ?
>
> I think you can try with a MaxEnt to start with and in the meantime,
> @Jörn has commented sometimes that there is a plugin component in
> OpenNLP to use third-party ML libraries and that he tested it with
> Mallet. Perhaps he could comment on this to use that functionality to
> use SVMs.
>
> >
> > Regarding the training data, I started collecting some from different
> > sources. Most of the existing rich corpora are licensed (Including the
> ones
> > mentioned in the paper). The free ones I got for now are from the
> Senseval
> > and Semeval websites. However, these are used just to evaluate the
> proposed
> > methods in the workshops. Therefore, the words to disambiguate are few in
> > number though the training data for each word are rich enough.
> >
> > In any case, the first tests with Senseval and Semeval collected should
> be
> > finished soon. However, I am not sure if there is a rich enough Dataset
> we
> > can use to make our model for the WSD module in the OpenNLP library.
> > If you have any recommendation, I would be grateful if you can help me on
> > this point.
>
> Well, as I said in my previous email, research around "word senses" is
> moving from WSD towards Supersense tagging where there are recent
> papers and freely available tweet datasets, for example. In any case,
> we can look more into it but in the meantime the Semcor for training
> and senseval/semeval2007 datasets for evaluation should be enough to
> compare your system with the literature.
>
> >
> > As Jörn mentioned sending an initial patch, should we separate our codes
> > and upload two different patches to the two issues we created on the Jira
> > (however, this means a lot of redundancy in the code), or shall we keep
> > them in one project and upload it? If we opt for the latter case, which
> > issue should we upload the patch to ?
>
> In my opinion, it should be the same patch and same Component with
> different algorithm implementations within it. Any other opinions?
>
> Cheers,
>
> Rodrigo
>


Re: GSoC 2015 - WSD Module

2015-06-08 Thread Rodrigo Agerri
Hello,

On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
 wrote:
> Dear Rodrigo,
>
> As Anthony mentioned in his previous email, I already started the
> implementation of the IMS approach. The pre-processing and the extraction
> of features have already been finished. Regarding the approach itself, it
> shows some potential according to the author though the features proposed
> are not so many, and are basic.

Hi, yes, the features are not that complex, but it is good to have a
working system and then if needed the feature set can be
improved/enriched. As stated in the paper, the IMS approach leverages
parallel data to obtain state of the art results in both lexical
sample and all words for senseval 3 and semeval 2007 datasets.

I think it will be nice to have a working system with this algorithm
as part of the WSD component in OpenNLP (following the API discussion
previous in this thread) and perform some evaluations to know where
the system is with respect to state of the art results in those
datasets. Once this is operative, I think it will be a good moment to
start discussing additional/better features.

> I think the approach itself might be
> enhanced if we add more context specific features from some other
> approaches... (To do that, I need to run many experiments using different
> combinations of features, however, that should not be a problem).

Speaking about the feature sets, in the API google doc I have not seen
anything about the implementation of the feature extractors, could you
perhaps provide some extra info (in that same document, for example)
about that?

> But the approach itself requires a linear SVM classifier, and as far as I
> know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use libsvm
> ?

I think you can try with a MaxEnt to start with and in the meantime,
@Jörn has commented sometimes that there is a plugin component in
OpenNLP to use third-party ML libraries and that he tested it with
Mallet. Perhaps he could comment on this to use that functionality to
use SVMs.
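The plugin idea mentioned here can be illustrated with a tiny, purely hypothetical abstraction: if the IMS-style code only sees a small classifier interface, MaxEnt, libsvm or Mallet backends become interchangeable. This is a sketch of the design choice, not OpenNLP's actual machine-learning plugin API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (names invented) of hiding the learner behind an
// interface so the disambiguator never depends on a concrete ML library.
public class ClassifierPluginSketch {

    /** What the WSD code needs from any backend (MaxEnt, SVM, ...). */
    public interface SenseClassifier {
        String classify(String[] features);
    }

    /** Trivial stand-in backend: predicts the most frequent training sense. */
    public static class MajoritySenseClassifier implements SenseClassifier {
        private final String majoritySense;

        public MajoritySenseClassifier(String[] trainingSenses) {
            Map<String, Integer> counts = new HashMap<>();
            for (String sense : trainingSenses) {
                counts.merge(sense, 1, Integer::sum);
            }
            String best = null;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (best == null || e.getValue() > counts.get(best)) {
                    best = e.getKey();
                }
            }
            this.majoritySense = best;
        }

        @Override
        public String classify(String[] features) {
            return majoritySense; // ignores features; a real backend would not
        }
    }

    public static void main(String[] args) {
        SenseClassifier c = new MajoritySenseClassifier(
                new String[]{"bank%1", "bank%1", "bank%2"});
        System.out.println(c.classify(new String[]{"w-1=the", "w+1=account"}));
    }
}
```

Swapping in an SVM then only means providing another `SenseClassifier` implementation; the feature extraction stays untouched.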

>
> Regarding the training data, I started collecting some from different
> sources. Most of the existing rich corpora are licensed (Including the ones
> mentioned in the paper). The free ones I got for now are from the Senseval
> and Semeval websites. However, these are used just to evaluate the proposed
> methods in the workshops. Therefore, the words to disambiguate are few in
> number though the training data for each word are rich enough.
>
> In any case, the first tests with Senseval and Semeval collected should be
> finished soon. However, I am not sure if there is a rich enough Dataset we
> can use to make our model for the WSD module in the OpenNLP library.
> If you have any recommendation, I would be grateful if you can help me on
> this point.

Well, as I said in my previous email, research around "word senses" is
moving from WSD towards Supersense tagging where there are recent
papers and freely available tweet datasets, for example. In any case,
we can look more into it but in the meantime the Semcor for training
and senseval/semeval2007 datasets for evaluation should be enough to
compare your system with the literature.

>
> As Jörn mentioned sending an initial patch, should we separate our codes
> and upload two different patches to the two issues we created on the Jira
> (however, this means a lot of redundancy in the code), or shall we keep
> them in one project and upload it? If we opt for the latter case, which
> issue should we upload the patch to ?

In my opinion, it should be the same patch and same Component with
different algorithm implementations within it. Any other opinions?

Cheers,

Rodrigo


Re: GSoC 2015 - WSD Module

2015-06-08 Thread Mondher Bouazizi
Dear Rodrigo,

As Anthony mentioned in his previous email, I already started the
implementation of the IMS approach. The pre-processing and the extraction
of features have already been finished. Regarding the approach itself, it
shows some potential according to the author though the features proposed
are not so many, and are basic. I think the approach itself might be
enhanced if we add more context specific features from some other
approaches... (To do that, I need to run many experiments using different
combinations of features, however, that should not be a problem).
But the approach itself requires a linear SVM classifier, and as far as I
know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use libsvm
?

Regarding the training data, I started collecting some from different
sources. Most of the existing rich corpora are licensed (Including the ones
mentioned in the paper). The free ones I got for now are from the Senseval
and Semeval websites. However, these are used just to evaluate the proposed
methods in the workshops. Therefore, the words to disambiguate are few in
number though the training data for each word are rich enough.

In any case, the first tests with Senseval and Semeval collected should be
finished soon. However, I am not sure if there is a rich enough Dataset we
can use to make our model for the WSD module in the OpenNLP library.
If you have any recommendation, I would be grateful if you can help me on
this point.

On the other hand, we're cleaning our implementation of the different
variations of Lesk. However, we are currently using JWNL. If there are no
objections, we will migrate to extJWNL.

As Jörn mentioned sending an initial patch, should we separate our codes
and upload two different patches to the two issues we created on the Jira
(however, this means a lot of redundancy in the code), or shall we keep
them in one project and upload it? If we opt for the latter case, which
issue should we upload the patch to ?

Thanks,

Mondher, Anthony

On Mon, Jun 8, 2015 at 7:51 PM, Rodrigo Agerri  wrote:

> Hello,
>
> +1 for using extJWNL instead of JWNL, I use it in some other projects
> too and it is very nice IMHO.
>
> R
>
> On Sat, Jun 6, 2015 at 12:55 PM, Aliaksandr Autayeu
>  wrote:
> > Thinking of impartiality... Anyway, I'm the author of extJWNL in case you
> > have questions.
> >
> > Aliaksandr
> >
> > On 6 June 2015 at 11:43, Richard Eckart de Castilho <
> > richard.eck...@gmail.com> wrote:
> >
> >> On 05.06.2015, at 14:24, Anthony Beylerian <
> anthonybeyler...@hotmail.com>
> >> wrote:
> >>
> >> > So just to make sure, we are currently relying on JWNL to access
> WordNet
> >> as a resource.
> >>
> >> There is a more modern fork of JWNL available called
> >> http://extjwnl.sourceforge.net .
> >> It includes provisions of loading WordNet from the classpath, e.g.
> >> from Maven dependencies. It might be a nice replacement for JWNL and is
> >> also licensed
> >> under the BSD license. Pre-packaged WordNet Maven artifacts are also
> >> available.
> >>
> >> Cheers,
> >>
> >> -- Richard
>


Re: GSoC 2015 - WSD Module

2015-06-08 Thread Rodrigo Agerri
Hello,

+1 for using extJWNL instead of JWNL, I use it in some other projects
too and it is very nice IMHO.

R

On Sat, Jun 6, 2015 at 12:55 PM, Aliaksandr Autayeu
 wrote:
> Thinking of impartiality... Anyway, I'm the author of extJWNL in case you
> have questions.
>
> Aliaksandr
>
> On 6 June 2015 at 11:43, Richard Eckart de Castilho <
> richard.eck...@gmail.com> wrote:
>
>> On 05.06.2015, at 14:24, Anthony Beylerian 
>> wrote:
>>
>> > So just to make sure, we are currently relying on JWNL to access WordNet
>> as a resource.
>>
>> There is a more modern fork of JWNL available called
>> http://extjwnl.sourceforge.net .
>> It includes provisions of loading WordNet from the classpath, e.g.
>> from Maven dependencies. It might be a nice replacement for JWNL and is
>> also licensed
>> under the BSD license. Pre-packaged WordNet Maven artifacts are also
>> available.
>>
>> Cheers,
>>
>> -- Richard


Re: GSoC 2015 - WSD Module

2015-06-06 Thread Aliaksandr Autayeu
Thinking of impartiality... Anyway, I'm the author of extJWNL in case you
have questions.

Aliaksandr

On 6 June 2015 at 11:43, Richard Eckart de Castilho <
richard.eck...@gmail.com> wrote:

> On 05.06.2015, at 14:24, Anthony Beylerian 
> wrote:
>
> > So just to make sure, we are currently relying on JWNL to access WordNet
> as a resource.
>
> There is a more modern fork of JWNL available called
> http://extjwnl.sourceforge.net .
> It includes provisions of loading WordNet from the classpath, e.g.
> from Maven dependencies. It might be a nice replacement for JWNL and is
> also licensed
> under the BSD license. Pre-packaged WordNet Maven artifacts are also
> available.
>
> Cheers,
>
> -- Richard


Re: GSoC 2015 - WSD Module

2015-06-06 Thread Richard Eckart de Castilho
On 05.06.2015, at 14:24, Anthony Beylerian  wrote:

> So just to make sure, we are currently relying on JWNL to access WordNet as a 
> resource. 

There is a more modern fork of JWNL available called 
http://extjwnl.sourceforge.net .
It includes provisions of loading WordNet from the classpath, e.g.
from Maven dependencies. It might be a nice replacement for JWNL and is also 
licensed
under the BSD license. Pre-packaged WordNet Maven artifacts are also 
available.

Cheers,

-- Richard
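For reference, pulling extJWNL plus a pre-packaged WordNet data artifact from Maven might look like the fragment below. The coordinates are from memory and should be verified against Maven Central before use.

```xml
<!-- extJWNL library -->
<dependency>
  <groupId>net.sf.extjwnl</groupId>
  <artifactId>extjwnl</artifactId>
  <version>1.8.0</version>
</dependency>
<!-- Pre-packaged WordNet data, loadable from the classpath -->
<dependency>
  <groupId>net.sf.extjwnl</groupId>
  <artifactId>extjwnl-data-wn31</artifactId>
  <version>1.2</version>
</dependency>
```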

Re: GSoC 2015 - WSD Module

2015-06-05 Thread Joern Kottmann
Hello,

yes, WordNet is fine, we already depend on it. I just think that remote
resources are particularly problematic.

For local resources it boils down to their license.

Here is the wordnet one:
http://wordnet.princeton.edu/wordnet/license/

We might even be able to redistribute this here at Apache, which is really
nice. To do that we have to check
with the legal list if they give a green light for it.

You can get more information about licenses and dependencies for Apache
projects here:
http://www.apache.org/legal/resolved.html#category-a
http://www.apache.org/legal/resolved.html#category-b
http://www.apache.org/legal/resolved.html#category-x

Are the things you have to clean up of such a nature that you couldn't do the
cleanup after you send in a patch?
This could be, for example, the removal of code which cannot be released under
the ASL.

We would like to get you integrated into the way we work here as quickly as
possible.

That includes:
- Tasks are planned/tracked via jira (this allows other people to
comment/follow)
- We would like to be able to review your code and maybe give some advice
(commit often, break things down in tasks)
- Changes or new features are usually discussed on the dev list (e.g. a
short write up about the approaches you implemented,
  or better, plan to implement)

Jörn




On Fri, Jun 5, 2015 at 2:24 PM, Anthony Beylerian <
anthonybeyler...@hotmail.com> wrote:

> Hi,
>
> We understand the issues.
>
> So just to make sure, we are currently relying on JWNL to access WordNet
> as a resource. Is that fine for now ?
>
> In case we need to avoid such dependencies,  would it be ok to create a
> resource file that includes what we need extracted from it or also from
> other resources combined (sense inventory, word relationships and so on) ?
> We'd like your recommendation.
>
> Also we are currently cleaning up the project and will upload a patch.
> To sum up, we have already implemented the Lesk approach, as well as parts
> of the supervised IMS approach (preprocessing, feature extraction).
> Next, we will implement the baseline techniques and collect the training
> data that will be used by supervised approaches.
> Files will be collected from different sources and will be unified in a
> single model file.
> Best regards,
>
> Anthony, Mondher
>
>
> > Date: Wed, 3 Jun 2015 16:47:50 +0200
> > Subject: Re: GSoC 2015 - WSD Module
> > From: kottm...@gmail.com
> > To: dev@opennlp.apache.org
> >
> > We should not use remote resources. A remote service adds severe limits
> to
> > the WSD component. A remote resource will be slow to query (compared to
> > disk or memory), queries might be expensive (pay per request), the
> license
> > might not allow usage in a way the ASL promises to our users. Another
> issue
> > is that calling a remote service might leak the document text itself to
> > that remote service.
> >
> > Please attach a patch to the jira issue, and then we can pull it into the
> > sandbox.
> >
> > Jörn
> >
> >
> >
> >
> >
> > On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian <
> > anthonybeyler...@hotmail.com> wrote:
> >
> > > Dear Jörn,
> > >
> > > Thank you for the reply.
> > > ===
> > > Yes in the draft WSDisambiguator is the main interface.
> > > ===
> > > Yes for the disambiguate method the input is expected to be tokenized,
> it
> > > should be an input array.
> > > The second argument is for the token index.  We can also make it into
> an
> > > index array to support multiple words.
> > > ===
> > > Concerning the resources, we expect two types of resources : local and
> > > remote resources.
> > >
> > > + For local resources, we have two main types :
> > > 1- training models for supervised techniques.
> > > 2- knowledge resources
> > >
> > > It could be best to make the packaging using similar OpenNLP models
> for #1.
> > > As for #2, it will depend on what we want to use,  since the type of
> > > information depends on the specific technique.
> > >
> > > + As for remote resources ex: [BabelNet], [WordsAPI], etc. we might need
> > > to have some REST support, for example to retrieve a sense inventory for a
> > > certain word. Actually, the newest semeval task [Semeval15] will use
> > > [BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline
> > > version, but the newest one is only available through REST. Also, in case it
> > > is needed to use a remote resource, AND it typically requires a license, we
> > > need to use a license key or just use the free quota with no key.

RE: GSoC 2015 - WSD Module

2015-06-05 Thread Anthony Beylerian
Hi,

We understand the issues.

So just to make sure, we are currently relying on JWNL to access WordNet as a 
resource. Is that fine for now ? 

In case we need to avoid such dependencies, would it be OK to create a 
resource file that includes what we need, extracted from it or from other 
resources combined (sense inventory, word relationships, and so on)?
We'd like your recommendation.

Also we are currently cleaning up the project and will upload a patch.
To sum up, we have already implemented the Lesk approach, as well as parts of 
the supervised IMS approach (preprocessing, feature extraction).
Next, we will implement the baseline techniques and collect the training data 
that will be used by supervised approaches.
Files will be collected from different sources and will be unified in a single 
model file.
Best regards,

Anthony, Mondher


> Date: Wed, 3 Jun 2015 16:47:50 +0200
> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
> 
> We should not use remote resources. A remote service adds severe limits to
> the WSD component. A remote resource will be slow to query (compared to
> disk or memory), queries might be expensive (pay per request), the license
> might not allow usage in a way the ASL promises to our users. Another issue
> is that calling a remote service might leak the document text itself to
> that remote service.
> 
> Please attach a patch to the jira issue, and then we can pull it into the
> sandbox.
> 
> Jörn
> 
> 
> 
> 
> 
> On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian <
> anthonybeyler...@hotmail.com> wrote:
> 
> > Dear Jörn,
> >
> > Thank you for the reply.
> > ===
> > Yes in the draft WSDisambiguator is the main interface.
> > ===
> > Yes for the disambiguate method the input is expected to be tokenized, it
> > should be an input array.
> > The second argument is for the token index.  We can also make it into an
> > index array to support multiple words.
> > ===
> > Concerning the resources, we expect two types of resources : local and
> > remote resources.
> >
> > + For local resources, we have two main types :
> > 1- training models for supervised techniques.
> > 2- knowledge resources
> >
> > It could be best to make the packaging using similar OpenNLP models for #1.
> > As for #2, it will depend on what we want to use,  since the type of
> > information depends on the specific technique.
> >
> > + As for remote resources ex: [BabelNet], [WordsAPI], etc. we might need
> > to have some REST support, for example to retrieve a sense inventory for a
> > certain word. Actually, the newest semeval task [Semeval15] will use
> > [BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline
> > version, but the newest one is only available through REST. Also, in case it
> > is needed to use a remote resource, AND it typically requires a license, we
> > need to use a license key or just use the free quota with no key.
> >
> > Therefore, we thought of having a [ResourceProvider] as mentioned in the
> > [draft].
> > Are there any plans to add an external API connector of the sort or is
> > this functionality already possible for extension ?
> > (I noticed there is a [wikinews_importer] in the sandbox)
> >
> > But in any case we can always start working only locally as a first step,
> > what do you think ?
> > ===
> > It would be more straightforward to use the algorithm names, so ok why not.
> > ===
> > Yes we have already started working !
> > What do we need to push to the sandbox ?
> > ===========
> >
> > Thanks !
> >
> > Anthony
> >
> > [BabelNet] : http://babelnet.org/download
> > [WordsAPI] : https://www.wordsapi.com/
> > [Semeval15] : http://alt.qcri.org/semeval2015/task13/
> > [draft] :
> > https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1
> >
> >
> > > Subject: Re: GSoC 2015 - WSD Module
> > > From: kottm...@gmail.com
> > > To: dev@opennlp.apache.org
> > > Date: Mon, 1 Jun 2015 20:30:08 +0200
> > >
> > > Hello,
> > >
> > > I had a look at your APIs.
> > >
> > > Let's start with the WSDisambiguator. Should that be an interface?
> > >
> > > // returns the senses ordered by their score (best one first or only 1
> > > in supervised case)
> > > String[] disambiguate(String inputText, int inputWordposition);

Re: GSoC 2015 - WSD Module

2015-06-03 Thread Joern Kottmann
We should not use remote resources. A remote service adds severe limits to
the WSD component. A remote resource will be slow to query (compared to
disk or memory), queries might be expensive (pay per request), the license
might not allow usage in a way the ASL promises to our users. Another issue
is that calling a remote service might leak the document text itself to
that remote service.

Please attach a patch to the jira issue, and then we can pull it into the
sandbox.

Jörn





On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian <
anthonybeyler...@hotmail.com> wrote:

> Dear Jörn,
>
> Thank you for the reply.
> ===
> Yes in the draft WSDisambiguator is the main interface.
> ===
> Yes for the disambiguate method the input is expected to be tokenized, it
> should be an input array.
> The second argument is for the token index.  We can also make it into an
> index array to support multiple words.
> ===
> Concerning the resources, we expect two types of resources : local and
> remote resources.
>
> + For local resources, we have two main types :
> 1- training models for supervised techniques.
> 2- knowledge resources
>
> It could be best to make the packaging using similar OpenNLP models for #1.
> As for #2, it will depend on what we want to use,  since the type of
> information depends on the specific technique.
>
> + As for remote resources ex: [BabelNet], [WordsAPI], etc. we might need
> to have some REST support, for example to retrieve a sense inventory for a
> certain word. Actually, the newest semeval task [Semeval15] will use
> [BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline
> version, but the newest one is only available through REST. Also, in case it
> is needed to use a remote resource, AND it typically requires a license, we
> need to use a license key or just use the free quota with no key.
>
> Therefore, we thought of having a [ResourceProvider] as mentioned in the
> [draft].
> Are there any plans to add an external API connector of the sort or is
> this functionality already possible for extension ?
> (I noticed there is a [wikinews_importer] in the sandbox)
>
> But in any case we can always start working only locally as a first step,
> what do you think ?
> ===
> It would be more straightforward to use the algorithm names, so ok why not.
> ===
> Yes we have already started working !
> What do we need to push to the sandbox ?
> ===
>
> Thanks !
>
> Anthony
>
> [BabelNet] : http://babelnet.org/download
> [WordsAPI] : https://www.wordsapi.com/
> [Semeval15] : http://alt.qcri.org/semeval2015/task13/
> [draft] :
> https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1
>
>
> > Subject: Re: GSoC 2015 - WSD Module
> > From: kottm...@gmail.com
> > To: dev@opennlp.apache.org
> > Date: Mon, 1 Jun 2015 20:30:08 +0200
> >
> > Hello,
> >
> > I had a look at your APIs.
> >
> > Let's start with the WSDisambiguator. Should that be an interface?
> >
> > // returns the senses ordered by their score (best one first or only 1
> > in supervised case)
> > String[] disambiguate(String inputText,int inputWordposition);
> >
> > Shouldn't we have a tokenized input? Or is the inputText a token?
> >
> > If you have resources you could package those into OpenNLP models and
> > use the existing serialization support. Would that work for you?
> >
> > I think we should have different implementing classes for different
> > algorithms rather than grouping that in the Supervised and Unsupervised
> > classes. And also use the algorithm / approach name as part of the class
> > name.
> >
> > As far as I understand you already started to work on this. Should we do an
> > initial code drop into the sandbox, and then work out things from there?
> > We strongly prefer to have as much as possible source code editing
> > history in our version control system.
> >
> > Jörn
>
>


RE: GSoC 2015 - WSD Module

2015-06-03 Thread Anthony Beylerian
Dear Jörn,

Thank you for the reply.
===
Yes in the draft WSDisambiguator is the main interface.
===
Yes for the disambiguate method the input is expected to be tokenized, it 
should be an input array.
The second argument is for the token index.  We can also make it into an index 
array to support multiple words.
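Concretely, the shape under discussion here (tokenized input, a target token index, and optionally an index array for several targets) could look like the sketch below. The names and the default-method overload are assumptions for illustration, not the agreed API.

```java
// Sketch only: one possible shape for the interface being discussed.
public interface WSDisambiguator {

    // Returns the candidate senses for the token at tokenIndex, ordered by
    // score (best first, or a single sense in the supervised case).
    String[] disambiguate(String[] tokenizedContext, int tokenIndex);

    // Convenience overload: disambiguate several target tokens in one call.
    // Result row i corresponds to tokenIndices[i].
    default String[][] disambiguate(String[] tokenizedContext, int[] tokenIndices) {
        String[][] senses = new String[tokenIndices.length][];
        for (int i = 0; i < tokenIndices.length; i++) {
            senses[i] = disambiguate(tokenizedContext, tokenIndices[i]);
        }
        return senses;
    }
}
```

With this shape, the single-token method stays the primitive every implementation provides, and evaluation code gets the multi-token form for free.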
===
Concerning the resources, we expect two types of resources : local and remote 
resources.

+ For local resources, we have two main types :
1- training models for supervised techniques.
2- knowledge resources 

It could be best to make the packaging using similar OpenNLP models for #1.
As for #2, it will depend on what we want to use,  since the type of 
information depends on the specific technique.

+ As for remote resources ex: [BabelNet], [WordsAPI], etc. we might need to 
have some REST support, for example to retrieve a sense inventory for a certain 
word. Actually, the newest semeval task [Semeval15] will use [BabelNet] for WSD 
and EL (Entity Linking). [BabelNet] has an offline version, but the newest one 
is only available through REST. Also, in case it is needed to use a remote 
resource, AND it typically requires a license, we need to use a license key or 
just use the free quota with no key.

Therefore, we thought of having a [ResourceProvider] as mentioned in the 
[draft]. 
Are there any plans to add an external API connector of the sort or is this 
functionality already possible for extension ?
(I noticed there is a [wikinews_importer] in the sandbox)

But in any case we can always start working only locally as a first step, what 
do you think ?
===
It would be more straightforward to use the algorithm names, so ok why not.
===
Yes we have already started working !
What do we need to push to the sandbox ?
===

Thanks !

Anthony 

[BabelNet] : http://babelnet.org/download
[WordsAPI] : https://www.wordsapi.com/
[Semeval15] : http://alt.qcri.org/semeval2015/task13/
[draft] : 
https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1


> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
> Date: Mon, 1 Jun 2015 20:30:08 +0200
> 
> Hello,
> 
> I had a look at your APIs.
> 
> Let's start with the WSDisambiguator. Should that be an interface?
> 
> // returns the senses ordered by their score (best one first or only 1
> in supervised case)
> String[] disambiguate(String inputText,int inputWordposition);
> 
> Shouldn't we have a tokenized input? Or is the inputText a token?
> 
> If you have resources you could package those into OpenNLP models and
> use the existing serialization support. Would that work for you?
> 
> I think we should have different implementing classes for different
> algorithms rather than grouping that in the Supervised and Unsupervised
> classes. And also use the algorithm / approach name as part of the class
> name.
> 
> > As far as I understand you already started to work on this. Should we do an
> > initial code drop into the sandbox, and then work out things from there?
> We strongly prefer to have as much as possible source code editing
> history in our version control system.
> 
> Jörn 
  

Re: GSoC 2015 - WSD Module

2015-06-01 Thread Joern Kottmann
Hello,

I had a look at your APIs.

Let's start with the WSDisambiguator. Should that be an interface?

// Returns the senses ordered by their score (best one first, or only one
// in the supervised case)
String[] disambiguate(String inputText, int inputWordposition);

Shouldn't we have a tokenized input? Or is the inputText a token?

If you have resources you could package those into OpenNLP models and
use the existing serialization support. Would that work for you?

I think we should have different implementing classes for different
algorithms rather than grouping that in the Supervised and Unsupervised
classes. And also use the algorithm / approach name as part of the class
name.
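A rough sketch of what this could look like: a common interface with tokenized input, with one implementing class per algorithm, named after the approach. All names and the placeholder logic below are illustrative only, not an agreed API.

```java
public class WSDSketch {

    // Hypothetical common interface with tokenized input, instead of a raw
    // input string; one implementing class per algorithm.
    interface WSDisambiguator {
        // Returns candidate senses ordered by score, best first (a single
        // element in the supervised case).
        String[] disambiguate(String[] tokens, int tokenIndex);
    }

    // Trivial stand-in implementation; a real LeskDisambiguator or
    // IMSDisambiguator class would follow the same contract.
    static class MostFrequentSenseDisambiguator implements WSDisambiguator {
        @Override
        public String[] disambiguate(String[] tokens, int tokenIndex) {
            // Placeholder logic: tag the token with a dummy first sense.
            return new String[] { tokens[tokenIndex].toLowerCase() + "#1" };
        }
    }

    public static void main(String[] args) {
        WSDisambiguator wsd = new MostFrequentSenseDisambiguator();
        String[] senses = wsd.disambiguate(
                new String[] {"He", "sat", "on", "the", "bank"}, 4);
        System.out.println(senses[0]); // bank#1
    }
}
```

With this shape, evaluation code can be written once against the interface and reused for every algorithm.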

As far as I understand, you already started to work on this. Should we do an
initial code drop into the sandbox, and then work out things from there?
We strongly prefer to have as much of the source code editing history as
possible in our version control system.

Jörn 


RE: GSoC 2015 - WSD Module

2015-05-22 Thread Anthony Beylerian
Hello,

Thank you for the feedback.

Please use this link to access a quick draft of the interface :
https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1

I believe the previously mentioned link was not allowing for document updates.

As for the common interface: since supervised methods rely on classifiers, they 
will need to load/save training models, so we will need to separate the two, 
maybe as in the draft.
However, we could keep a parent class with a common [disambiguate] method that 
can be used for evaluation and other tasks.

Thanks!

Anthony




Re: GSoC 2015 - WSD Module

2015-05-22 Thread Mondher Bouazizi
Hi all,

Thanks Rodrigo for the feedback.
I don't mind starting with an IMS implementation as a first supervised
solution.
It seems to be a good first step.
As for SST, I will read more about it and let you know.

On the other hand, how about the following interface Anthony and I
prepared based on Jörn's recommendation?
We tried to be as close as possible to the other tools already implemented.

Link :
https://drive.google.com/file/d/0B7ON7bq1zRm3NTI1bGFfc3lZX0U/view?usp=sharing

Best regards,

Mondher, Anthony




Re: GSoC 2015 - WSD Module

2015-05-22 Thread Rodrigo Agerri
Hello Mondher (my response is about supervised WSD),

Thanks for the info, it is quite interesting. Apart from the comment
by Jörn, which I think is very important if we want to achieve
something given the time constraints of the GSoC, I have a couple of
recommendations/comments on my part:

1. Rather than targeting the lexical sample task or all-words WSD, I think
it could be more practical to choose one approach/algorithm and try to
implement it in OpenNLP. One of the most popular (if not the most popular)
approaches is the "It Makes Sense" (IMS) system:

http://www.comp.nus.edu.sg/~nlp/sw/README.txt
https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf

That, I think, is achievable in the GSoC time frame.
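For reference, IMS builds three families of features around each target token (surrounding words, POS tags of neighbouring words, and local collocations) and feeds them to a linear classifier per target lemma. A simplified, illustrative sketch of such feature extraction (the exact window sizes and feature names here are assumptions, not IMS's actual configuration):

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of IMS-style features for one target token: surrounding
// words, POS tags of immediate neighbours, and one local collocation.
// Simplified and illustrative; IMS feeds such features to a linear SVM.
public class IMSFeatureSketch {

    static List<String> extract(String[] tokens, String[] posTags, int target) {
        List<String> feats = new ArrayList<>();
        // Surrounding words in a +/-2 window (excluding the target itself).
        for (int i = Math.max(0, target - 2);
                 i <= Math.min(tokens.length - 1, target + 2); i++) {
            if (i != target) {
                feats.add("W_" + (i - target) + "=" + tokens[i].toLowerCase());
            }
        }
        // POS tags of the immediate neighbours.
        if (target > 0) feats.add("P_-1=" + posTags[target - 1]);
        if (target + 1 < posTags.length) feats.add("P_+1=" + posTags[target + 1]);
        // A local collocation: previous word joined with the target word.
        if (target > 0) {
            feats.add("C_-1,0=" + tokens[target - 1].toLowerCase()
                    + "_" + tokens[target].toLowerCase());
        }
        return feats;
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "sat", "on", "the", "bank"};
        String[] pos = {"PRP", "VBD", "IN", "DT", "NN"};
        System.out.println(extract(tokens, pos, 4));
    }
}
```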

2. As an aside, research has been moving towards supersense tagging
(SST), given the difficulty of WSD.

http://ttic.uchicago.edu/~altun/pubs/CiaAlt_EMNLP06.pdf

As you can see in the above paper, SST is approached as a sequence
labelling task rather than classification. This means that we could
reimplement the Ciaramita and Altun (2006) features using
AdaptiveFeatureGenerators and create a module structurally similar
to the NameFinder, but for SST.
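Concretely, framed as sequence labelling, SST assigns each token a BIO-encoded supersense tag, so multi-word units are captured the same way the NameFinder captures names. The tags and per-token window features below are assumptions loosely in the spirit of Ciaramita and Altun (2006), not the paper's exact feature set:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: one feature vector per token over a small context
// window, as an AdaptiveFeatureGenerator implementation might produce,
// paired with BIO-encoded supersense labels for training.
public class SSTSequenceSketch {

    static List<String> tokenFeatures(String[] tokens, int i) {
        List<String> f = new ArrayList<>();
        f.add("w0=" + tokens[i].toLowerCase());
        f.add("w-1=" + (i > 0 ? tokens[i - 1].toLowerCase() : "BOS"));
        f.add("w+1=" + (i + 1 < tokens.length ? tokens[i + 1].toLowerCase() : "EOS"));
        f.add("cap=" + Character.isUpperCase(tokens[i].charAt(0)));
        return f;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "sea", "bass", "swam"};
        // Example gold labels: "sea bass" is one noun.animal unit.
        String[] bio = {"O", "B-noun.animal", "I-noun.animal", "B-verb.motion"};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(bio[i] + " <- " + tokenFeatures(tokens, i));
        }
    }
}
```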

This also has the advantage of letting us move beyond the old SemCor and
Senseval datasets to current tweet datasets and so on. See this recent
paper on SST for tweets:

http://aclweb.org/anthology/S14-1001

I think that for supervised WSD we should pursue option 1 or 2, and
start defining the interface as Jörn has suggested.

Best,

Rodrigo


Re: GSoC 2015 - WSD Module

2015-05-22 Thread Joern Kottmann
Hello,

one of the tasks we should start with is to define the interface for the WSD
component.

Please have a look at the other components in OpenNLP and try to propose an
interface in a similar style.
Can we use one interface for all the different implementations?

Jörn




Re: GSoC 2015 - WSD Module

2015-05-18 Thread Mondher Bouazizi
Dear all,

Sorry if you received multiple copies of this email (The links were
embedded). Here are the actual links:

*Figure:*
https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
*Semeval/senseval results summary:*
https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
*Literature survey of WSD techniques:*
https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing

Yours faithfully



RE: GSoC 2015 - WSD Module

2015-05-18 Thread Anthony Beylerian
Please excuse the duplicate email, we could not attach the mentioned figure. 
Kindly find it here.
Thank you.


GSoC 2015 - WSD Module

2015-05-18 Thread Anthony Beylerian
Dear all,

In the context of building a Word Sense Disambiguation (WSD) module, after 
doing a survey on WSD techniques, we realized the following points:

- WSD techniques can be split into three sets (supervised, 
unsupervised/knowledge-based, hybrid).

- WSD is used for different, directly related objectives such as all-words 
disambiguation, lexical sample disambiguation, multi/cross-lingual 
approaches, etc.

- Senseval/Semeval seem to be good references to compare different techniques 
for WSD, since many of them were tested on the same data (but different data 
each event).

- For the sake of making a first solution, we propose to start with supporting 
the "lexical sample" type of disambiguation, meaning to disambiguate 
single/limited word(s) from an input text.

Therefore, we have decided to collect information about the different 
techniques in the literature (such as references, performance, parameters 
etc.) in this spreadsheet here.
Otherwise, we have also collected the results of all the Senseval/Semeval 
exercises here.
(Note that each document has many sheets.)
The collected results could help decide on which techniques to start with as 
main models for each set of techniques (supervised/unsupervised).

We also propose a general approach for the package in the figure attached.
The main components are as follows:

1- The different resources publicly available: WordNet, BabelNet, Wikipedia, 
etc. However, we would also like to allow users to use their own local 
resources, maybe by defining a type of connector to the resource interface.

2- The resource interface will have the role of providing both a sense 
inventory that the user can query and a knowledge base (such as semantic or 
syntactic info, etc.) that might be used depending on the technique. We might 
even later consider building a local cache for remote services.

3- The WSD algorithms/techniques themselves, which will make use of the 
resource interface to access the required resources. These techniques will be 
split into two main packages, as in the left side of the figure: 
Supervised/Unsupervised. The utils package includes common tools used in both 
types of techniques. The details mentioned in each package should be common to 
all implementations of these abstract models.

4- I/O could be processed in different formats (XML/JSON etc.) or a simpler 
structure, following your recommendations.

If you have any suggestions or recommendations, we would really appreciate 
discussing them and would like your guidance to iterate on this tool-set.

Best regards,

Anthony Beylerian, Mondher Bouazizi
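A minimal sketch of the resource-interface idea described in point 2 above: a connector exposing a sense inventory, so WordNet, BabelNet, or a user's own local resource could sit behind the same interface. All names here are hypothetical, not an agreed API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resource connector: any backing resource (WordNet, BabelNet,
// a local file) implements SenseInventory, so WSD algorithms query senses
// without knowing which resource is behind it.
public class ResourceSketch {

    interface SenseInventory {
        List<String> getSenses(String lemma);
    }

    // Toy in-memory inventory standing in for a real WordNet-backed one; a
    // remote service could implement the same interface, possibly fronted
    // by the local cache suggested above.
    static class MapSenseInventory implements SenseInventory {
        private final Map<String, List<String>> senses = new HashMap<>();
        void add(String lemma, String... ids) {
            senses.put(lemma, Arrays.asList(ids));
        }
        @Override
        public List<String> getSenses(String lemma) {
            return senses.getOrDefault(lemma, Collections.emptyList());
        }
    }

    public static void main(String[] args) {
        MapSenseInventory inv = new MapSenseInventory();
        inv.add("bank", "bank%1:17:01", "bank%1:14:00");
        System.out.println(inv.getSenses("bank").size()); // 2
    }
}
```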