Re: GSoC 2015 - WSD Module

2015-06-05 Thread Joern Kottmann
Hello,

Yes, WordNet is fine; we already depend on it. I just think that remote
resources are particularly problematic.

For local resources it boils down to their license.

Here is the WordNet one:
http://wordnet.princeton.edu/wordnet/license/

We might even be able to redistribute this here at Apache, which would be
really nice. To do that, we have to check with the legal list whether they
give a green light for it.

You can get more information about licenses and dependencies for Apache
projects here:
http://www.apache.org/legal/resolved.html#category-a
http://www.apache.org/legal/resolved.html#category-b
http://www.apache.org/legal/resolved.html#category-x

Is the cleanup you have to do of a kind that couldn't wait until after you
send in a patch?
One example would be removing code which cannot be released under the ASL.

We would like to get you integrated into the way we work here as quickly as
possible.

That includes:
- Tasks are planned and tracked via Jira (this allows other people to
  comment and follow along)
- We would like to be able to review your code and maybe give some advice
  (commit often, break things down into tasks)
- Changes or new features are usually discussed on the dev list (e.g. a
  short write-up about the approaches you implemented or, better, plan to
  implement)

Jörn





RE: GSoC 2015 - WSD Module

2015-06-05 Thread Anthony Beylerian
Hi,

We understand the issues.

So just to make sure, we are currently relying on JWNL to access WordNet as a
resource. Is that fine for now?

In case we need to avoid such dependencies, would it be OK to create a
resource file that includes what we need, extracted from it or from other
resources combined (sense inventory, word relationships and so on)?
We'd like your recommendation.
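
For illustration, here is a minimal sketch of how such a combined local
resource file could be loaded. It assumes a hypothetical tab-separated layout
with one sense per line (lemma, sense id, gloss). The class and method names
are invented for this sketch and are not part of OpenNLP or JWNL.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory sense inventory loaded from a local TSV resource. */
final class SenseInventory {

  /** One sense entry: an identifier plus its gloss (definition text). */
  static final class Sense {
    final String id;
    final String gloss;

    Sense(String id, String gloss) {
      this.id = id;
      this.gloss = gloss;
    }
  }

  private final Map<String, List<Sense>> sensesByLemma = new HashMap<>();

  /** Parses lines of the assumed form: lemma TAB senseId TAB gloss. */
  static SenseInventory parse(Reader in) {
    SenseInventory inventory = new SenseInventory();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isEmpty() || line.startsWith("#")) {
          continue; // skip blank lines and comment lines
        }
        String[] fields = line.split("\t", 3);
        if (fields.length != 3) {
          throw new IllegalArgumentException("Malformed line: " + line);
        }
        inventory.sensesByLemma
            .computeIfAbsent(fields[0], k -> new ArrayList<>())
            .add(new Sense(fields[1], fields[2]));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return inventory;
  }

  /** Returns the senses of a lemma in file order, or an empty list. */
  List<Sense> senses(String lemma) {
    return sensesByLemma.getOrDefault(lemma, Collections.emptyList());
  }
}
```

A file like this keeps every lookup on disk or in memory, so nothing has to be
queried remotely at disambiguation time. Whether the extracted data may be
redistributed still depends on the license of each source.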

Also we are currently cleaning up the project and will upload a patch.
To sum up, we have already implemented the Lesk approach, as well as parts of 
the supervised IMS approach (preprocessing, feature extraction).
Next, we will implement the baseline techniques and collect the training data 
that will be used by supervised approaches.
Files will be collected from different sources and will be unified in a single 
model file.

Best regards,

Anthony, Mondher
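
For readers following the thread, here is a minimal sketch of the simplified
Lesk idea behind a tokenized disambiguate call. It is not the code from the
patch. The interface name WSDisambiguator comes from the draft, but the
(tokens, targetIndex) signature, the SimplifiedLesk class, and the toy gloss
map standing in for a WordNet-backed inventory are assumptions made for this
example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Assumed shape of the interface once the input is tokenized. */
interface WSDisambiguator {
  /** Returns sense ids for tokens[targetIndex], best-scoring first. */
  String[] disambiguate(String[] tokens, int targetIndex);
}

/** Simplified Lesk: rank senses by word overlap between gloss and context. */
final class SimplifiedLesk implements WSDisambiguator {

  // lemma -> (senseId -> gloss); a toy stand-in for a real sense inventory
  private final Map<String, Map<String, String>> glosses;

  SimplifiedLesk(Map<String, Map<String, String>> glosses) {
    this.glosses = glosses;
  }

  @Override
  public String[] disambiguate(String[] tokens, int targetIndex) {
    String target = tokens[targetIndex].toLowerCase(Locale.ROOT);
    Map<String, String> candidates =
        glosses.getOrDefault(target, new LinkedHashMap<>());

    // The context is every token except the target itself.
    Set<String> context = new HashSet<>();
    for (int i = 0; i < tokens.length; i++) {
      if (i != targetIndex) {
        context.add(tokens[i].toLowerCase(Locale.ROOT));
      }
    }

    // Score each sense by the number of distinct gloss words in the context.
    Map<String, Integer> scores = new LinkedHashMap<>();
    for (Map.Entry<String, String> sense : candidates.entrySet()) {
      String gloss = sense.getValue().toLowerCase(Locale.ROOT);
      Set<String> overlap = new HashSet<>(Arrays.asList(gloss.split("\\s+")));
      overlap.retainAll(context);
      scores.put(sense.getKey(), overlap.size());
    }

    // List.sort is stable, so ties keep the inventory order.
    List<String> ranked = new ArrayList<>(scores.keySet());
    ranked.sort((a, b) -> Integer.compare(scores.get(b), scores.get(a)));
    return ranked.toArray(new String[0]);
  }
}
```

The sketch does no stop-word filtering, lemmatization, or gloss expansion, all
of which a practical Lesk variant would likely need. A word missing from the
inventory yields an empty array.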


> Date: Wed, 3 Jun 2015 16:47:50 +0200
> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
> 
> We should not use remote resources. A remote service adds severe limits to
> the WSD component. A remote resource will be slow to query (compared to
> disk or memory), queries might be expensive (pay per request), and the
> license might not allow usage in the way the ASL promises to our users.
> Another issue is that calling a remote service might leak the document text
> itself to that remote service.
> 
> Please attach a patch to the jira issue, and then we can pull it into the
> sandbox.
> 
> Jörn
> 
> 
> 
> 
> 
> On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian <
> anthonybeyler...@hotmail.com> wrote:
> 
> > Dear Jörn,
> >
> > Thank you for the reply.
> > ===
> > Yes, in the draft WSDisambiguator is the main interface.
> > ===
> > Yes, for the disambiguate method the input is expected to be tokenized;
> > it should be an input array.
> > The second argument is the token index. We could also make it an index
> > array to support multiple words.
> > ===
> > Concerning the resources, we expect two types: local and remote
> > resources.
> >
> > + For local resources, we have two main types:
> > 1- training models for supervised techniques
> > 2- knowledge resources
> >
> > For #1, it would probably be best to package them the way existing
> > OpenNLP models are packaged.
> > As for #2, it will depend on what we want to use, since the type of
> > information depends on the specific technique.
> >
> > + As for remote resources, e.g. [BabelNet], [WordsAPI], etc., we might
> > need some REST support, for example to retrieve a sense inventory for a
> > certain word. Actually, the newest SemEval task [Semeval15] will use
> > [BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline
> > version, but the newest one is only available through REST. Also, in case
> > a remote resource is needed and it requires a license, we need to use a
> > license key or just use the free quota with no key.
> >
> > Therefore, we thought of having a [ResourceProvider] as mentioned in the
> > [draft].
> > Are there any plans to add an external API connector of this sort, or is
> > this functionality already possible through an extension?
> > (I noticed there is a [wikinews_importer] in the sandbox.)
> >
> > But in any case we can always start by working only locally as a first
> > step. What do you think?
> > ===
> > It would be more straightforward to use the algorithm names, so OK, why
> > not.
> > ===
> > Yes, we have already started working!
> > What do we need to push to the sandbox?
> > ===
> >
> > Thanks!
> >
> > Anthony
> >
> > [BabelNet] : http://babelnet.org/download
> > [WordsAPI] : https://www.wordsapi.com/
> > [Semeval15] : http://alt.qcri.org/semeval2015/task13/
> > [draft] :
> > https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1
> >
> >
> > > Subject: Re: GSoC 2015 - WSD Module
> > > From: kottm...@gmail.com
> > > To: dev@opennlp.apache.org
> > > Date: Mon, 1 Jun 2015 20:30:08 +0200
> > >
> > > Hello,
> > >
> > > I had a look at your APIs.
> > >
> > > Let's start with the WSDisambiguator. Should that be an interface?
> > >
> > > // returns the senses ordered by their score (best one first, or only
> > > // one in the supervised case)
> > > String[] disambiguate(String inputText, int inputWordposition);
> > >
> > > Shouldn't we have a tokenized input? Or is the inputText a token?
> > >
> > > If you have resources you could package those into OpenNLP models and
> > > use the existing serialization support. Would that work for you?
> > >
> > > I think we should have different implementing classes for different
> > > algorithms rather than grouping that in the Supervised and Unsupervised
> > > classes. And also use the algorithm / approach name as part of the class
> > > name.
> > >
> > > As far as I understand you