Dear all,

Thank you, Anthony, for the detailed explanation.
Regarding the parser/converter classes that Anthony mentioned, I think it would be better to make an independent component in OpenNLP that processes SemCor data. NLTK [1], for example, a Python library for natural language processing, contains a component to read SemCor data [2] which can be used by the other components (not only the WSD one). For now, I am using MIT JSemcor [3] (which is MIT licensed) to read SemCor files, but as soon as I finish the implementation of the remaining parts of IMS (all-words WSD [4] / coarse-grained vs. fine-grained), I will implement our own SemCor reader.

In the meantime, I will clean the code of IMS and make it independent from the format of the training data source: all data will pass through a connector. I will run the first tests using SemCor data. The evaluator will use SemCor data for training and the data collected from Senseval-3 for testing (to compare the different approaches implemented).

Also, please watch the issues ([4]-[9]) so you can get updates each time we add a patch for a component.

Thanks.

Best regards,

Mondher

[1] http://www.nltk.org/api/nltk.html
[2] http://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.semcor
[3] http://projects.csail.mit.edu/jsemcor/
[4] https://issues.apache.org/jira/browse/OPENNLP-797
[5] https://issues.apache.org/jira/browse/OPENNLP-789
[6] https://issues.apache.org/jira/browse/OPENNLP-790
[7] https://issues.apache.org/jira/browse/OPENNLP-794
[8] https://issues.apache.org/jira/browse/OPENNLP-795
[9] https://issues.apache.org/jira/browse/OPENNLP-796

On Tue, Jul 14, 2015 at 1:54 AM, Anthony Beylerian <[email protected]> wrote:

> Dear Rodrigo,
>
> Thank you for the feedback.
>
> I have added issues [1][2][3] regarding the points below.
>
> Concerning the testers (IMSTester etc.), they should be in src/test/java/....
> We can add docs in those to explain how to use each implementation.
>
> Actually, I am using the parser for Senseval-3 that Mondher mentioned in
> [LeskEvaluatorTest]; the functionality was included in DataExtractor.
> I believe it would be best to separate that and have two parser/converter
> classes of this sort:
>
> disambiguator.reader.SemCorReader
> disambiguator.reader.SensevalReader
>
> That should be clearer. What do you think?
>
> Anthony
>
> [1]: https://issues.apache.org/jira/browse/OPENNLP-794
> [2]: https://issues.apache.org/jira/browse/OPENNLP-795
> [3]: https://issues.apache.org/jira/browse/OPENNLP-796
>
> > From: [email protected]
> > Date: Mon, 13 Jul 2015 15:50:00 +0200
> > Subject: Re: WSD - Supervised techniques
> > To: [email protected]
> >
> > Hello,
> >
> > There has been little public activity these last days. We believe it is
> > very important to step up in several directions with respect to what is
> > already committed in svn:
> >
> > 1. Finish the WSDEvaluator.
> > 2. Provide the classes required to run the WSD tools from the CLI like
> > any other component.
> > 3. Formats: it would be interesting to have at least converters for the
> > most common datasets used for evaluation and training, e.g., SemCor and
> > Senseval-3. You have mentioned that a converter was already
> > implemented, but I cannot find it in svn.
> > 4. Write the documentation so that future users (and other dev members
> > here) can test the component.
> >
> > These comments are general for both unsupervised and supervised WSD.
> > Specific to supervised WSD:
> >
> > 5. IMS: you mention in your previous email that the lexical sample
> > part is done and that you need to finish the all-words IMS
> > implementation. If this is the case, a JIRA issue should be opened
> > about it and made a priority.
> > Incidentally, I cannot find the IMSTester you mentioned in the email.
> >
> > There is an issue already there for the Evaluator (OPENNLP-790), but I
> > think that each of the remaining tasks requires its own JIRA issue
> > (that issue has pending unused imports, variables and other things).
> >
> > The aim before GSoC ends should be to have the best chance of having
> > the WSD component as a good candidate for integration into the opennlp
> > tools. Also, by being able to test it, we can see the actual state of
> > the component with respect to performance on the usual datasets.
> >
> > Can you please create such issues in JIRA and start addressing them
> > separately?
> >
> > Thanks,
> >
> > Rodrigo
> >
> >
> > On Sun, Jun 28, 2015 at 6:33 PM, Mondher Bouazizi
> > <[email protected]> wrote:
> > > Hi everyone,
> > >
> > > I finished the first iteration of the IMS approach for lexical sample
> > > disambiguation. Please find the patch uploaded to the JIRA issue [1].
> > > I also created a tester (IMSTester) to run it.
> > >
> > > As I mentioned before, the approach is as follows: each time the
> > > module is called to disambiguate a word, it first checks whether the
> > > model file for that word exists.
> > >
> > > 1. If the model file exists, it is used to disambiguate the word.
> > >
> > > 2. Otherwise, if the file does not exist, the module checks whether
> > > the training data file for that word exists. If it does, the XML data
> > > is used to train the model and create the model file.
> > >
> > > 3. If no training data exist, the most frequent sense (MFS) in
> > > WordNet is returned.
> > >
> > > For now, I am using the training data I collected from the Senseval
> > > and SemEval websites. However, I am currently checking SemCor to use
> > > it as a main reference.
> > >
> > > Yours sincerely,
> > >
> > > Mondher
> > >
> > > [1] https://issues.apache.org/jira/browse/OPENNLP-757
> > >
> > >
> > > On Thu, Jun 25, 2015 at 5:27 AM, Joern Kottmann <[email protected]> wrote:
> > >
> > >> On Fri, 2015-06-19 at 21:42 +0900, Mondher Bouazizi wrote:
> > >> > Hi,
> > >> >
> > >> > Actually, I have finished the implementation of most of the parts
> > >> > of the IMS approach. I also made a parser for the Senseval-3 data.
> > >> >
> > >> > However, I am currently working on two main points:
> > >> >
> > >> > - I am trying to figure out how to use the MaxEnt classifier.
> > >> > Unfortunately, there is not enough documentation, so I am trying
> > >> > to see how it is used by the other components of OpenNLP. Any
> > >> > recommendations?
> > >>
> > >> Yes, have a look at the doccat component. It should be easy to
> > >> understand from it how it works. The classifier has to be trained
> > >> with events (outcome and features) and can then classify a set of
> > >> features into the categories it has seen before as outcomes.
> > >>
> > >> Jörn
> > >>
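For clarity, the per-word fallback I described in my June 28 email (quoted above) can be sketched roughly as follows. The class and method names here (ImsFallbackSketch, ModelStore, SenseSource) are placeholders for illustration, not the classes in the actual patch:

```java
/**
 * Rough sketch (illustrative names only) of the per-word fallback:
 * model file -> train from XML data -> WordNet most frequent sense.
 */
class ImsFallbackSketch {

    /** Which of the three cases was taken for a given word. */
    enum SenseSource { MODEL_FILE, TRAINED_FROM_DATA, WORDNET_MFS }

    /** Stand-in for the per-word model / training-data file checks. */
    interface ModelStore {
        boolean hasModelFile(String word);
        boolean hasTrainingData(String word);
    }

    static SenseSource disambiguationSource(ModelStore store, String word) {
        // 1. A trained model file for this word already exists: use it.
        if (store.hasModelFile(word)) {
            return SenseSource.MODEL_FILE;
        }
        // 2. No model yet, but XML training data exists: train a model,
        //    write the model file, then disambiguate with it.
        if (store.hasTrainingData(word)) {
            return SenseSource.TRAINED_FROM_DATA;
        }
        // 3. Neither exists: back off to the most frequent sense (MFS)
        //    listed in WordNet.
        return SenseSource.WORDNET_MFS;
    }
}
```

In other words, training is lazy and per-word: a model file is only built the first time a word with available training data is queried.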

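The reader/converter split Anthony suggested (disambiguator.reader.SemCorReader and disambiguator.reader.SensevalReader behind a common connector) could look roughly like this; SenseAnnotatedReader and Sample are illustrative names only, not code from the patch:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

/** A format-neutral training instance; illustrative, not the actual patch. */
final class Sample {
    private final String targetWord;
    private final List<String> context;
    private final String senseId;

    Sample(String targetWord, List<String> context, String senseId) {
        this.targetWord = targetWord;
        this.context = context;
        this.senseId = senseId;
    }

    String targetWord() { return targetWord; }
    List<String> context() { return context; }
    String senseId() { return senseId; }
}

/**
 * The connector: IMS training would consume Samples without caring whether
 * they came from SemCor or Senseval-3. Concrete readers, e.g.
 * disambiguator.reader.SemCorReader and disambiguator.reader.SensevalReader,
 * would each implement this interface.
 */
interface SenseAnnotatedReader {
    /** Parses one corpus file into format-neutral training samples. */
    List<Sample> read(Path corpusFile) throws IOException;
}
```

This keeps the IMS trainer independent of the corpus format, so adding another dataset later only means adding another reader.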