Dear all,

Thank you, Anthony, for the detailed explanation.
Regarding the parser/converter classes that Anthony mentioned, I think it would be better to make an independent component in OpenNLP that processes SemCor data. NLTK [1], for example, a Python library for natural language processing, contains a component to read SemCor data [2] which can be used by the other components (not only the WSD one). For now, I am using MIT JSemcor [3] (which is MIT licensed) to read SemCor files, but as soon as I finish the implementation of the remaining parts of IMS (all-words WSD [4] / coarse-grained vs. fine-grained), I will implement our own SemCor reader.

In the meantime, I will clean the code of IMS and make it independent from the format of the training data source: all data will pass through a connector. I will run the first tests using SemCor data. The evaluator will use SemCor data for training and the data collected from Senseval-3 for testing (to compare the different approaches implemented).

Also, please watch the issues ([4]-[9]) so you can get updates each time we add a patch for a component.

Thanks.

Best regards,

Mondher

[1] http://www.nltk.org/api/nltk.html
[2] http://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.semcor
[3] http://projects.csail.mit.edu/jsemcor/
[4] https://issues.apache.org/jira/browse/OPENNLP-797
[5] https://issues.apache.org/jira/browse/OPENNLP-789
[6] https://issues.apache.org/jira/browse/OPENNLP-790
[7] https://issues.apache.org/jira/browse/OPENNLP-794
[8] https://issues.apache.org/jira/browse/OPENNLP-795
[9] https://issues.apache.org/jira/browse/OPENNLP-796

On Tue, Jul 14, 2015 at 1:54 AM, Anthony Beylerian <[email protected]> wrote:

> Dear Rodrigo,
>
> Thank you for the feedback.
>
> I have added issues [1][2][3] regarding the points below.
>
> Concerning the testers (IMSTester etc.), they should be in src/test/java/....
> We can add docs in those to explain how to use each implementation.
>
> Actually, I am using the parser for Senseval-3 that Mondher mentioned in
> [LeskEvaluatorTest]; the functionality was included in DataExtractor.
> I believe it would be best to separate that and have two parser/converter
> classes of this sort:
>
> disambiguator.reader.SemCorReader
> disambiguator.reader.SensevalReader
>
> That should be clearer. What do you think?
>
> Anthony
>
> [1]: https://issues.apache.org/jira/browse/OPENNLP-794
> [2]: https://issues.apache.org/jira/browse/OPENNLP-795
> [3]: https://issues.apache.org/jira/browse/OPENNLP-796
>
> > From: [email protected]
> > Date: Mon, 13 Jul 2015 15:50:00 +0200
> > Subject: Re: WSD - Supervised techniques
> > To: [email protected]
> >
> > Hello,
> >
> > There has been little public activity these last days. We believe it is
> > very important to step up in several directions with respect to what is
> > already committed in svn:
> >
> > 1. Finish the WSDEvaluator.
> > 2. Provide the classes required to run the WSD tools from the CLI like
> > any other component.
> > 3. Formats: it would be interesting to have at least converters for the
> > most common datasets used for evaluation and training, e.g., SemCor and
> > Senseval-3. You have mentioned that a converter was already
> > implemented, but I cannot find it in svn.
> > 4. Write the documentation so that future users (and other dev members
> > here) can test the component.
> >
> > These comments are general for both unsupervised and supervised WSD.
> > Specific to supervised WSD:
> >
> > 5. IMS: you mention in your previous email that the lexical sample
> > part is done and that you need to finish the all-words IMS
> > implementation. If this is the case, a JIRA issue should be opened
> > about it and made a priority.
> > Incidentally, I cannot find the IMSTester you mentioned in the email.
> >
> > There is an issue already there for the Evaluator (OPENNLP-790), but I
> > think that each of the remaining tasks requires its own JIRA issue
> > (that issue has pending unused imports, variables and other things).
> >
> > The aim before GSoC ends should be to have the best chance of having
> > the WSD component as a good candidate for integration into the opennlp
> > tools. Also, by being able to test it, we can see the actual state of
> > the component with respect to performance on the usual datasets.
> >
> > Can you please create such issues in JIRA and start addressing them
> > separately?
> >
> > Thanks,
> >
> > Rodrigo
> >
> >
> > On Sun, Jun 28, 2015 at 6:33 PM, Mondher Bouazizi
> > <[email protected]> wrote:
> > > Hi everyone,
> > >
> > > I finished the first iteration of the IMS approach for lexical sample
> > > disambiguation. Please find the patch uploaded to the JIRA issue [1].
> > > I also created a tester (IMSTester) to run it.
> > >
> > > As I mentioned before, the approach is as follows: each time the
> > > module is called to disambiguate a word, it first checks whether the
> > > model file for that word exists.
> > >
> > > 1. If the model file exists, it is used to disambiguate the word.
> > >
> > > 2. Otherwise, if the file does not exist, the module checks whether
> > > the training data file for that word exists. If it does, the XML data
> > > is used to train the model and create the model file.
> > >
> > > 3. If no training data exist, the most frequent sense (MFS) in
> > > WordNet is returned.
> > >
> > > For now, I am using the training data I collected from the Senseval
> > > and SemEval websites. However, I am currently checking SemCor to use
> > > it as a main reference.
> > >
> > > Yours sincerely,
> > >
> > > Mondher
> > >
> > > [1] https://issues.apache.org/jira/browse/OPENNLP-757
> > >
> > >
> > > On Thu, Jun 25, 2015 at 5:27 AM, Joern Kottmann <[email protected]> wrote:
> > >
> > >> On Fri, 2015-06-19 at 21:42 +0900, Mondher Bouazizi wrote:
> > >> > Hi,
> > >> >
> > >> > Actually, I have finished the implementation of most of the parts
> > >> > of the IMS approach. I also made a parser for the Senseval-3 data.
> > >> >
> > >> > However, I am currently working on two main points:
> > >> >
> > >> > - I am trying to figure out how to use the MaxEnt classifier.
> > >> > Unfortunately, there is not enough documentation, so I am trying
> > >> > to see how it is used by the other components of OpenNLP. Any
> > >> > recommendations?
> > >>
> > >> Yes, have a look at the doccat component. It should be easy to
> > >> understand from it how it works. The classifier has to be trained
> > >> with events (outcome and features) and can then classify a set of
> > >> features into the categories it has seen before as outcomes.
> > >>
> > >> Jörn
> > >>
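For clarity, the per-word fallback I described in my June 28 email (quoted above) can be sketched roughly as follows. The class and method names here (ImsFallbackSketch, ModelStore, SenseSource) are placeholders for illustration, not the classes in the actual patch:

```java
/**
 * Rough sketch (illustrative names only) of the per-word fallback:
 * model file -> train from XML data -> WordNet most frequent sense.
 */
class ImsFallbackSketch {

    /** Which of the three cases was taken for a given word. */
    enum SenseSource { MODEL_FILE, TRAINED_FROM_DATA, WORDNET_MFS }

    /** Stand-in for the per-word model / training-data file checks. */
    interface ModelStore {
        boolean hasModelFile(String word);
        boolean hasTrainingData(String word);
    }

    static SenseSource disambiguationSource(ModelStore store, String word) {
        // 1. A trained model file for this word already exists: use it.
        if (store.hasModelFile(word)) {
            return SenseSource.MODEL_FILE;
        }
        // 2. No model yet, but XML training data exists: train a model,
        //    write the model file, then disambiguate with it.
        if (store.hasTrainingData(word)) {
            return SenseSource.TRAINED_FROM_DATA;
        }
        // 3. Neither exists: back off to the most frequent sense (MFS)
        //    listed in WordNet.
        return SenseSource.WORDNET_MFS;
    }
}
```

In other words, training is lazy and per-word: a model file is only built the first time a word with available training data is queried.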

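The reader/converter split Anthony suggested (disambiguator.reader.SemCorReader and disambiguator.reader.SensevalReader behind a common connector) could look roughly like this; SenseAnnotatedReader and Sample are illustrative names only, not code from the patch:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

/** A format-neutral training instance; illustrative, not the actual patch. */
final class Sample {
    private final String targetWord;
    private final List<String> context;
    private final String senseId;

    Sample(String targetWord, List<String> context, String senseId) {
        this.targetWord = targetWord;
        this.context = context;
        this.senseId = senseId;
    }

    String targetWord() { return targetWord; }
    List<String> context() { return context; }
    String senseId() { return senseId; }
}

/**
 * The connector: IMS training would consume Samples without caring whether
 * they came from SemCor or Senseval-3. Concrete readers, e.g.
 * disambiguator.reader.SemCorReader and disambiguator.reader.SensevalReader,
 * would each implement this interface.
 */
interface SenseAnnotatedReader {
    /** Parses one corpus file into format-neutral training samples. */
    List<Sample> read(Path corpusFile) throws IOException;
}
```

This keeps the IMS trainer independent of the corpus format, so adding another dataset later only means adding another reader.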