Hello,

There has been little public activity these last few days. We believe it is very important to step up on the following points with respect to what is already committed in svn:
1. Finish the WSDEvaluator.
2. Provide the classes required to run the WSD tools from the CLI, as with any other component.
3. Formats: it would be useful to have converters for at least the most common datasets used for evaluation and training, e.g., SemCor and Senseval-3. You have mentioned that a converter was already implemented, but I cannot find it in svn.
4. Write the documentation so that future users (and other dev members here) can test the component.

These comments are general, for both unsupervised and supervised WSD. Specific to supervised WSD:

5. IMS: you mentioned in your previous email that the lexical sample part is done and that you need to finish the all-words IMS implementation. If this is the case, a JIRA issue should be opened for it and made a priority. Incidentally, I cannot find the IMSTester you mentioned in the email.

There is already an issue for the Evaluator (OPENNLP-790), but I think that each of the remaining tasks requires its own JIRA issue (this issue has pending unused imports, variables and other things).

The aim before GSoC ends should be to give the WSD component the best chance of being a good candidate for integration into the OpenNLP tools. Also, by being able to test it, we can see the actual state of the component with respect to performance on the usual datasets.

Can you please create such issues in JIRA and start addressing them separately?

Thanks,

Rodrigo

On Sun, Jun 28, 2015 at 6:33 PM, Mondher Bouazizi <[email protected]> wrote:

> Hi everyone,
>
> I finished the first iteration of the IMS approach for lexical sample
> disambiguation. Please find the patch uploaded on the jira issue [1]. I
> also created a tester (IMSTester) to run it.
>
> As I mentioned before, the approach is as follows: each time the module is
> called to disambiguate a word, it first checks if the model file for that
> word exists.
>
> 1- If the "model" file exists, it is used to disambiguate the word.
>
> 2- Otherwise, if the file does not exist, the module checks if the training
> data file for that word exists. If it does, the xml file data will be used
> to train the model and create the model file.
>
> 3- If no training data exist, the most frequent sense (mfs) in WordNet is
> returned.
>
> For now I am using the training data I collected from the Senseval and Semeval
> websites. However, I am currently checking SemCor to use it as a main
> reference.
>
> Yours sincerely,
>
> Mondher
>
> [1] https://issues.apache.org/jira/browse/OPENNLP-757
>
>
>
> On Thu, Jun 25, 2015 at 5:27 AM, Joern Kottmann <[email protected]> wrote:
>
>> On Fri, 2015-06-19 at 21:42 +0900, Mondher Bouazizi wrote:
>> > Hi,
>> >
>> > Actually I have finished the implementation of most of the parts of the IMS
>> > approach. I also made a parser for the Senseval-3 data.
>> >
>> > However I am currently working on two main points:
>> >
>> > - I am trying to figure out how to use the MaxEnt classifier. Unfortunately
>> > there is not enough documentation, so I am trying to see how it is used by
>> > the other components of OpenNLP. Any recommendation?
>>
>> Yes, have a look at the doccat component. It should be easy to
>> understand from it how it works. The classifier has to be trained with
>> events (each an outcome plus features) and can then classify a set of
>> features into the categories it has seen before as outcomes.
>>
>> Jörn
>>
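
For reference, the per-word fallback described in steps 1-3 of Mondher's message could be sketched roughly as follows. This is only a minimal illustration and not the code in the patch: the class name, the enum, and the assumed file layout (<dir>/<word>.model and <dir>/<word>.xml) are hypothetical, and the actual model training and the WordNet lookup are left out.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch of the per-word fallback; not the code from the patch. */
final class WsdFallbackSketch {

  /** The three outcomes listed in steps 1-3 of the quoted message. */
  enum Strategy { USE_EXISTING_MODEL, TRAIN_THEN_USE_MODEL, MOST_FREQUENT_SENSE }

  /** Pure decision logic, kept separate from file access so it is easy to test. */
  static Strategy decide(boolean modelExists, boolean trainingDataExists) {
    if (modelExists) {
      return Strategy.USE_EXISTING_MODEL;      // step 1: a model file is already there
    }
    if (trainingDataExists) {
      return Strategy.TRAIN_THEN_USE_MODEL;    // step 2: train from the xml data first
    }
    return Strategy.MOST_FREQUENT_SENSE;       // step 3: fall back to the MFS
  }

  /** Assumed layout (illustrative only): modelDir/word.model and trainingDir/word.xml. */
  static Strategy decide(Path modelDir, Path trainingDir, String word) {
    Path model = modelDir.resolve(word + ".model");
    Path training = trainingDir.resolve(word + ".xml");
    return decide(Files.isRegularFile(model), Files.isRegularFile(training));
  }

  public static void main(String[] args) {
    // Directory names and the default word are placeholders for a quick manual check.
    Path models = Paths.get(args.length > 0 ? args[0] : "models");
    Path training = Paths.get(args.length > 1 ? args[1] : "training");
    String word = args.length > 2 ? args[2] : "bank";
    System.out.println(word + " -> " + decide(models, training, word));
  }
}
```

Keeping the decision separate from the file checks would also make it straightforward to cover this behaviour in a unit test once the tester is in svn.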
