Re: GSoC 2015 - WSD Module
Hello,

One of the tasks we should start with is defining the interface for the WSD component. Please have a look at the other components in OpenNLP and try to propose an interface in a similar style. Can we use one interface for all the different implementations?

Jörn

On Mon, May 18, 2015 at 3:27 PM, Mondher Bouazizi mondher.bouaz...@gmail.com wrote:

Dear all,

Sorry if you received multiple copies of this email (the links were embedded). Here are the actual links:

Figure: https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
Semeval/Senseval results summary: https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
Literature survey of WSD techniques: https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing

Yours faithfully

On Mon, May 18, 2015 at 10:17 PM, Anthony Beylerian anthonybeyler...@hotmail.com wrote:

Please excuse the duplicate email; we could not attach the mentioned figure. Kindly find it here. Thank you.

From: anthonybeyler...@hotmail.com
To: dev@opennlp.apache.org
Subject: GSoC 2015 - WSD Module
Date: Mon, 18 May 2015 22:14:43 +0900

Dear all,

In the context of building a Word Sense Disambiguation (WSD) module, after doing a survey of WSD techniques, we noted the following points:

- WSD techniques can be split into three sets: supervised, unsupervised/knowledge-based, and hybrid.
- WSD is used for different, directly related objectives, such as all-words disambiguation, lexical sample disambiguation, multi-/cross-lingual approaches, etc.
- Senseval/Semeval seem to be good references for comparing different WSD techniques, since many of them were tested on the same data (though on different data at each event).
- As a first solution, we propose to start with supporting the lexical sample type of disambiguation, i.e. disambiguating a single word or a limited set of words from an input text.
Therefore, we have decided to collect information about the different techniques in the literature (references, performance, parameters, etc.) in this spreadsheet here. We have also collected the results of all the Senseval/Semeval exercises here. (Note that each document has many sheets.) The collected results could help decide which techniques to start with as the main models for each set of techniques (supervised/unsupervised).

We also propose a general approach for the package in the attached figure. The main components are as follows:

1. The different publicly available resources: WordNet, BabelNet, Wikipedia, etc. However, we would also like to allow users to use their own local resources, perhaps by defining a type of connector to the resource interface.
2. The resource interface, whose role is to provide both a sense inventory that the user can query and a knowledge base (semantic or syntactic information, etc.) that might be used depending on the technique. We might later consider building a local cache for remote services.
3. The WSD algorithms/techniques themselves, which make use of the resource interface to access the required resources. These techniques will be split into two main packages, as on the left side of the figure: Supervised/Unsupervised. The utils package includes common tools used in both types of techniques. The details mentioned in each package should be common to all implementations of these abstract models.
4. I/O could be processed in different formats (XML/JSON, etc.) or in a simpler structure, following your recommendations.

If you have any suggestions or recommendations, we would really appreciate discussing them, and we would like your guidance as we iterate on this tool-set.

Best regards,
Anthony Beylerian, Mondher Bouazizi
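[Editor's sketch] The resource interface described in point 2 could be pictured roughly as below. All names here (SenseInventory, MapSenseInventory) are hypothetical and for illustration only; they are not part of OpenNLP or of the proposal itself. A real backing store would wrap WordNet, BabelNet, etc.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sense inventory interface; names are illustrative only. */
interface SenseInventory {
    /** Returns the candidate senses for a lemma, or an empty list if unknown. */
    List<String> getSenses(String lemma);
}

/** A trivial in-memory store, standing in for WordNet/BabelNet/etc. */
class MapSenseInventory implements SenseInventory {
    private final Map<String, List<String>> senses = new HashMap<>();

    void addSenses(String lemma, List<String> senseKeys) {
        senses.put(lemma, senseKeys);
    }

    @Override
    public List<String> getSenses(String lemma) {
        return senses.getOrDefault(lemma, Collections.emptyList());
    }
}

public class SenseInventoryDemo {
    public static void main(String[] args) {
        MapSenseInventory inventory = new MapSenseInventory();
        // WordNet-style sense keys, purely as example data
        inventory.addSenses("bank", java.util.Arrays.asList("bank%1:14:00::", "bank%1:17:01::"));
        System.out.println(inventory.getSenses("bank").size());    // prints 2
        System.out.println(inventory.getSenses("unknown").size()); // prints 0
    }
}
```

A connector for a user-supplied local resource (point 1) would then just be another SenseInventory implementation, and a cache for remote services could wrap any implementation transparently.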
Re: GSoC 2015 - WSD Module
Hello Mondher (my response is about supervised WSD),

Thanks for the info, it is quite interesting. Apart from the comment by Jörn, which I think is very important if we want to achieve something given the time constraints of GSoC, I have a couple of recommendations/comments from my part:

1. Rather than targeting the lexical sample task or all-words WSD, I think it could be more practical to choose an approach/algorithm and try to implement it in OpenNLP. One of the most popular approaches (if not the most popular) is the It Makes Sense (IMS) system: http://www.comp.nus.edu.sg/~nlp/sw/README.txt https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf I think that is achievable in the GSoC time frame.

2. As an aside, research has been moving towards supersense tagging (SST), given the difficulty of WSD: http://ttic.uchicago.edu/~altun/pubs/CiaAlt_EMNLP06.pdf As you can see in the above paper, SST is approached as a sequence labelling task rather than as classification. This means that we could reimplement the Ciaramita and Altun (2006) features using the AdaptiveFeatureGenerators and create a module structurally similar to the NameFinder, but for SST. This also has the advantage of letting us move beyond the old SemCor and Senseval datasets to current tweet datasets and so on. See this recent paper on SST for tweets: http://aclweb.org/anthology/S14-1001

I think that for supervised WSD, we should pursue option 1 or 2 and start defining the interface as Jörn has suggested.
Best,
Rodrigo

On Mon, May 18, 2015 at 2:14 PM, Anthony Beylerian anthonybeyler...@hotmail.com wrote: [...]
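[Editor's sketch] IMS classifies each target word using simple surface features such as the surrounding words and local collocations (ordered token sequences around the target), fed to a linear classifier. The following sketch illustrates only the feature-extraction idea; the class and method names are hypothetical, and a real implementation would also add POS-tag features and the classifier itself, as described in the IMS paper linked above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of IMS-style surface features for a target token. */
public class ImsFeatures {

    /** Surrounding-word features in a +/- window around the target (excluding it). */
    static List<String> surroundingWords(String[] tokens, int target, int window) {
        List<String> feats = new ArrayList<>();
        int start = Math.max(0, target - window);
        int end = Math.min(tokens.length - 1, target + window);
        for (int i = start; i <= end; i++) {
            if (i != target) {
                feats.add("W=" + tokens[i].toLowerCase());
            }
        }
        return feats;
    }

    /** Local collocation C(i,j): the ordered tokens at offsets i..j from the target. */
    static String collocation(String[] tokens, int target, int i, int j) {
        StringBuilder sb = new StringBuilder("C" + i + "," + j + "=");
        for (int k = i; k <= j; k++) {
            int idx = target + k;
            sb.append(idx >= 0 && idx < tokens.length ? tokens[idx].toLowerCase() : "<pad>");
            sb.append(k < j ? "_" : "");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] sent = {"He", "sat", "on", "the", "bank", "of", "the", "river"};
        System.out.println(surroundingWords(sent, 4, 2)); // [W=on, W=the, W=of, W=the]
        System.out.println(collocation(sent, 4, -1, 1));  // C-1,1=the_bank_of
    }
}
```

In OpenNLP terms, features like these could plausibly be produced by custom AdaptiveFeatureGenerator implementations, which would also serve the SST-as-sequence-labelling route in option 2.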
Re: W2VClassesDictionary class
Hello,

You are right, I kept it there while I was doing the tests with the WordClusterFeatureGenerator. I will remove it.

Best,
R

On Fri, May 22, 2015 at 1:51 PM, Joern Kottmann kottm...@gmail.com wrote: [...]
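[Editor's sketch] For context, a word-cluster dictionary of the kind WordClusterDictionary provides maps surface words to cluster identifiers (e.g. Brown-cluster bit-strings or word2vec class ids), so that a feature generator can emit a shared cluster id for words the model has rarely or never seen. The sketch below illustrates the idea only; it is not the actual OpenNLP implementation, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative word-to-cluster mapping; not the real OpenNLP class. */
public class WordClusterSketch {
    private final Map<String, String> wordToCluster = new HashMap<>();

    public void put(String word, String clusterId) {
        wordToCluster.put(word.toLowerCase(), clusterId);
    }

    /** Returns the cluster id, or null when the word is out of vocabulary. */
    public String lookup(String word) {
        return wordToCluster.get(word.toLowerCase());
    }

    public static void main(String[] args) {
        WordClusterSketch dict = new WordClusterSketch();
        dict.put("Paris", "1011");
        dict.put("London", "1011");
        dict.put("apple", "0010");
        // Two city names share a cluster, so a model can generalize across them.
        System.out.println(dict.lookup("paris"));  // prints 1011
        System.out.println(dict.lookup("banana")); // prints null
    }
}
```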
W2VClassesDictionary class
Hello,

It looks like this class was renamed to WordClusterDictionary. Can the class W2VClassesDictionary be removed? We shouldn't include it in RC4 if it is not necessary.

Thanks,
Jörn
OpenNLP RC4
Hello,

We should now be in a good state to do RC4. We finally solved the performance problems with the parser, and a couple of very minor things were fixed as well (e.g. the NOTICE file update). A major addition since RC3 is the automated evaluation tests to speed up our release process. I hope this will significantly reduce the amount of time required to ensure RC4 is working properly.

Jörn
Re: GSoC 2015 - WSD Module
Hi all,

Thanks, Rodrigo, for the feedback. I don't mind starting with an IMS implementation as a first supervised solution; it seems to be a good first step. As for SST, I will read more about it and let you know.

On the other hand, how about the following interface Anthony and I prepared based on Jörn's recommendation? We tried to stay as close as possible to the other tools already implemented.

Link: https://drive.google.com/file/d/0B7ON7bq1zRm3NTI1bGFfc3lZX0U/view?usp=sharing

Best regards,
Mondher, Anthony

On Fri, May 22, 2015 at 9:59 PM, Rodrigo Agerri rage...@apache.org wrote: [...]
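[Editor's sketch] The linked interface proposal is not reproduced in the thread itself. Purely as an illustration of the "OpenNLP-style interface" that Jörn asked for, a disambiguator interface with a trivial most-frequent-sense baseline might look like the following; every name here (WSDisambiguator, MostFrequentSenseDisambiguator) is hypothetical and not taken from the actual proposal.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical disambiguator interface in the style of OpenNLP components
 * such as Tokenizer or POSTagger; all names are illustrative only.
 */
interface WSDisambiguator {
    /** Returns a sense key for the token at targetIndex, or null if unknown. */
    String disambiguate(String[] tokens, int targetIndex);
}

/** A trivial most-frequent-sense baseline backing the interface. */
class MostFrequentSenseDisambiguator implements WSDisambiguator {
    private final Map<String, String> firstSense = new HashMap<>();

    MostFrequentSenseDisambiguator(Map<String, List<String>> inventory) {
        // WordNet-style inventories list senses by frequency; take the first.
        inventory.forEach((lemma, senses) -> {
            if (!senses.isEmpty()) {
                firstSense.put(lemma, senses.get(0));
            }
        });
    }

    @Override
    public String disambiguate(String[] tokens, int targetIndex) {
        return firstSense.get(tokens[targetIndex].toLowerCase());
    }
}

public class WsdInterfaceDemo {
    public static void main(String[] args) {
        Map<String, List<String>> inv = new HashMap<>();
        inv.put("bank", List.of("bank#river", "bank#finance"));
        WSDisambiguator wsd = new MostFrequentSenseDisambiguator(inv);
        System.out.println(wsd.disambiguate(new String[]{"the", "bank"}, 1)); // prints bank#river
    }
}
```

A single interface of this shape could plausibly front all the supervised and knowledge-based implementations discussed in the thread, with each technique supplied as a different implementation behind it.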