Re: Model to detect the gender
Hi,

Sorry for my late reply. I didn't understand your last email well, but here is what I meant. Suppose you have a simple dictionary with the following columns:

  Name    Type   Gender
  Agatha  First  F
  John    First  M
  Smith   Both   B

where:
- "First" refers to a first name, "Last" (not in the example) refers to a last name, and "Both" means the name can be either.
- "F" refers to female, "M" refers to male, and "B" refers to both genders.

Now consider the following two sentences:

1. "It was nice meeting you John. I hope we meet again soon."
2. "Yes, I met Mrs. Smith. I asked her her opinion about the case and felt she knows something."

In the first example, the dictionary lookup tells you that "John" is a male name, so there is no need to go any further. In the second example, however, the name "Smith", which here is a family name, can fit both males and females. Therefore, we need to extract features from the surrounding context and perform a classification task. Here are some of the features I think would be interesting to use:

F1.  Presence of a male title (e.g., "Mr.") before the name. Values = {True, False}
F2.  Presence of a female title (e.g., "Mrs.") before the name. Values = {True, False}
F3.  Gender of the first personal pronoun (subject or object form) to the right of the name. Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
F4.  Distance between the name and the first personal pronoun to the right (in words). Values = NUMERIC
F5.  Gender of the second personal pronoun to the right of the name. Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
F6.  Distance between the name and the second personal pronoun to the right (in words). Values = NUMERIC
F7.  Gender of the third personal pronoun to the right of the name. Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
F8.  Distance between the name and the third personal pronoun to the right (in words). Values = NUMERIC
F9.  Gender of the first personal pronoun (subject or object form) to the left of the name. Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
F10. Distance between the name and the first personal pronoun to the left (in words). Values = NUMERIC
F11. Gender of the second personal pronoun to the left of the name. Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
F12. Distance between the name and the second personal pronoun to the left (in words). Values = NUMERIC
F13. Gender of the third personal pronoun to the left of the name. Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
F14. Distance between the name and the third personal pronoun to the left (in words). Values = NUMERIC

For the second sentence, the feature values would be:

F1 = False, F2 = True, F3 = UNCERTAIN, F4 = 1, F5 = FEMALE, F6 = 3, F7 = FEMALE, F8 = 4, F9 = UNCERTAIN, F10 = 2, F11 = EMPTY, F12 = 0, F13 = EMPTY, F14 = 0

Of course, the choice of features depends on the type of data, and these particular features might not work well for some texts, such as ones collected from Twitter. I hope this helps.

Best regards,
Mondher

On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta <damianopo...@gmail.com> wrote:
> Hi Mondher,
> could you give me a raw example to understand how I should train the
> classifier model?
>
> Thank you in advance!
> Damiano
>
> 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <mondher.bouaz...@gmail.com>:
> >
> > Hi,
> >
> > I would recommend a hybrid approach where, in a first step, you use a plain
> > dictionary and then perform the classification only if needed.
> >
> > It's straightforward, but I think it would perform better than just
> > performing a classification task.
> >
> > In the first step you use a dictionary of names along with an attribute
> > specifying whether the name fits males, females or both. If the name fits
> > males or females exclusively, there is no need to go any further.
> >
> > If the name fits both genders, or is a family name, etc., a second step
> > is needed where you extract features from the context (surrounding words,
> > etc.) and perform a classification task using any machine learning
> > algorithm.
> >
> > Another way would be to use the information itself (whether the name fits
> > males, females or both) as a feature when you perform the classification.
> >
> > Best regards,
> >
> > Mondher
> >
> > On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <damianopo...@gmail.com>
> > wrote:
> >
> > > Awesome! Thank you so much William!
> > >
> > > 2016-06-29 13:36 GMT+02:00 Will
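None of the emails above include code, so here is a minimal plain-Java sketch of the two-step idea they describe: dictionary lookup first, then contextual features (only F1-F4 from the list above) when the name is ambiguous. The class name, the title and pronoun sets, and the feature encoding are illustrative assumptions, not something from the original thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the hybrid approach: dictionary lookup, then contextual features
// (F1-F4 only). All sets and names below are illustrative assumptions.
public class GenderSketch {

    static final Map<String, String> DICT = new HashMap<>();
    static {
        DICT.put("Agatha", "F");
        DICT.put("John", "M");
        DICT.put("Smith", "B"); // fits both genders -> needs the classifier step
    }

    static final List<String> MALE_TITLES = Arrays.asList("Mr.");
    static final List<String> FEMALE_TITLES = Arrays.asList("Mrs.", "Ms.", "Miss");
    static final List<String> MALE_PRON = Arrays.asList("he", "him", "his");
    static final List<String> FEMALE_PRON = Arrays.asList("she", "her", "hers");
    static final List<String> UNCERTAIN_PRON = Arrays.asList("i", "you", "they", "them");

    // Step 1: if the dictionary entry is unambiguous, no classification is needed.
    static boolean needsClassification(String name) {
        String g = DICT.get(name);
        return !("M".equals(g) || "F".equals(g));
    }

    // F1/F2: presence of a male/female title immediately before the name.
    static boolean titleBefore(List<String> toks, int nameIdx, List<String> titles) {
        return nameIdx > 0 && titles.contains(toks.get(nameIdx - 1));
    }

    // F3/F4: gender of the first personal pronoun to the right of the name and
    // its distance in words, encoded e.g. as "FEMALE:3"; "EMPTY:0" if none.
    static String firstPronounRight(List<String> toks, int nameIdx) {
        for (int i = nameIdx + 1; i < toks.size(); i++) {
            String t = toks.get(i).toLowerCase();
            int d = i - nameIdx;
            if (MALE_PRON.contains(t)) return "MALE:" + d;
            if (FEMALE_PRON.contains(t)) return "FEMALE:" + d;
            if (UNCERTAIN_PRON.contains(t)) return "UNCERTAIN:" + d;
        }
        return "EMPTY:0";
    }
}
```

On the second example sentence, with punctuation not counted as words, this sketch reproduces the values given above: F1 = False, F2 = True, F3 = UNCERTAIN, F4 = 1.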
Re: Model to detect the gender
Hi,

I would recommend a hybrid approach where, in a first step, you use a plain dictionary and then perform the classification only if needed. It's straightforward, but I think it would perform better than just performing a classification task.

In the first step you use a dictionary of names along with an attribute specifying whether the name fits males, females or both. If the name fits males or females exclusively, there is no need to go any further.

If the name fits both genders, or is a family name, etc., a second step is needed where you extract features from the context (surrounding words, etc.) and perform a classification task using any machine learning algorithm.

Another way would be to use the information itself (whether the name fits males, females or both) as a feature when you perform the classification.

Best regards,
Mondher

On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta wrote:
> Awesome! Thank you so much William!
>
> 2016-06-29 13:36 GMT+02:00 William Colen :
> >
> > To create a NER model, OpenNLP extracts features from the context, things
> > such as: word prefix and suffix, next word, previous word, previous word
> > prefix and suffix, next word prefix and suffix, etc.
> > When you don't configure the feature generator, it will apply the default:
> >
> > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> >
> > Default feature generator:
> >
> > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> >     new AdaptiveFeatureGenerator[]{
> >         new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> >         new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> >         new OutcomePriorFeatureGenerator(),
> >         new PreviousMapFeatureGenerator(),
> >         new BigramNameFeatureGenerator(),
> >         new SentenceFeatureGenerator(true, false)
> >     });
> >
> > These default features should work for most cases (especially English), but
> > they can of course be extended. If you do so, your model will take the new
> > features into account. So yes, you are putting the features in your model.
> >
> > Configuring custom features is not easy. I would start with the default,
> > use 10-fold cross-validation, and take notes of its effectiveness. Then
> > change/add a feature, evaluate, and take notes again. Sometimes a feature
> > that we are sure would help can destroy the model's effectiveness.
> >
> > Regards,
> > William
> >
> > 2016-06-29 7:00 GMT-03:00 Damiano Porta :
> >
> > > Thank you William! Really appreciated!
> > >
> > > I only do not get one point: when you said "You could increment your
> > > model using Custom Feature Generators", does it mean that I can "put"
> > > these features inside ONE .bin file (model) that implements different
> > > things, or is the name finder one thing and those feature generators
> > > another?
> > >
> > > Thank you in advance for the clarification.
> > >
> > > 2016-06-29 1:23 GMT+02:00 William Colen :
> > >
> > > > Not exactly. You would create a new NER model to replace yours.
> > > >
> > > > In this approach you would need a corpus like this:
> > > >
> > > > <START:person> Pierre Vinken <END> , 61 years old , will join the board
> > > > as a nonexecutive director Nov. 29 .
> > > > Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the
> > > > Dutch publishing group . <START:person> Jessie Robson <END> is
> > > > retiring , she was a board member for 5 years .
> > > >
> > > > I am not a native English speaker, so I am not sure if the example is
> > > > clear enough. I tried to use Jessie as a gender-neutral name and "she"
> > > > as the disambiguation.
> > > >
> > > > With a corpus big enough, maybe you could create a model that outputs
> > > > both classes, personMale and personFemale. To train a model you can
> > > > follow:
> > > >
> > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > >
> > > > Let's say your results are not good enough. You could then improve your
> > > > model using Custom Feature Generators (
> > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > and
> > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > ).
> > > >
> > > > One of the implemented feature generators can take a dictionary (
> > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > ).
> > > > You can also implement other convenient FeatureGenerators, for instance
> > > > one based on regexes.
> > > >
> > > > Again, it is just a wild guess of how to implement
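To make the feature-generator discussion concrete, here is a plain-Java sketch of the kind of string features a dictionary- or regex-style generator emits for one token. This is not the actual OpenNLP `AdaptiveFeatureGenerator` API; the class name, feature-name strings, and the word list are invented for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of dictionary- and regex-style features for a single token, in the
// spirit of OpenNLP's DictionaryFeatureGenerator. Names here are invented.
public class TokenFeatures {

    // toy dictionary; a real one would be loaded from a resource file
    static final Set<String> FEMALE_NAMES =
            new HashSet<>(Arrays.asList("Agatha", "Jessie"));

    static List<String> featuresFor(String[] tokens, int i) {
        List<String> feats = new ArrayList<>();
        String tok = tokens[i];
        if (FEMALE_NAMES.contains(tok)) feats.add("dict=femaleName"); // dictionary hit
        if (tok.matches("[A-Z][a-z]+")) feats.add("shape=initCap");   // regex/shape
        if (i > 0) feats.add("prev=" + tokens[i - 1]);                // context window
        if (i + 1 < tokens.length) feats.add("next=" + tokens[i + 1]);
        return feats;
    }
}
```

A real generator would contribute such strings to the classifier alongside the default window and token-class features quoted above.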
Re: Performances of OpenNLP tools
Hi,

Thank you for your replies. Jeffrey, please accept once more my apologies for the duplicate email.

I also think it would be great to have such studies on the performance of OpenNLP. I have been looking for this information and checked in many places, including, obviously, Google Scholar, and I haven't found any serious studies or reliable results. Most of the existing ones report the performance of outdated releases of OpenNLP, and focus more on execution time or CPU/RAM consumption, etc.

I think such a comparison would help not only to evaluate the overall accuracy, but also to highlight the issues with the existing models (as a matter of fact, the existing models fail to recognize many of the hashtags in tweets: the tokenizer splits them into the "#" symbol and a word that the PoS tagger then also fails to tag correctly). Therefore, building Twitter-based models would also be useful, since many of the works in academia and industry focus on Twitter data.

Best regards,
Mondher

On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <jasonbaldri...@gmail.com> wrote:
> It would be fantastic to have these numbers. This is an example of
> something that would be a great contribution by someone trying to
> contribute to open source and who is maybe just getting into machine
> learning and natural language processing.
>
> For Twitter-ish text, it'd be great to look at models trained and evaluated
> on the Tweet NLP resources:
>
> http://www.cs.cmu.edu/~ark/TweetNLP/
>
> And comparing to how their models performed, etc. Also, it's worth looking
> at spaCy (Python NLP modules) for further comparisons.
>
> https://spacy.io/
>
> -Jason
>
> On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick <jzemer...@apache.org> wrote:
>
> > I saw the same question on the users list on June 17. At least I thought it
> > was the same question -- sorry if it wasn't.
> >
> > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <
> > chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> > > Well, hold on. He sent that mail (as of the time of this mail) 4
> > > mins previously. Maybe some folks need some time to reply ^_^
> > >
> > > ++
> > > Chris Mattmann, Ph.D.
> > > Chief Architect
> > > Instrument Software and Science Data Systems Section (398)
> > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > Office: 168-519, Mailstop: 168-527
> > > Email: chris.a.mattm...@nasa.gov
> > > WWW: http://sunset.usc.edu/~mattmann/
> > > ++
> > > Director, Information Retrieval and Data Science Group (IRDS)
> > > Adjunct Associate Professor, Computer Science Department
> > > University of Southern California, Los Angeles, CA 90089 USA
> > > WWW: http://irds.usc.edu/
> > > ++
> > >
> > > On 6/20/16, 8:23 AM, "Jeffrey Zemerick" <jzemer...@apache.org> wrote:
> > >
> > > >Hi Mondher,
> > > >
> > > >Since you didn't get any replies, I'm guessing no one is aware of any
> > > >resources related to what you need. Google Scholar is a good place to look
> > > >for papers referencing OpenNLP and its methods (in case you haven't
> > > >searched it already).
> > > >
> > > >Jeff
> > > >
> > > >On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <
> > > >mondher.bouaz...@gmail.com> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> Apologies if you received multiple copies of this email. I sent it to the
> > > >> users list a while ago, and haven't had an answer yet.
> > > >>
> > > >> I have been looking for a while for any relevant work that performed
> > > >> tests on the OpenNLP tools (in particular the Lemmatizer,
> > > >> Tokenizer and PoS-Tagger) when used with short and noisy texts such as
> > > >> Twitter data, etc., and/or compared them to other libraries.
> > > >>
> > > >> By performance, I mean accuracy/precision rather than time of execution,
> > > >> etc.
> > > >> > > > >> If anyone can refer me to a paper or a work done in this context, > that > > > >> would be of great help. > > > >> > > > >> Thank you very much. > > > >> > > > >> Mondher > > > >> > > > > > >
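The hashtag problem described earlier in this thread (the tokenizer splitting "#word" into "#" plus "word") is easy to reproduce with a naive punctuation-splitting tokenizer. This is a sketch of the failure mode, not OpenNLP's actual tokenizer:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of why punctuation-based tokenization mangles hashtags: "#OpenNLP"
// becomes the two tokens "#" and "OpenNLP", and a PoS tagger trained on
// newswire then sees neither token as a hashtag.
public class HashtagSplit {
    static List<String> tokenize(String text) {
        // split on whitespace after breaking "#" and "@" off their words
        return Arrays.asList(text.replaceAll("([#@])", "$1 ").split("\\s+"));
    }
}
```

For example, `tokenize("great results with #OpenNLP")` yields the five tokens `great`, `results`, `with`, `#`, `OpenNLP`, which is exactly the behavior Mondher complains about; a Twitter-aware tokenizer would keep `#OpenNLP` whole.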
Performances of OpenNLP tools
Hi,

Apologies if you received multiple copies of this email. I sent it to the users list a while ago, and haven't had an answer yet.

I have been looking for a while for any relevant work that performed tests on the OpenNLP tools (in particular the Lemmatizer, Tokenizer and PoS-Tagger) when used with short and noisy texts such as Twitter data, etc., and/or compared them to other libraries.

By performance, I mean accuracy/precision rather than time of execution, etc.

If anyone can refer me to a paper or a work done in this context, that would be of great help.

Thank you very much.

Mondher
Re: GSoC 2016: OpenNLP Sentiment Analysis
Hi,

I am sorry for my late reply. Given the time difference between Japan and the USA, I think I won't be available on weekdays. I will be available only on Friday/Saturday morning (9-10am EST). I am not sure if Chris is OK with that; we had our previous meetings on Saturday mornings. Otherwise, please go ahead. I will join as soon as I can.

Thanks.

@Chris: my github ID is mondher-bouazizi

Best regards,
Mondher

On Mon, Apr 25, 2016 at 1:44 AM, Anastasija Mensikova <mensikova.anastas...@gmail.com> wrote:
> Hi Anthony,
>
> I can make it by Madhawa's proposal too, after 6pm IST on Tuesday (after
> 8:30am EST). Let me know when exactly!
>
> Thank you,
> Anastasija
>
> On 24 April 2016 at 03:02, Anthony Beylerian <anthony.beyler...@gmail.com> wrote:
>
>> Hi Anastasija,
>>
>> I'm not available at those times (00-07 JST). I could make it by
>> Madhawa's proposal, but otherwise please go ahead; we may discuss some
>> other time.
>>
>> @Chris: github ID: beylerian
>>
>> Best,
>>
>> Anthony
>>
>> Please find my github profile: https://github.com/madhawa-gunasekara
>>
>> Madhawa
>>
>> On Sun, Apr 24, 2016 at 12:13 AM, Madhawa Kasun Gunasekara <
>> madhaw...@gmail.com> wrote:
>>
>> > Hi Chris,
>> >
>> > I'm available on Tuesday & Wednesday after 6.00 pm IST.
>> >
>> > Thanks,
>> > Madhawa
>> >
>> > On Sat, Apr 23, 2016 at 11:38 PM, Anastasija Mensikova <
>> > mensikova.anastas...@gmail.com> wrote:
>> >
>> >> Hi Chris,
>> >>
>> >> Thank you very much for your email. I'm so excited to work with you!
>> >>
>> >> My Github name is amensiko.
>> >>
>> >> And yes, next week sounds good! I'm available on: Tuesday at 4:20pm EST,
>> >> Thursday 11am - 2:30pm and 4:20 - 6pm EST, Friday 11am - 3pm EST.
>> >>
>> >> Thank you,
>> >> Anastasija
>> >>
>> >> On 23 April 2016 at 10:21, Mattmann, Chris A (3980) <
>> >> chris.a.mattm...@jpl.nasa.gov> wrote:
>> >>
>> >>> Hi Anastasija,
>> >>>
>> >>> Hope you are well. It's now time to get started on the project.
>> >>> Mondher, Anthony, Madhawa and I have been discussing ideas about
>> >>> how to proceed with the project and even developing a task list.
>> >>> Let's get your tasks input into that list, and also coordinate.
>> >>>
>> >>> I also have an action to share some Spanish/English data to try
>> >>> and do cross-lingual sentiment analysis.
>> >>>
>> >>> Are you available to chat this week?
>> >>>
>> >>> Cheers,
>> >>> Chris
>> >>>
>> >>> On 4/23/16, 4:49 AM, "Anthony Beylerian" <anthony.beyler...@gmail.com> wrote:
>> >>>
>> >>> >Hello,
>> >>> >
>> >>> >Congratulations on being accepted for this year's GSoC.
>> >>> >Although Mondher and myself will not participate this year as students, we
>> >>> >will do our best to help.
>> >>> >We are currently busy with academic research, but will join the efforts
>> >>> >when possible.
>> >>> >Otherwise, for any discussion concerning the proposed approaches, please
>> >>> >let us know.
>> >>> >
>> >>> >Best,
>> >>> >
>> >>> >On Sat, Apr 23, 2016 at 6:02 PM, Madhawa Kasun Gunasekara <
>> >>> >madhaw...@gmail.com> wrote:
>> >>> >
>> >>> >> Sure we will start working on this.
>> >>> >> >> >>> >> Thanks, >> >>> >> Madhawa >> >>> >> >> >>> >> Madhawa >> >>> >> >> >>> >> On Sat, Apr 23, 2016 at 1:38 AM, Chris Mattmann < >> mattm...@apache.org> >> >>> >> wrote: >> >>> >> >> >>> >>> Congrats! >> >>> >>> >> >>> >>> time to get started team. >> >>> >>> >> >>> >> >> >> >> >> > >> > >
Re: GSOC2016 Sentiment Analysis
Dear Madhawa,

Thank you for your interest in the proposals.

The current tasks we proposed refer to classification and quantification regardless of the topic. This can be used in a larger context where the topic is not specified, or not unique, in which case we will need to identify the topic(s). Therefore, a topic detector would be a good idea to implement in order to complement this.

As for the Document Categorizer, it is a general-purpose component with basic features (n-grams, bag of words, etc.). It is basically used for the classification of texts into a set of classes defined by the user, whether they are sentiment classes or other. However, it doesn't perform well for this specific purpose. Furthermore, the sentiment analysis component would not just perform the naive classification, but would also handle additional tasks (e.g., quantification) and implement more specific and sophisticated approaches.

Please share your thoughts.

Mondher

On Tue, Mar 29, 2016 at 1:51 PM, Madhawa Kasun Gunasekara <
madhaw...@gmail.com> wrote:
> Hi Chris / Antony
>
> Yes, I would like to work on this. This proposal addresses most of the things
> in sentiment analysis. AFAIK most people use the OpenNLP Document Categorizer
> for sentiment analysis, since there isn't proper functionality to do sentiment
> analysis in OpenNLP. It would be great if we could add this feature to the
> OpenNLP project, and I would also like to suggest that we should be able to
> detect the target object of the opinions with this feature as well.
>
> WDYT ??
>
> Thanks,
> Madhawa
>
> On Tue, Mar 29, 2016 at 2:11 AM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Dear Anthony,
>>
>> Great! These both sound like fantastic proposals and I'm happy
>> to be a mentor. Madhawa, would you like to join in on these
>> efforts?
>>
>> Cheers,
>> Chris
>>
>> ++
>> Chris Mattmann, Ph.D.
>>
>> -----Original Message-----
>> From: Anthony Beylerian
>> Date: Monday, March 28, 2016 at 11:48 AM
>> To: "dev@opennlp.apache.org", "mondher.bouaz...@gmail.com"
>> Cc: Madhawa Kasun Gunasekara, jpluser
>> Subject: RE: GSOC2016 Sentiment Analysis
>>
>> >Dear Chris,
>> >
>> >Thank you for starting the discussion.
>> >We are glad there is an interest in a sentiment analysis component.
>> >
>> >My colleague Mondher posted the two JIRA issues related to sentiment
>> >analysis [1][2] as references for our proposals [3][4] for GSoC.
>> >In fact, we have been researching this topic at our university.
>> >We are hoping to participate this year and work on integrating both a
>> >sentiment classifier and a quantifier for the library.
>> >
>> >It would be nice to also have an interface with Tika; maybe we can
>> >collaborate?
>> >We are also looking for mentors, in case someone is willing to support
>> >our proposals.
>> >
>> >Best,
>> >
>> >Anthony
>> >
>> >[1] https://issues.apache.org/jira/browse/OPENNLP-842
>> >[2] https://issues.apache.org/jira/browse/OPENNLP-840
>> >[3] https://docs.google.com/document/d/1nVnwpmGaOnwHERXr55IClE4V87jUX2sva-mkgWnR8n0/edit?usp=sharing
>> >[4] https://docs.google.com/document/d/1x02II9W3rirtuSbx_sY8kOQZSgOp0SIKeIWTCXEOJvo/edit?usp=sharing
>> >
>> >> From: chris.a.mattm...@jpl.nasa.gov
>> >> To: nishant@gmail.com
>> >> CC: dev@opennlp.apache.org; madhaw...@gmail.com; hmanj...@usc.edu; kamal...@usc.edu
>> >> Subject: Re: GSOC2016 Sentiment Analysis
>> >> Date: Sun, 27 Mar 2016 19:34:24 +
>> >>
>> >> No problem - I just wanted to encourage discussion. Thank you for
>> >> your prompt and courteous replies.
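The n-gram / bag-of-words features attributed to the Document Categorizer earlier in this thread can be sketched in plain Java as follows. The feature-name prefixes (`bow=`, `ng=`) are invented for illustration; they are not OpenNLP's actual feature strings.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of basic bag-of-words and bigram features, the kind of general-purpose
// features the Document Categorizer is described as using above.
public class BowFeatures {
    static List<String> extract(String[] tokens) {
        List<String> feats = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            feats.add("bow=" + tokens[i].toLowerCase());            // unigrams
            if (i + 1 < tokens.length) {
                feats.add("ng=" + tokens[i].toLowerCase() + ":"     // bigrams
                        + tokens[i + 1].toLowerCase());
            }
        }
        return feats;
    }
}
```

Features like these ignore negation, intensifiers, and target objects, which is one reason a dedicated sentiment component (classification plus quantification) is being proposed instead.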
Re: WSD - Supervised techniques
Dear all,

Thank you Anthony for the detailed explanation.

Regarding the parser/converter classes that Anthony mentioned, I think it would be a better idea to make an independent component in OpenNLP that processes Semcor data. NLTK [1], for example, a Python library for natural language processing, contains a component to read Semcor data [2], which can be used by the other components (not only the WSD one).

For now, I am using MIT JSemcor [3] (which is MIT licensed) to read Semcor files, but as soon as I finish the implementation of the remaining parts of IMS (all-words WSD [4] / coarse-grained vs. fine-grained), I'll implement our own Semcor reader.

On the other hand, for now, I will clean the code of IMS and make it independent from the format of the source of training data. All data will pass through a connector. I will run the first tests using Semcor data: the evaluator will use Semcor data for training, and the data collected from Senseval-3 for testing (to compare the different approaches implemented).

Also, please watch the issues ([4]-[9]) so you can get updates each time we add a patch for a component.

Thanks.

Best regards,
Mondher

[1] http://www.nltk.org/api/nltk.html
[2] http://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.semcor
[3] http://projects.csail.mit.edu/jsemcor/
[4] https://issues.apache.org/jira/browse/OPENNLP-797
[5] https://issues.apache.org/jira/browse/OPENNLP-789
[6] https://issues.apache.org/jira/browse/OPENNLP-790
[7] https://issues.apache.org/jira/browse/OPENNLP-794
[8] https://issues.apache.org/jira/browse/OPENNLP-795
[9] https://issues.apache.org/jira/browse/OPENNLP-796

On Tue, Jul 14, 2015 at 1:54 AM, Anthony Beylerian <anthonybeyler...@hotmail.com> wrote:

Dear Rodrigo,

Thank you for the feedback. I have added issues [1][2][3] regarding the below.

Concerning the testers (IMSTester etc.), they should be in src/test/java/.
We can add docs in those to explain how to use each implementation.
Actually, I am using the parser for Senseval-3 that Mondher mentioned in [LeskEvaluatorTest]; the functionality was included in DataExtractor. I believe it would be best to separate that and have two parser/converter classes of the sort: disambiguator.reader.SemCorReader and disambiguator.reader.SensevalReader. That should be clearer, what do you think?

Anthony

[1]: https://issues.apache.org/jira/browse/OPENNLP-794
[2]: https://issues.apache.org/jira/browse/OPENNLP-795
[3]: https://issues.apache.org/jira/browse/OPENNLP-796

From: rage...@apache.org
Date: Mon, 13 Jul 2015 15:50:00 +0200
Subject: Re: WSD - Supervised techniques
To: dev@opennlp.apache.org

Hello,

There has been little public activity these last days. We believe that it is very important to step up in several directions with respect to what is already committed in svn:

1. Finishing the WSDEvaluator.
2. Providing the classes required to run the WSD tools from the CLI as any other component.
3. Formats: it would be interesting to have at least a converter for the most common datasets used for evaluation and training, e.g., Semcor and Senseval-3. You have mentioned that a converter was already implemented, but I cannot find it in svn.
4. Writing the documentation so that future users (and other dev members here) can test the component.

These comments are general for both unsupervised and supervised WSD. Specific to supervised WSD:

5. IMS: you mention in your previous email that the lexical sample part is done and that you need to finish the all-words IMS implementation. If this is the case, a JIRA issue should be opened about it and made a priority. Incidentally, I cannot find the IMSTester you mentioned in the email. There is an issue already there for the Evaluator (OPENNLP-790), but I think that each of the remaining tasks requires its own JIRA issue (this issue has pending unused imports, variables and other things).
The aim before GSoC ends should be to give the WSD component the best chance of being a good candidate for integration into the OpenNLP tools. Also, by being able to test it, we can see the actual state of the component with respect to performance on the usual datasets.

Can you please create such issues in JIRA and start addressing them separately?

Thanks,

Rodrigo

On Sun, Jun 28, 2015 at 6:33 PM, Mondher Bouazizi <mondher.bouaz...@gmail.com> wrote:

Hi everyone,

I finished the first iteration of the IMS approach for lexical sample disambiguation. Please find the patch uploaded on the JIRA issue [1]. I also created a tester (IMSTester) to run it.

As I mentioned before, the approach is as follows: each time the module is called to disambiguate a word, it first checks if the model file for that word exists.

1. If the model file exists, it is used to disambiguate the word.
2. Otherwise, if the file does not exist, the module checks
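For the Senseval reader discussed in this thread, here is a minimal sketch of pulling the target word out of a lexical-sample context. The snippet format is a simplified approximation of the Senseval lexical-sample XML (a `<head>` element marking the word to disambiguate), not the exact schema, and the class name is illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of extracting the target word from a Senseval-style lexical-sample
// snippet. A real SensevalReader would parse the full XML instance/context
// structure; this only illustrates locating the <head> element.
public class SensevalSnippet {
    static final Pattern HEAD = Pattern.compile("<head>(.*?)</head>");

    // returns the word to disambiguate, or null if no <head> element is found
    static String headWord(String context) {
        Matcher m = HEAD.matcher(context);
        return m.find() ? m.group(1) : null;
    }
}
```

Separating this kind of logic into dedicated SemCorReader/SensevalReader classes, as proposed above, keeps the disambiguator itself independent of the corpus format.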
Re: WSD - Supervised techniques
Hi everyone,

I finished the first iteration of the IMS approach for lexical sample disambiguation. Please find the patch uploaded on the JIRA issue [1]. I also created a tester (IMSTester) to run it.

As I mentioned before, the approach is as follows: each time the module is called to disambiguate a word, it first checks if the model file for that word exists.

1. If the model file exists, it is used to disambiguate the word.
2. Otherwise, if the file does not exist, the module checks if the training data file for that word exists. If it does, the XML data file is used to train the model and create the model file.
3. If no training data exist, the most frequent sense (MFS) in WordNet is returned.

For now I am using the training data I collected from the Senseval and Semeval websites. However, I am currently checking Semcor to use it as a main reference.

Yours sincerely,
Mondher

[1] https://issues.apache.org/jira/browse/OPENNLP-757

On Thu, Jun 25, 2015 at 5:27 AM, Joern Kottmann <kottm...@gmail.com> wrote:

On Fri, 2015-06-19 at 21:42 +0900, Mondher Bouazizi wrote:

> Hi,
>
> Actually I have finished the implementation of most of the parts of the
> IMS approach. I also made a parser for the Senseval-3 data. However, I am
> currently working on two main points:
>
> - I am trying to figure out how to use the MaxEnt classifier.
> Unfortunately there is not enough documentation, so I am trying to see
> how it is used by the other components of OpenNLP. Any recommendations?

Yes, have a look at the doccat component. It should be easy to understand from it how it works. The classifier has to be trained with events (outcome and features) and can then classify a set of features into the categories it has seen before as outcomes.

Jörn
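The three-step fallback described above can be sketched as follows. In-memory maps stand in for the per-word model and training-data files on disk, and all names are illustrative; the real IMS patch trains an actual classifier rather than a placeholder string.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the IMS fallback cascade: use a trained model if one exists,
// else train one from available data, else back off to the most frequent
// sense (MFS) in WordNet. Maps stand in for files on disk.
public class ImsCascade {
    Map<String, String> models = new HashMap<>();       // word -> trained model
    Map<String, String> trainingData = new HashMap<>(); // word -> training file

    String disambiguate(String word) {
        if (models.containsKey(word)) {                 // step 1: model exists
            return "classified-with:" + models.get(word);
        }
        if (trainingData.containsKey(word)) {           // step 2: train, then use
            models.put(word, "model-from:" + trainingData.get(word));
            return "classified-with:" + models.get(word);
        }
        return "mfs";                                   // step 3: WordNet MFS
    }
}
```

Note that step 2 caches the freshly trained model, so subsequent calls for the same word fall into step 1.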
Re: GSoC 2015 - WSD Module
Dear Rodrigo,

As Anthony mentioned in his previous email, I already started the implementation of the IMS approach. The pre-processing and the extraction of features have already been finished.

Regarding the approach itself, it shows some potential according to the authors, though the proposed features are few and basic. I think the approach might be enhanced if we add more context-specific features from some other approaches. (To do that, I need to run many experiments using different combinations of features; however, that should not be a problem.) But the approach requires a linear SVM classifier, and as far as I know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use libsvm?

Regarding the training data, I started collecting some from different sources. Most of the existing rich corpora are licensed (including the ones mentioned in the paper). The free ones I have for now are from the Senseval and Semeval websites. However, these are used only to evaluate the methods proposed in the workshops. Therefore, the words to disambiguate are few in number, though the training data for each word are rich enough. In any case, the first tests with the collected Senseval and Semeval data should be finished soon. However, I am not sure there is a rich enough dataset we can use to build the model for the WSD module in the OpenNLP library. If you have any recommendations, I would be grateful for your help on this point.

On the other hand, we're cleaning our implementation of the different variations of Lesk. However, we are currently using JWNL. If there are no objections, we will migrate to extJWNL.

As Jörn mentioned sending an initial patch: should we separate our code and upload two different patches to the two issues we created on the JIRA (however, this means a lot of redundancy in the code), or shall we keep them in one project and upload that? If we opt for the latter, which issue should we upload the patch to?
Thanks,

Mondher, Anthony

On Mon, Jun 8, 2015 at 7:51 PM, Rodrigo Agerri <rage...@apache.org> wrote:

Hello,

+1 for using extJWNL instead of JWNL; I use it in some other projects too and it is very nice IMHO.

R

On Sat, Jun 6, 2015 at 12:55 PM, Aliaksandr Autayeu <aliaksa...@autayeu.com> wrote:

Thinking of impartiality... Anyway, I'm the author of extJWNL in case you have questions.

Aliaksandr

On 6 June 2015 at 11:43, Richard Eckart de Castilho <richard.eck...@gmail.com> wrote:

On 05.06.2015, at 14:24, Anthony Beylerian <anthonybeyler...@hotmail.com> wrote:

> So just to make sure, we are currently relying on JWNL to access WordNet as a resource.

There is a more modern fork of JWNL available called http://extjwnl.sourceforge.net . It includes provisions for loading WordNet from the classpath, e.g. from Maven dependencies. It might be a nice replacement for JWNL and is also licensed under the BSD license. Pre-packaged WordNet Maven artifacts are also available.

Cheers,

-- Richard
Re: GSoC 2015 - WSD Module
Hi all,

Thanks Rodrigo for the feedback. I don't mind starting with the IMS implementation as a first supervised solution; it seems to be a good first step. As for SST, I will read more about it and let you know. On the other hand, how about the following interface Anthony and I prepared based on Jörn's recommendation? We tried to stay as close as possible to the other tools already implemented. Link: https://drive.google.com/file/d/0B7ON7bq1zRm3NTI1bGFfc3lZX0U/view?usp=sharing

Best regards, Mondher, Anthony

On Fri, May 22, 2015 at 9:59 PM, Rodrigo Agerri rage...@apache.org wrote: Hello Mondher (my response is about supervised WSD), Thanks for the info, it is quite interesting. Apart from the comment by Jörn, which I think is very important if we want to achieve something given the time constraints of the GSoC, I have a couple of recommendations/comments from my part:

1. Rather than targeting the Lexical Sample task or all-words WSD, I think it would be more practical to choose an approach/algorithm and implement it in OpenNLP. One of the most popular approaches (if not the most popular) is the It Makes Sense (IMS) system: http://www.comp.nus.edu.sg/~nlp/sw/README.txt https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf That, I think, is achievable in the GSoC time frame.

2. As an aside, research has been moving towards supersense tagging (SST), given the difficulty of WSD: http://ttic.uchicago.edu/~altun/pubs/CiaAlt_EMNLP06.pdf As you can see in the above paper, SST is approached as a sequence labelling task rather than as classification. This means we could reimplement the Ciaramita and Altun (2006) features by implementing AdaptiveFeatureGenerators and creating a module structurally similar to the NameFinder, but for SST. This also has the advantage of letting us move beyond the old Semcor and Senseval datasets to current Tweet datasets and so on.
See this recent paper on SST on tweets: http://aclweb.org/anthology/S14-1001 I think that for supervised WSD we should pursue option 1 or 2 and start defining the interface as Jörn has suggested. Best, Rodrigo

On Mon, May 18, 2015 at 2:14 PM, Anthony Beylerian anthonybeyler...@hotmail.com wrote: Dear all, In the context of building a Word Sense Disambiguation (WSD) module, after doing a survey of WSD techniques, we noted the following points:

- WSD techniques can be split into three sets (supervised, unsupervised/knowledge-based, hybrid).
- WSD is used for different directly related objectives, such as all-words disambiguation, lexical sample disambiguation, multi/cross-lingual approaches, etc.
- Senseval/Semeval seem to be good references for comparing different WSD techniques, since many of them were tested on the same data (though different data at each event).
- As a first solution, we propose to start with supporting the lexical sample type of disambiguation, i.e. disambiguating a single word or a limited set of words from an input text.

Therefore, we have decided to collect information about the different techniques in the literature (such as references, performance, parameters, etc.) in this spreadsheet here. We have also collected the results of all the Senseval/Semeval exercises here. (Note that each document has many sheets.) The collected results could help decide which techniques to start with as main models for each set of techniques (supervised/unsupervised). We also propose a general approach for the package in the attached figure. The main components are as follows:

1- The different publicly available resources: WordNet, BabelNet, Wikipedia, etc. However, we would also like to allow users to use their own local resources, perhaps by defining a type of connector to the resource interface.
2- The resource interface will provide both a sense inventory that the user can query and a knowledge base (semantic or syntactic information, etc.) that might be used depending on the technique. We might later consider building a local cache for remote services.

3- The WSD algorithms/techniques themselves, which will use the resource interface to access the required resources. These techniques will be split into two main packages, as on the left side of the figure: Supervised/Unsupervised. The utils package includes common tools used by both types of techniques. The details mentioned in each package should be common to all implementations of these abstract models.

4- I/O could be processed in different formats (XML/JSON, etc.) or a simpler structure, following your recommendations.

If you have any suggestions or recommendations, we would really appreciate discussing them and would like your guidance in iterating on this tool-set. Best regards, Anthony Beylerian, Mondher Bouazizi
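To make the split between the resource interface (point 2) and the techniques (point 3) concrete, here is a minimal Java sketch; all interface, class, and sense-id names below are our illustrative guesses, not the actual interfaces proposed in the thread:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the "resource interface": a sense inventory the techniques
// can query, independent of the backing resource (WordNet, BabelNet,
// a local lexicon, ...). Names are illustrative.
interface SenseInventory {
    // All candidate sense identifiers for a lemma.
    List<String> getSenses(String lemma);
    // Gloss/definition text for one sense (used by Lesk-style methods).
    String getGloss(String senseId);
}

// Sketch of a disambiguator in the style of other OpenNLP tools:
// given a tokenized sentence and a target index, return a sense id.
interface Disambiguator {
    String disambiguate(String[] tokens, int index);
}

// Simple in-memory inventory; a WordNet-backed one would wrap (ext)JWNL.
class MapSenseInventory implements SenseInventory {
    private final Map<String, List<String>> senses = new HashMap<>();
    private final Map<String, String> glosses = new HashMap<>();

    void add(String lemma, String senseId, String gloss) {
        senses.computeIfAbsent(lemma, k -> new ArrayList<>()).add(senseId);
        glosses.put(senseId, gloss);
    }

    @Override
    public List<String> getSenses(String lemma) {
        return senses.getOrDefault(lemma, new ArrayList<>());
    }

    @Override
    public String getGloss(String senseId) {
        return glosses.get(senseId);
    }
}

// Trivial baseline: always pick the first listed sense, or null if the
// word is unknown to the inventory.
class FirstSenseDisambiguator implements Disambiguator {
    private final SenseInventory inventory;

    FirstSenseDisambiguator(SenseInventory inventory) {
        this.inventory = inventory;
    }

    @Override
    public String disambiguate(String[] tokens, int index) {
        List<String> candidates = inventory.getSenses(tokens[index].toLowerCase());
        return candidates.isEmpty() ? null : candidates.get(0);
    }
}
```

A supervised technique such as IMS would implement Disambiguator on top of a trained classifier, while Lesk variants would only need the SenseInventory glosses, which keeps both families behind the same two interfaces.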
Re: GSoC 2015 - WSD Module
Dear all, Sorry if you received multiple copies of this email (the links were embedded). Here are the actual links:

*Figure:* https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
*Semeval/senseval results summary:* https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
*Literature survey of WSD techniques:* https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing

Yours faithfully

On Mon, May 18, 2015 at 10:17 PM, Anthony Beylerian anthonybeyler...@hotmail.com wrote: Please excuse the duplicate email, we could not attach the mentioned figure. Kindly find it here. Thank you. From: anthonybeyler...@hotmail.com To: dev@opennlp.apache.org Subject: GSoC 2015 - WSD Module Date: Mon, 18 May 2015 22:14:43 +0900 [...]
GSoC - Self introduction
Dear all, I am Mondher Bouazizi, from Tunisia. I am a Master's student at Keio University in Japan, and my academic research currently focuses on data mining.

I am glad to inform you that my project proposal has been accepted for the Google Summer of Code 2015. The proposal is to add a Word Sense Disambiguation (WSD) component to the OpenNLP library. The objective of WSD is to determine which sense of a word is meant in a particular context. Different techniques have been proposed in the academic literature, but in general they fall into two main categories: supervised and unsupervised. In my work I will design and build a WSD module that implements the algorithms of common supervised techniques (e.g. Decision Trees, Exemplar-Based or Instance-Based Learning, etc.). My colleague Anthony, who was also accepted, will be working on the unsupervised ones. (For more details about the project, please check the issue I created here: https://issues.apache.org/jira/browse/OPENNLP-757)

I hope this work will be a good contribution to the OpenNLP project and to the open-source community in general. Yours sincerely, Mondher Bouazizi