Re: Model to detect the gender

2016-07-01 Thread Mondher Bouazizi
Hi,

Sorry for my late reply. I am not sure I fully understood your last email,
but here is what I meant:

Suppose you have a simple dictionary with the following columns:

Name     Type    Gender
Agatha   First   F
John     First   M
Smith    Both    B

where:
- "First" refers to a first name, "Last" (not in the example) to a last
name, and "Both" means the name can be either.
- "F" refers to female, "M" to male, and "B" to both genders.

and given the following two sentences:

1. "It was nice meeting you John. I hope we meet again soon."

2. "Yes, I met Mrs. Smith. I asked her her opinion about the case and felt
she knows something"

In the first example, a dictionary lookup shows that the name "John" is a
male name, so there is no need to go any further.
In the second example, however, the name "Smith", which is a family name in
our case, can fit both males and females. Therefore, we need to extract
features from the surrounding context and perform a classification task.
Here are some of the features I think would be interesting to use:

- Presence of a male title (e.g., "Mr.") before the name. Values = {True, False}
- Presence of a female title (e.g., "Mrs.") before the name. Values = {True, False}

- Gender of the first personal pronoun (subject or object form) to the
right of the name. Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
- Distance between the name and the first personal pronoun to the right (in
words). Values = NUMERIC
- Gender of the second personal pronoun to the right of the name.
Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
- Distance between the name and the second personal pronoun to the right
(in words). Values = NUMERIC
- Gender of the third personal pronoun to the right of the name.
Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
- Distance between the name and the third personal pronoun to the right (in
words). Values = NUMERIC

- Gender of the first personal pronoun (subject or object form) to the
left of the name. Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
- Distance between the name and the first personal pronoun to the left (in
words). Values = NUMERIC
- Gender of the second personal pronoun to the left of the name.
Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
- Distance between the name and the second personal pronoun to the left
(in words). Values = NUMERIC
- Gender of the third personal pronoun to the left of the name.
Values = {MALE, FEMALE, UNCERTAIN, EMPTY}
- Distance between the name and the third personal pronoun to the left (in
words). Values = NUMERIC

For the second example, here are the values of these features:

F1 = False
F2 = True
F3 = UNCERTAIN
F4 = 1
F5 = FEMALE
F6 = 3
F7 = FEMALE
F8 = 4
F9 = UNCERTAIN
F10 = 2
F11 = EMPTY
F12 = 0
F13 = EMPTY
F14 = 0
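
To make this concrete, here is a minimal sketch of how the pronoun-based
features could be computed over a tokenized sentence. This is my own
illustrative code, not part of OpenNLP; the class name, pronoun lists and
gender mapping are assumptions:

import java.util.HashMap;
import java.util.Map;

// Sketch: gender and distance of the n-th personal pronoun to the right of
// a name at position nameIdx. The left-side features are symmetric.
public class PronounContextFeatures {

    enum Gender { MALE, FEMALE, UNCERTAIN, EMPTY }

    private static final Map<String, Gender> PRONOUNS = new HashMap<>();
    static {
        PRONOUNS.put("he", Gender.MALE);
        PRONOUNS.put("him", Gender.MALE);
        PRONOUNS.put("she", Gender.FEMALE);
        PRONOUNS.put("her", Gender.FEMALE);
        // First/second person and plural pronouns carry no gender information.
        for (String p : new String[] {"i", "me", "you", "we", "us", "they", "them"}) {
            PRONOUNS.put(p, Gender.UNCERTAIN);
        }
    }

    /** Gender of the n-th pronoun to the right of the name (EMPTY if none). */
    static Gender genderOfNthPronounRight(String[] tokens, int nameIdx, int n) {
        int seen = 0;
        for (int i = nameIdx + 1; i < tokens.length; i++) {
            Gender g = PRONOUNS.get(tokens[i].toLowerCase());
            if (g != null && ++seen == n) {
                return g;
            }
        }
        return Gender.EMPTY;
    }

    /** Distance (in words) to the n-th pronoun to the right (0 if none). */
    static int distanceToNthPronounRight(String[] tokens, int nameIdx, int n) {
        int seen = 0;
        for (int i = nameIdx + 1; i < tokens.length; i++) {
            if (PRONOUNS.containsKey(tokens[i].toLowerCase()) && ++seen == n) {
                return i - nameIdx;
            }
        }
        return 0;
    }
}

Note that the exact distance values depend on the tokenization (e.g.,
whether punctuation counts as a word).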

Of course, the choice of features depends on the type of data, and these
particular features might not work well for some texts, such as those
collected from Twitter, for example.

I hope this helps.

Best regards

Mondher


On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta <damianopo...@gmail.com>
wrote:

> Hi Mondher,
> could you give me a raw example to understand how I should train the
> classifier model?
>
> Thank you in advance!
> Damiano
>
>
> 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <mondher.bouaz...@gmail.com>:
>
> > Hi,
> >
> > I would recommend a hybrid approach where, in a first step, you use a
> > plain dictionary and then perform the classification only if needed.
> >
> > It's straightforward, but I think it would perform better than just
> > performing a classification task.
> >
> > In the first step you use a dictionary of names along with an attribute
> > specifying whether the name fits males, females or both. If the name
> > fits males or females exclusively, there is no need to go any further.
> >
> > If the name fits both genders, or is a family name etc., a second step
> > is needed where you extract features from the context (surrounding words,
> > etc.) and perform a classification task using any machine learning
> > algorithm.
> >
> > Another way would be to use the information itself (whether the name
> > fits males, females or both) as a feature when you perform the
> > classification.
> >
> > Best regards,
> >
> > Mondher
> >
> >
> > On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <damianopo...@gmail.com>
> > wrote:
> >
> > > Awesome! Thank you so much William!
> > >
> > > 2016-06-29 13:36 GMT+02:00 Will

Re: Model to detect the gender

2016-06-29 Thread Mondher Bouazizi
Hi,

I would recommend a hybrid approach where, in a first step, you use a plain
dictionary and then perform the classification only if needed.

It's straightforward, but I think it would perform better than just
performing a classification task.

In the first step you use a dictionary of names along with an attribute
specifying whether the name fits males, females or both. If the name fits
males or females exclusively, there is no need to go any further.

If the name fits both genders, or is a family name etc., a second step is
needed where you extract features from the context (surrounding words,
etc.) and perform a classification task using any machine learning
algorithm.

Another way would be to use the information itself (whether the name fits
males, females or both) as a feature when you perform the classification.
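
In rough Java terms, the overall flow might look like this. This is just a
sketch; the Gender type, the dictionary and classifier objects, and the
extractFeatures helper are hypothetical:

// Two-step hybrid lookup: dictionary first, context classifier as fallback.
Gender resolveGender(String name, String[] contextTokens) {
    Gender g = dictionary.lookup(name);          // returns MALE, FEMALE or BOTH
    if (g == Gender.MALE || g == Gender.FEMALE) {
        return g;                                // exclusive entry: done
    }
    // Ambiguous ("B") or unknown name: classify using context features.
    return classifier.classify(extractFeatures(name, contextTokens));
}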

Best regards,

Mondher


On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta 
wrote:

> Awesome! Thank you so much William!
>
> 2016-06-29 13:36 GMT+02:00 William Colen :
>
> > To create a NER model OpenNLP extracts features from the context, things
> > such as: word prefix and suffix, next word, previous word, previous word
> > prefix and suffix, next word prefix and suffix etc.
> > When you don't configure the feature generator it will apply the default:
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> >
> > Default feature generator:
> >
> > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> >     new AdaptiveFeatureGenerator[]{
> >         new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> >         new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> >         new OutcomePriorFeatureGenerator(),
> >         new PreviousMapFeatureGenerator(),
> >         new BigramNameFeatureGenerator(),
> >         new SentenceFeatureGenerator(true, false)
> >     });
> >
> >
> > These default features should work for most cases (especially English),
> > but they can of course be extended. If you do so, your model will take
> > the new features into account. So yes, you are putting the features in
> > your model.
> >
> > Configuring custom features is not easy. I would start with the default,
> > use 10-fold cross-validation, and take notes of its effectiveness. Then
> > change/add a feature, evaluate, and take notes again. Sometimes a feature
> > that we are sure would help can destroy the model's effectiveness.
> >
> > Regards
> > William
> >
> >
> > 2016-06-29 7:00 GMT-03:00 Damiano Porta :
> >
> > > Thank you William! Really appreciated!
> > >
> > > There is only one point I do not get: when you said "You could
> > > increment your model using Custom Feature Generators", does it mean
> > > that I can "put" these features inside ONE *.bin* file (model) that
> > > implements different things, or are the name finder and the feature
> > > generators separate things?
> > >
> > > Thank you in advance for the clarification.
> > >
> > > 2016-06-29 1:23 GMT+02:00 William Colen :
> > >
> > > > Not exactly. You would create a new NER model to replace yours.
> > > >
> > > > In this approach you would need a corpus like this:
> > > >
> > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the
> > > > board as a nonexecutive director Nov. 29 .
> > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
> > > > Dutch publishing group . <START:personFemale> Jessie Robson <END> is
> > > > retiring , she was a board member for 5 years .
> > > >
> > > >
> > > > I am not a native English speaker, so I am not sure if the example is
> > > > clear enough. I tried to use Jessie as a neutral name and "she" as
> > > > disambiguation.
> > > >
> > > > With a big enough corpus, maybe you could create a model that outputs
> > > > both classes, personMale and personFemale. To train a model you can
> > > > follow
> > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > >
> > > > Let's say your results are not good enough. You could increment your
> > > > model using Custom Feature Generators (
> > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > and
> > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > ).
> > > >
> > > > One of the implemented feature generators can take a dictionary (
> > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > ).
> > > > You can also implement other convenient FeatureGenerators, for
> > > > instance one based on regular expressions.
> > > >
> > > > Again, it is just a wild guess of how to implement 
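
As a concrete illustration of the custom feature generator suggestion
above, the defaults could be combined with a DictionaryFeatureGenerator
roughly like this. This is a sketch against the OpenNLP 1.6 featuregen API,
not code from the thread; "names.dict" is a hypothetical dictionary file in
the OpenNLP Dictionary XML format:

import java.io.FileInputStream;
import java.io.IOException;
import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.util.featuregen.*;

public class CustomFeatureGen {
    public static AdaptiveFeatureGenerator create() throws IOException {
        // The default generators plus a dictionary-based one, e.g. backed
        // by a name list (hypothetical file).
        return new CachedFeatureGenerator(
            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
            new OutcomePriorFeatureGenerator(),
            new PreviousMapFeatureGenerator(),
            new BigramNameFeatureGenerator(),
            new SentenceFeatureGenerator(true, false),
            new DictionaryFeatureGenerator(
                new Dictionary(new FileInputStream("names.dict"))));
    }
}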

Re: Performances of OpenNLP tools

2016-06-21 Thread Mondher Bouazizi
Hi,

Thank you for your replies.

Jeffrey, please accept my apologies once more for sending the email twice.

I also think it would be great to have such studies on the performances of
OpenNLP.

I have been looking for this information and have checked many places,
including obviously Google Scholar, and I haven't found any serious studies
or reliable results. Most of the existing ones report the performance of
outdated releases of OpenNLP, and focus more on execution time, CPU/RAM
consumption, etc.

I think such a comparison would help not only to evaluate the overall
accuracy, but also to highlight the issues with the existing models (as a
matter of fact, the existing models fail to recognize many of the hashtags
in tweets: the tokenizer splits them into the "#" symbol and a word that
the PoS tagger then also fails to tag correctly).

Therefore, building Twitter-based models would also be useful, since many
of the works in academia / industry are focusing on Twitter data.

Best regards,

Mondher



On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <jasonbaldri...@gmail.com>
wrote:

> It would be fantastic to have these numbers. This is an example of
> something that would be a great contribution by someone trying to
> contribute to open source and who is maybe just getting into machine
> learning and natural language processing.
>
> For Twitter-ish text, it'd be great to look at models trained and evaluated
> on the Tweet NLP resources:
>
> http://www.cs.cmu.edu/~ark/TweetNLP/
>
> And comparing to how their models performed, etc. Also, it's worth looking
> at spaCy (Python NLP modules) for further comparisons.
>
> https://spacy.io/
>
> -Jason
>
> On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick <jzemer...@apache.org>
> wrote:
>
> > I saw the same question on the users list on June 17. At least I thought
> > it was the same question -- sorry if it wasn't.
> >
> > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <
> > chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> > > Well, hold on. He sent that mail (as of the time of this mail) 4
> > > mins previously. Maybe some folks need some time to reply ^_^
> > >
> > > ++
> > > Chris Mattmann, Ph.D.
> > > Chief Architect
> > > Instrument Software and Science Data Systems Section (398)
> > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > Office: 168-519, Mailstop: 168-527
> > > Email: chris.a.mattm...@nasa.gov
> > > WWW:  http://sunset.usc.edu/~mattmann/
> > > ++
> > > Director, Information Retrieval and Data Science Group (IRDS)
> > > Adjunct Associate Professor, Computer Science Department
> > > University of Southern California, Los Angeles, CA 90089 USA
> > > WWW: http://irds.usc.edu/
> > > ++
> > >
> > > On 6/20/16, 8:23 AM, "Jeffrey Zemerick" <jzemer...@apache.org> wrote:
> > >
> > > >Hi Mondher,
> > > >
> > > >Since you didn't get any replies, I'm guessing no one is aware of any
> > > >resources related to what you need. Google Scholar is a good place to
> > > >look for papers referencing OpenNLP and its methods (in case you
> > > >haven't searched it already).
> > > >
> > > >Jeff
> > > >
> > > >On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <
> > > >mondher.bouaz...@gmail.com> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> Apologies if you received multiple copies of this email. I sent it
> > > >> to the users list a while ago, and haven't had an answer yet.
> > > >>
> > > >> I have been looking for a while to see whether there is any relevant
> > > >> work that tested the OpenNLP tools (in particular the Lemmatizer,
> > > >> Tokenizer and PoS Tagger) on short and noisy texts such as Twitter
> > > >> data, etc., and/or compared them to other libraries.
> > > >>
> > > >> By performance, I mean accuracy/precision rather than execution
> > > >> time, etc.
> > > >>
> > > >> If anyone can refer me to a paper or a work done in this context,
> > > >> that would be of great help.
> > > >>
> > > >> Thank you very much.
> > > >>
> > > >> Mondher
> > > >>
> > >
> >
>


Performances of OpenNLP tools

2016-06-20 Thread Mondher Bouazizi
Hi,

Apologies if you received multiple copies of this email. I sent it to the
users list a while ago, and haven't had an answer yet.

I have been looking for a while to see whether there is any relevant work
that tested the OpenNLP tools (in particular the Lemmatizer, Tokenizer and
PoS Tagger) on short and noisy texts such as Twitter data, etc., and/or
compared them to other libraries.

By performance, I mean accuracy/precision rather than execution time, etc.

If anyone can refer me to a paper or a work done in this context, that
would be of great help.

Thank you very much.

Mondher


Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-24 Thread Mondher Bouazizi
Hi,

I am sorry for my late reply.

Given the time difference between Japan and USA, I think I won't be
available on weekdays. I will be available only on Friday/Saturday morning
(9-10am EST).

I am not sure if Chris is OK with that; we had our previous meetings on
Saturday mornings.

Otherwise, please go ahead. I will join as soon as I can.

Thanks.

@Chris: my github ID is mondher-bouazizi

Best regards,

Mondher

On Mon, Apr 25, 2016 at 1:44 AM, Anastasija Mensikova <
mensikova.anastas...@gmail.com> wrote:

> Hi Anthony,
>
> I can make Madhawa's proposed time too, after 6pm IST on Tuesday (after
> 8:30am EST). Let me know when exactly!
>
> Thank you,
> Anastasija
>
> On 24 April 2016 at 03:02, Anthony Beylerian <anthony.beyler...@gmail.com>
> wrote:
>
>> Hi Anastasija,
>>
>> I'm not available at those times (00-07 JST). I could make Madhawa's
>> proposed time, but otherwise please go ahead; we may discuss some other
>> time.
>>
>> @Chris: github ID : beylerian
>>
>> Best,
>>
>> Anthony
>>
>>
>> Please find my github profile https://github.com/madhawa-gunasekara
>>
>> Madhawa
>>
>> On Sun, Apr 24, 2016 at 12:13 AM, Madhawa Kasun Gunasekara <
>> madhaw...@gmail.com> wrote:
>>
>> > Hi Chris,
>> >
>> > I'm available on Tuesday & Wednesday after 6.00 pm IST.
>> >
>> > Thanks,
>> > Madhawa
>> >
>> > Madhawa
>> >
>> > On Sat, Apr 23, 2016 at 11:38 PM, Anastasija Mensikova <
>> > mensikova.anastas...@gmail.com> wrote:
>> >
>> >> Hi Chris,
>> >>
>> >> Thank you very much for your email. I'm so excited to work with you!
>> >>
>> >> My Github name is amensiko.
>> >>
>> >> And yes, next week sounds good! I'm available on: Tuesday at 4:20pm
>> EST,
>> >> Thursday 11am - 2:30pm and 4:20 - 6pm EST, Friday 11am - 3pm EST.
>> >>
>> >> Thank you,
>> >> Anastasija
>> >>
>> >> On 23 April 2016 at 10:21, Mattmann, Chris A (3980) <
>> >> chris.a.mattm...@jpl.nasa.gov> wrote:
>> >>
>> >>> Hi Anastasija,
>> >>>
>> >>> Hope you are well. It’s now time to get started on the project.
>> >>> Mondher, Anthony, Madhawa and I have been discussing ideas about
>> >>> how to proceed with the project and even developing a task list.
>> >>> Let’s get your tasks input into that list, and also coordinate.
>> >>>
>> >>> I also have an action to share some Spanish/English data to try
>> >>> and do cross lingual sentiment analysis.
>> >>>
>> >>> Are you available to chat this week?
>> >>>
>> >>> Cheers,
>> >>> Chris
>> >>>
>> >>> ++
>> >>> Chris Mattmann, Ph.D.
>> >>> Chief Architect
>> >>> Instrument Software and Science Data Systems Section (398)
>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >>> Office: 168-519, Mailstop: 168-527
>> >>> Email: chris.a.mattm...@nasa.gov
>> >>> WWW:  http://sunset.usc.edu/~mattmann/
>> >>> ++
>> >>> Director, Information Retrieval and Data Science Group (IRDS)
>> >>> Adjunct Associate Professor, Computer Science Department
>> >>> University of Southern California, Los Angeles, CA 90089 USA
>> >>> WWW: http://irds.usc.edu/
>> >>> ++
>> >>>
>> >>> On 4/23/16, 4:49 AM, "Anthony Beylerian" <anthony.beyler...@gmail.com
>> >
>> >>> wrote:
>> >>>
>> >>> >Hello,
>> >>> >
>> >>> >Congratulations on being accepted for this year's GSoC.
>> >>> >Although Mondher and myself will not participate this year as
>> >>> >students, we will do our best to help.
>> >>> >We are currently busy with academic research, but will join the
>> >>> >efforts when possible.
>> >>> >Otherwise, for any discussion concerning the proposed approaches,
>> >>> >please let us know.
>> >>> >
>> >>> >Best,
>> >>> >
>> >>> >On Sat, Apr 23, 2016 at 6:02 PM, Madhawa Kasun Gunasekara <
>> >>> >madhaw...@gmail.com> wrote:
>> >>> >
>> >>> >> Sure we will start working on this.
>> >>> >>
>> >>> >> Thanks,
>> >>> >> Madhawa
>> >>> >>
>> >>> >> Madhawa
>> >>> >>
>> >>> >> On Sat, Apr 23, 2016 at 1:38 AM, Chris Mattmann <
>> mattm...@apache.org>
>> >>> >> wrote:
>> >>> >>
>> >>> >>> Congrats!
>> >>> >>>
>> >>> >>> time to get started team.
>> >>> >>>
>> >>>
>> >>
>> >>
>> >
>>
>
>


Re: GSOC2016 Sentiment Analysis

2016-03-29 Thread Mondher Bouazizi
Dear Madhawa,

Thank you for your interest in the proposals.
The current tasks we proposed refer to the classification and
quantification regardless of the topic.
This can be used in a larger context where the topic is not specified, or
not unique, in which case we will need to identify the topic(s).
Therefore, a topic detector would be a good idea to implement, in order to
complement this.

As for the Document Categorizer, it is a general-purpose component with
basic features (n-grams, bag of words, etc.).
It is basically used for the classification of texts into a set of classes
defined by the user, whether these are sentiment classes or others.
However, it doesn't perform well for this specific purpose.
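
For reference, a minimal sketch of how the Document Categorizer is
typically invoked with a pre-trained model. The "sentiment.bin" file and
its categories are hypothetical:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class DoccatSketch {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained categorizer model (hypothetical file whose
        // categories could be, e.g., "positive" and "negative").
        try (InputStream in = new FileInputStream("sentiment.bin")) {
            DoccatModel model = new DoccatModel(in);
            DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

            String[] tokens = {"What", "a", "great", "movie", "!"};
            double[] outcomes = categorizer.categorize(tokens);
            System.out.println(categorizer.getBestCategory(outcomes));
        }
    }
}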

Furthermore, the sentiment analysis component would not just perform naive
classification but also additional tasks (e.g., quantification), and would
implement more specific and sophisticated approaches.

Please share your thoughts.

Mondher



On Tue, Mar 29, 2016 at 1:51 PM, Madhawa Kasun Gunasekara <
madhaw...@gmail.com> wrote:

> Hi Chris / Antony
>
> Yes, I would like to work on this. This proposal addresses most of the
> things in sentiment analysis.
> AFAIK most people use the OpenNLP Document Categorizer for sentiment
> analysis, since there isn't proper functionality to do sentiment analysis
> in OpenNLP. It would be great if we could add this feature to the OpenNLP
> project, and I would also like to suggest that this feature should be able
> to detect the target object of the opinions as well.
>
> WDYT ??
>
> Thanks,
> Madhawa
>
> Madhawa
>
> On Tue, Mar 29, 2016 at 2:11 AM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Dear Anthony,
>>
>> Great! These both sound like fantastic proposals and I’m happy
>> to be a mentor. Madhawa, would you like to join in on these
>> efforts?
>>
>> Cheers,
>> Chris
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Anthony Beylerian 
>> Date: Monday, March 28, 2016 at 11:48 AM
>> To: "dev@opennlp.apache.org" ,
>> "mondher.bouaz...@gmail.com" 
>> Cc: Madhawa Kasun Gunasekara , jpluser
>> 
>> Subject: RE: GSOC2016 Sentiment Analysis
>>
>> >Dear Chris,
>> >
>> >Thank you for starting the discussion.
>> >We are glad there is an interest in a sentiment analysis component.
>> >
>> >My colleague Mondher posted the two JIRA issues related to Sentiment
>> >Analysis [1][2] as references for our proposals [3][4] for GSoC.
>> >In fact, we have been researching this topic at our university.
>> >We are hoping to participate this year and work on integrating both a
>> >sentiment classifier and a quantifier for the library.
>> >
>> >It would be nice to also have an interface with Tika, maybe we can
>> >collaborate ?
>> >We are also looking for mentors, in case someone is willing to support
>> >our proposals.
>> >
>> >Best,
>> >
>> >Anthony
>> >
>> >[1] https://issues.apache.org/jira/browse/OPENNLP-842
>> >[2] https://issues.apache.org/jira/browse/OPENNLP-840
>> >[3]
>> >https://docs.google.com/document/d/1nVnwpmGaOnwHERXr55IClE4V87jUX2sva-mkgWnR8n0/edit?usp=sharing
>> >[4]
>> >https://docs.google.com/document/d/1x02II9W3rirtuSbx_sY8kOQZSgOp0SIKeIWTCXEOJvo/edit?usp=sharing
>> >
>> >> From: chris.a.mattm...@jpl.nasa.gov
>> >> To: nishant@gmail.com
>> >> CC: dev@opennlp.apache.org; madhaw...@gmail.com; hmanj...@usc.edu;
>> >>kamal...@usc.edu
>> >> Subject: Re: GSOC2016 Sentiment Analysis
>> >> Date: Sun, 27 Mar 2016 19:34:24 +
>> >>
>> >> No problem - I just wanted to encourage discussion thank you for
>> >> your prompt and courteous replies.
>> >>
>> >> ++
>> >> Chris Mattmann, Ph.D.
>> >> Chief Architect
>> >> Instrument Software and Science Data Systems Section (398)
>> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >> Office: 168-519, Mailstop: 168-527
>> >> Email: chris.a.mattm...@nasa.gov
>> >> WWW: http://sunset.usc.edu/~mattmann/
>> >> ++
>> >> Director, Information Retrieval and Data Science Group (IRDS)
>> >> Adjunct Associate Professor, Computer Science 

Re: WSD - Supervised techniques

2015-07-14 Thread Mondher Bouazizi
Dear all,

Thank you Anthony for the detailed explanation.

Regarding the parser/converter classes that Anthony mentioned, I think it
would be a better idea to make an independent component in OpenNLP that
processes SemCor data. NLTK [1], for example, which is a Python library for
natural language processing, contains a component to read SemCor data [2]
that can be used by its other components (not only the WSD one).

For now, I am using MIT JSemcor [3] (which is MIT-licensed) to read SemCor
files, but as soon as I finish the implementation of the remaining parts of
IMS (all-words WSD [4] / coarse-grained vs. fine-grained), I'll implement
our own SemCor reader.

On the other hand, for now, I will clean up the IMS code and make it
independent of the format of the training data source. All data will pass
through a connector. I will run the first tests using SemCor data: the
evaluator will use SemCor data for training, and the data collected from
Senseval-3 for testing (to compare the different approaches implemented).

Also, please watch the issues ([4]-[9]) so you can get updates each time we
add a patch for each component. Thanks.

Best regards,

Mondher



[1] http://www.nltk.org/api/nltk.html
[2]
http://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.semcor
[3] http://projects.csail.mit.edu/jsemcor/
[4] https://issues.apache.org/jira/browse/OPENNLP-797
[5] https://issues.apache.org/jira/browse/OPENNLP-789
[6] https://issues.apache.org/jira/browse/OPENNLP-790
[7] https://issues.apache.org/jira/browse/OPENNLP-794
[8] https://issues.apache.org/jira/browse/OPENNLP-795
[9] https://issues.apache.org/jira/browse/OPENNLP-796

On Tue, Jul 14, 2015 at 1:54 AM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Dear Rodrigo,

 Thank you for the feedback.

 I have added [1][2][3] issues regarding the below.

 Concerning the testers (IMSTester etc) they should be in src/test/java/
 We can add docs in those to explain how to use each implementation.

 Actually, I am using the parser for Senseval-3 that Mondher mentioned in
 [LeskEvaluatorTest]; the functionality was included in DataExtractor.
 I believe it would be best to separate that and have two parser/converter
 classes of the sort :

 disambiguator.reader.SemCorReader,
 disambiguator.reader.SensevalReader.

 That should be clearer, what do you think ?

 Anthony

 [1]: https://issues.apache.org/jira/browse/OPENNLP-794
 [2]: https://issues.apache.org/jira/browse/OPENNLP-795
 [3]: https://issues.apache.org/jira/browse/OPENNLP-796

  From: rage...@apache.org
  Date: Mon, 13 Jul 2015 15:50:00 +0200
  Subject: Re: WSD - Supervised techniques
  To: dev@opennlp.apache.org
 
  Hello,
 
  There has been little public activity these last days. We believe that it
  is very important to step up in several directions wrt what is already
  committed in svn:
 
  1. Finishing the WSDEvaluator
  2. Provide the classes required to run the WSD tools from the CLI as
  any other component.
  3. Formats: it would be interesting to have at least converters for the
  most common datasets used for evaluation and training, e.g., SemCor and
  Senseval-3. You have mentioned that a converter was already
  implemented, but I cannot find it in svn.
  4. Write the documentation so that future users (and other dev members
  here) can test the component.
 
  These comments were general for both unsupervised and supervised WSD.
  Specific to supervised WSD:
 
  5. IMS: you mention in your previous email that the lexical sample
  part is done and that you need to finish the all-words IMS
  implementation. If this is the case, a JIRA issue should be opened about
  it and made a priority.
  Incidentally, I cannot find the IMSTester you mentioned in the email.
 
  There is already an issue for the Evaluator (OPENNLP-790), but I
  think each of the remaining tasks requires its own JIRA issue
  (this issue has pending unused imports, variables and other things).
 
  The aim before GSoC ends should be to give the WSD component the best
  chance of being a good candidate for integration into the opennlp
  tools. Also, by being able to test it, we can see the actual state of
  the component with respect to performance on the usual datasets.
 
  Can you please create such issues in JIRA and start addressing them
 separately?
 
  Thanks,
 
  Rodrigo
 
 
 
  On Sun, Jun 28, 2015 at 6:33 PM, Mondher Bouazizi
  mondher.bouaz...@gmail.com wrote:
   Hi everyone,
  
   I finished the first iteration of IMS approach for lexical sample
   disambiguation. Please find the patch uploaded on the jira issue [1]. I
   also created a tester (IMSTester) to run it.
  
   As I mentioned before, the approach is as follows: each time the module
   is called to disambiguate a word, it first checks if the model file for
   that word exists.
  
   1- If the model file exists, it is used to disambiguate the word
  
   2- Otherwise, if the file does not exist, the module checks

Re: WSD - Supervised techniques

2015-06-28 Thread Mondher Bouazizi
Hi everyone,

I finished the first iteration of IMS approach for lexical sample
disambiguation. Please find the patch uploaded on the jira issue [1]. I
also created a tester (IMSTester) to run it.

As I mentioned before, the approach is as follows: each time the module is
called to disambiguate a word, it first checks whether the model file for
that word exists.

1- If the model file exists, it is used to disambiguate the word

2- Otherwise, if the file does not exist, the module checks whether the
training data file for that word exists. If it does, the XML data is used
to train the model and create the model file.

3- If no training data exist, the most frequent sense (MFS) in WordNet is
returned.
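
In rough code, the dispatch logic looks like this. This is only a sketch,
not the actual patch; the file naming and the helper methods
(classifyWithModel, trainAndSaveModel, mostFrequentSense) are hypothetical:

// Per-word dispatch: existing model -> train from XML -> WordNet MFS.
// modelDir and trainingDir are assumed configuration fields.
public String disambiguate(String word, String[] context) throws IOException {
    File modelFile = new File(modelDir, word + ".ims.bin");
    if (modelFile.exists()) {
        return classifyWithModel(modelFile, context);    // 1- use existing model
    }
    File trainingFile = new File(trainingDir, word + ".xml");
    if (trainingFile.exists()) {
        trainAndSaveModel(trainingFile, modelFile);      // 2- train, then classify
        return classifyWithModel(modelFile, context);
    }
    return mostFrequentSense(word);                      // 3- fall back to WordNet MFS
}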

For now I am using the training data I collected from the Senseval and
Semeval websites. However, I am currently looking into SemCor to use it as
the main reference.

Yours sincerely,

Mondher

[1] https://issues.apache.org/jira/browse/OPENNLP-757



On Thu, Jun 25, 2015 at 5:27 AM, Joern Kottmann kottm...@gmail.com wrote:

 On Fri, 2015-06-19 at 21:42 +0900, Mondher Bouazizi wrote:
  Hi,
 
  Actually I have finished the implementation of most of the parts of the
 IMS
  approach. I also made a parser for the Senseval-3 data.
 
  However I am currently working on two main points:
 
   - I am trying to figure out how to use the MaxEnt classifier.
   Unfortunately there is not enough documentation, so I am trying to see
   how it is used by the other components of OpenNLP. Any recommendations?

 Yes, have a look at the doccat component. It should be easy to
 understand from it how it works. The classifier has to be trained with
 events (an outcome plus features) and can then classify a set of features
 into the categories it has seen before as outcomes.

 Jörn
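
For what it's worth, that train-then-classify cycle might look roughly
like this with the maxent API in opennlp-tools 1.6 (a sketch only; the
feature strings and sense labels below are made up for illustration):

import java.util.Arrays;
import java.util.List;
import opennlp.tools.ml.maxent.GIS;
import opennlp.tools.ml.model.Event;
import opennlp.tools.ml.model.MaxentModel;
import opennlp.tools.util.CollectionObjectStream;
import opennlp.tools.util.ObjectStream;

public class MaxentSketch {
    public static void main(String[] args) throws Exception {
        // Each event pairs an outcome (an invented sense label here) with
        // the feature strings observed in the word's context.
        List<Event> events = Arrays.asList(
            new Event("bank_river", new String[] {"w=river", "w=water", "pos=NN"}),
            new Event("bank_money", new String[] {"w=money", "w=loan", "pos=NN"}));

        ObjectStream<Event> stream = new CollectionObjectStream<>(events);
        MaxentModel model = GIS.trainModel(stream);

        // eval() returns one probability per outcome seen during training.
        double[] probs = model.eval(new String[] {"w=money", "pos=NN"});
        System.out.println(model.getBestOutcome(probs));
    }
}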



Re: GSoC 2015 - WSD Module

2015-06-08 Thread Mondher Bouazizi
Dear Rodrigo,

As Anthony mentioned in his previous email, I have already started the
implementation of the IMS approach. The pre-processing and the extraction
of features are already finished. Regarding the approach itself, it shows
some potential according to the authors, though the proposed features are
few and basic. I think the approach itself might be enhanced if we add more
context-specific features from some other approaches... (To do that, I need
to run many experiments using different combinations of features; however,
that should not be a problem.)
But the approach itself requires a linear SVM classifier, and as far as I
know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
libsvm?

Regarding the training data, I started collecting some from different
sources. Most of the existing rich corpora are licensed (including the ones
mentioned in the paper). The free ones I have for now are from the Senseval
and Semeval websites. However, these are meant just to evaluate the methods
proposed in the workshops. Therefore, the words to disambiguate are few in
number, though the training data for each word are rich enough.

In any case, the first tests with the data collected from Senseval and
Semeval should be finished soon. However, I am not sure there is a rich
enough dataset we can use to build the model for the WSD module in the
OpenNLP library. If you have any recommendation, I would be grateful for
your help on this point.

On the other hand, we're cleaning our implementation of the different
variations of Lesk. However, we are currently using JWNL. If there are no
objections, we will migrate to extJWNL.

As Jörn mentioned sending an initial patch: should we separate our code and
upload two different patches to the two issues we created in JIRA (however,
this means a lot of redundancy in the code), or shall we keep the code in
one project and upload it once? If we opt for the latter, which issue
should we upload the patch to?

Thanks,

Mondher, Anthony

On Mon, Jun 8, 2015 at 7:51 PM, Rodrigo Agerri rage...@apache.org wrote:

 Hello,

 +1 for using extJWNL instead of JWNL, I use it in some other projects
 too and it is very nice IMHO.

 R

 On Sat, Jun 6, 2015 at 12:55 PM, Aliaksandr Autayeu
 aliaksa...@autayeu.com wrote:
  Thinking of impartiality... Anyway, I'm the author of extJWNL in case you
  have questions.
 
  Aliaksandr
 
  On 6 June 2015 at 11:43, Richard Eckart de Castilho 
  richard.eck...@gmail.com wrote:
 
  On 05.06.2015, at 14:24, Anthony Beylerian 
 anthonybeyler...@hotmail.com
  wrote:
 
   So just to make sure, we are currently relying on JWNL to access
 WordNet
  as a resource.
 
  There is a more modern fork of JWNL available called
  http://extjwnl.sourceforge.net .
  It includes provisions of loading WordNet from the classpath, e.g.
  from Maven dependencies. It might be a nice replacement for JWNL and is
  also licensed
  under the BSD license. Pre-packaged WordNet Maven artifacts are also
  available.
 
  Cheers,
 
  -- Richard



Re: GSoC 2015 - WSD Module

2015-05-22 Thread Mondher Bouazizi
Hi all,

Thanks Rodrigo for the feedback.
I don't mind starting with the IMS implementation as a first supervised
solution.
It seems to be a good first step.
As for the SST, I will read more about it and will let you know.

On the other hand, how about the following interface Anthony and I
prepared based on Jörn's recommendation?
We tried to be as close as possible to the other tools already implemented.

Link :
https://drive.google.com/file/d/0B7ON7bq1zRm3NTI1bGFfc3lZX0U/view?usp=sharing

Best regards,

Mondher, Anthony



On Fri, May 22, 2015 at 9:59 PM, Rodrigo Agerri rage...@apache.org wrote:

 Hello Mondher (my response is about supervised WSD),

 Thanks for the info, it is quite interesting. Apart from the comment
 by Jörn, which I think is very important if we want to achieve
 something given the time constrains of the GSOC, I have a couple of
 recommendations/comments from my part:

 1. Rather than targeting the Lexical Sample task or all-words WSD, I think
 it could be more practical to choose an approach/algorithm and try to
 implement it in OpenNLP. One of the most (if not the most) popular
 approaches is the It Makes Sense (IMS) system:

 http://www.comp.nus.edu.sg/~nlp/sw/README.txt
 https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf

 That I think is achievable in the GSOC time frame.

 2. As an aside, research has been moving towards supersense tagging
 (SST), given the difficulty of WSD.

 http://ttic.uchicago.edu/~altun/pubs/CiaAlt_EMNLP06.pdf

 As you can see in the above paper, SST is approached as a sequence
 labelling task rather than classification. This means that we could
 reimplement Ciaramita and Altun's (2006) features by implementing
 AdaptiveFeatureGenerators and creating a module structurally similar
 to the NameFinder, but for SST.

 This also has the advantage of being able to move beyond the old SemCor
 and Senseval datasets to current tweet datasets and so on. See this
 recent paper on SST for tweets:

 http://aclweb.org/anthology/S14-1001

 I think that for supervised WSD, we should pursue option 1 or 2 and
 start defining the interface as Jörn has suggested.

 Best,

 Rodrigo

 On Mon, May 18, 2015 at 2:14 PM, Anthony Beylerian
 anthonybeyler...@hotmail.com wrote:
  Dear all,
 
  In the context of building a Word Sense Disambiguation (WSD) module,
 after
  doing a survey on WSD techniques, we realized the following points :
 
  - WSD techniques can be split into three sets (supervised,
  unsupervised/knowledge based, hybrid)
 
  - WSD is used for different directly related objectives such as all-words
  disambiguation, lexical sample disambiguation, multi/cross-lingual
  approaches etc.
 
  - Senseval/Semeval seem to be good references to compare different
  techniques for WSD since many of them were tested on the same data
  (though the data differs from one event to the next).
 
  - For the sake of making a first solution, we propose to start by
  supporting the lexical sample type of disambiguation, i.e.,
  disambiguating a single word or a limited set of words from an input
  text.
 
 
  Therefore, we have decided to collect information about the different
  techniques in the literature (such as references, performance, parameters
  etc.) in this spreadsheet here.
  We have also collected the results of all the senseval/semeval
  exercises here.
  (Note that each document has many sheets.)
  The collected results could help decide which techniques to start
  with as main models for each set of techniques (supervised/unsupervised).
 
  We also propose a general approach for the package in the figure
 attached.
  The main components are as follows :
 
  1- The different resources publicly available : WordNet, BabelNet,
  Wikipedia, etc.
  However, we would also like to allow the users to use their own local
  resources, by maybe defining a type of connector to the resource
 interface.
 
  2- The resource interface will have the role of providing both a sense
  inventory that the user can query and a knowledge base (such as semantic
  or syntactic info, etc.) that might be used depending on the technique.
  We might even later consider building a local cache for remote services.
 
  3- The WSD algorithms/techniques themselves that will make use of the
  resource interface to access the resources required.
  These techniques will be split into two main packages as in the left
 side of
  the figure :  Supervised/Unsupervised.
  The utils package includes common tools used in both types of techniques.
  The details mentioned in each package should be common to all
  implementations of these abstract models.
 
  4- I/O could be processed in different formats (XML/JSON etc) or a
 simpler
  structure following your recommendations.
 
  If you have any suggestions or recommendations, we would really
 appreciate
  discussing them and would like your guidance to iterate on this tool-set.
 
  Best regards,
 
  Anthony Beylerian, Mondher Bouazizi



Re: GSoC 2015 - WSD Module

2015-05-18 Thread Mondher Bouazizi
Dear all,

Sorry if you received multiple copies of this email (The links were
embedded). Here are the actual links:

*Figure:*
https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
*Semeval/senseval results summary:*
https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
*Literature survey of WSD techniques:*
https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing

Yours faithfully,

On Mon, May 18, 2015 at 10:17 PM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Please excuse the duplicate email, we could not attach the mentioned
 figure.
 Kindly find it here.
 Thank you.

 From: anthonybeyler...@hotmail.com
 To: dev@opennlp.apache.org
 Subject: GSoC 2015 - WSD Module
 Date: Mon, 18 May 2015 22:14:43 +0900




 Dear all,
 In the context of building a Word Sense Disambiguation (WSD) module, after
 doing a survey on WSD techniques, we realized the following points:

 - WSD techniques can be split into three sets (supervised,
 unsupervised/knowledge based, hybrid)
 - WSD is used for different directly related objectives such as all-words
 disambiguation, lexical sample disambiguation, multi/cross-lingual
 approaches etc.
 - Senseval/Semeval seem to be good references to compare different
 techniques for WSD since many of them were tested on the same data (though
 the data differs from one event to the next).
 - For the sake of making a first solution, we propose to start by
 supporting the lexical sample type of disambiguation, i.e., disambiguating
 a single word or a limited set of words from an input text.

 Therefore, we have decided to collect information about the different
 techniques in the literature (such as references, performance, parameters
 etc.) in this spreadsheet here.
 We have also collected the results of all the senseval/semeval exercises
 here. (Note that each document has many sheets.)
 The collected results could help decide which techniques to start with as
 main models for each set of techniques (supervised/unsupervised).

 We also propose a general approach for the package in the figure attached.
 The main components are as follows:

 1- The different resources publicly available: WordNet, BabelNet,
 Wikipedia, etc. However, we would also like to allow the users to use
 their own local resources, by maybe defining a type of connector to the
 resource interface.

 2- The resource interface will have the role of providing both a sense
 inventory that the user can query and a knowledge base (such as semantic
 or syntactic info, etc.) that might be used depending on the technique.
 We might even later consider building a local cache for remote services.

 3- The WSD algorithms/techniques themselves, which will make use of the
 resource interface to access the resources required. These techniques will
 be split into two main packages as in the left side of the figure:
 Supervised/Unsupervised. The utils package includes common tools used in
 both types of techniques. The details mentioned in each package should be
 common to all implementations of these abstract models.

 4- I/O could be processed in different formats (XML/JSON etc.) or a
 simpler structure following your recommendations.

 If you have any suggestions or recommendations, we would really appreciate
 discussing them and would like your guidance to iterate on this tool-set.
 Best regards,

 Anthony Beylerian, Mondher Bouazizi



GSoC - Self introduction

2015-05-03 Thread Mondher Bouazizi
Dear all,

I am Mondher Bouazizi, from Tunisia. I am a Master's student at Keio
University in Japan. My academic research currently focuses on data mining.

I am glad to inform you that my project proposal has been accepted for the
Google Summer of Code 2015. The proposal is to add a Word Sense
Disambiguation (WSD) component to the OpenNLP library.

The objective of WSD is to determine which sense of a word is meant in a
particular context. Different techniques have been proposed in the academic
literature, but in general they fall into two main categories: supervised
and unsupervised. In my work I will design and build a WSD module that
implements the algorithms of common supervised techniques (e.g., Decision
Trees, Exemplar-Based or Instance-Based Learning, etc.). My colleague
Anthony, who was also accepted, will be working on the unsupervised ones.

(For more details about the project, please check the issue I created here
https://issues.apache.org/jira/browse/OPENNLP-757)

I hope the work will make a good contribution to the OpenNLP project and to
the open source community in general.

Yours sincerely,

Mondher Bouazizi