Update to Java 8

2016-12-19 Thread Joern Kottmann
Hello all,

Java 7 is already EOL.

Should we update OpenNLP to Java 8 for the 1.7.0 release, any opinions?

Jörn


Re: TODO in GeneratorFactory.java

2016-12-13 Thread Joern Kottmann
Yes, that is a nice change, can you open a jira issue for it and send me
the PR?

Would like to include that.

Jörn

On Tue, Dec 13, 2016 at 1:41 PM, Jeffrey Zemerick 
wrote:

> Hi everyone,
>
> I came across a TODO in GeneratorFactory.java to make
> the TokenClassFeatureGenerator construction configurable. My cursory search
> of the JIRA didn't show any related tasks. I presumed the configuration was
> referring to the boolean generateWordAndClassFeature argument in the
> constructor of TokenClassFeatureGenerator. I added the parameter to the XML
> and pass that value to the constructor. If there is no parameter in the XML
> it sets to `true` to maintain backward compatibility. A diff of my changes
> can be seen at https://github.com/apache/opennlp/compare/trunk...jzonthemtn:TokenClassFeatureGenerator.
>
> If this is still a valid "todo" item and the changes I made are what was
> desired, I can submit a patch or a pull request.
>
> Jeff
>
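For reference, a minimal sketch of the constructor flag Jeff's change maps to;
the XML attribute name is defined by his patch, so only the boolean below is
taken from the existing OpenNLP API:

    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
    import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;

    // true keeps the current default behaviour (word-and-class feature on);
    // the proposed XML parameter would simply map to this constructor argument
    AdaptiveFeatureGenerator gen = new TokenClassFeatureGenerator(true);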


Re: Next release

2016-11-09 Thread Joern Kottmann
Hmm, I think it is probably easier if we keep the master branch for the
release and have separate topic branches for new work and if necessary a
next branch. After the release is done we can merge the finished work into
master and repeat. That way we could have master always in a good state
which is almost ready for release.

Having a stable master which goes from release to release also makes it
easier to use git bisect to automatically find the commits that broke something.

For bigger changes I will also run the extended tests I have more often. If I
have those changes in a topic branch before we merge, that would be great; if
there are issues, we don't break the master branch. With Git it is easier to
deal with branches than it was with Subversion. My goal is to get this kind of
workflow automated, with a build server running and reporting to us on each
pull request.

Jörn

On Wed, Nov 9, 2016 at 9:52 AM, Rodrigo Agerri <rage...@apache.org> wrote:

> Hello,
>
> No problem. Should I just create a release-1.7 branch so we can do
> all the work towards the next release there? Or would you prefer
> different branches?
>
> Cheers,
>
> R
>
> On Tue, Nov 8, 2016 at 12:07 PM, Joern Kottmann <kottm...@gmail.com>
> wrote:
> > Hello Rodrigo,
> >
> > would you mind adding this to our README file?
> >
> > It is in opennlp-distr and should contain the notable changes for the
> > release; anyone else, please also add your changes there. Currently it
> > still contains the contents for 1.6.0.
> >
> > You can just start working on it in a separate branch; with Git we can
> > support a workflow where we merge in features/changes when they are
> > ready.
> > I think this is also a really good approach because we can then run the
> > extensive tests before we merge a branch.
> >
> > Jörn
> >
> > On Tue, Nov 8, 2016 at 9:48 AM, Rodrigo Agerri <rage...@apache.org>
> wrote:
> >
> >> Hello,
> >>
> >> +1 for 1.7.0 as the next release and +1 for a yearly release
> >>
> >> Just to provide some info, the main changes in the lemmatizer have been:
> >>
> >> 1. Added a supervised statistical lemmatizer, usable from the CLI and
> >> API. The supervised lemmatizer now provides much better coverage for
> >> unknown words with respect to the previously existing dictionary-based
> >> one.
> >> 2. The lemmatizer component has been rewritten and the API therefore
> >> has substantially changed. Thus, the changes in the Dictionary-based
> >> lemmatizer are not backward compatible. In any case, I do not think
> >> that many people were using it, and the change required to use the
> >> new API is minor.
> >>
> >> The new statistical lemmatizer can support the Dictionary-based
> >> lemmatizers often used to provide features for components such as Word
> >> Sense Disambiguation, Opinion Mining/Sentiment Analysis, etc. In this
> >> regard, it will be nice to aim at working on the development of those
> >> two components for their release. Maybe the next release is too close,
> >> but definitely for the next one.
> >>
> >> Cheers,
> >>
> >> Rodrigo
> >>
> >> On Mon, Nov 7, 2016 at 7:01 PM, Russ, Daniel (NIH/CIT) [E]
> >> <dr...@mail.nih.gov> wrote:
> >> > Also the lemmatizer has significantly changed.  I vote 1.7
> >> >
> >> > On 11/7/16, 12:59 PM, "Joern Kottmann" <kottm...@gmail.com> wrote:
> >> >
> >> > Hello all,
> >> >
> >> > since our last release it has been a while and we received quite a
> >> few
> >> > changes which would be nice to get released.
> >> >
> >> > There are still some open Jira issues, but mostly smaller things
> that
> >> > can be wrapped up rather quickly.
> >> >
> >> > Is there anything important missing which should go into the next
> >> > release? Otherwise I think we should also aim for more frequent
> >> > releases and just make one again early next year, with all the stuff
> >> > we might miss out now.
> >> >
> >> > We took in a patch - as part of OPENNLP-830 - to replace our
> >> self-made
> >> > hash table with the java.util.HashMap. This change is not backward
> >> > compatible for folks who extend AbstractModel.
> >> >
> >> > Should we go with 1.6.1 as a next version or should we make 1.7.0
> to
> >> > reflect that?
> >> >
> >> > Previously we only had backward incompatible changes in versions
> >> > which bumped the second number. Maybe that is the better choice. It
> >> > will probably break some people's code when they update.
> >> >
> >> > We also have lots of deprecated API still in OpenNLP, should we
> try
> >> to
> >> > remove as much as possible of it now?
> >> >
> >> > Jörn
> >> >
> >> >
> >>
>


Next release

2016-11-07 Thread Joern Kottmann
Hello all,

since our last release it has been a while and we received quite a few
changes which would be nice to get released.

There are still some open Jira issues, but mostly smaller things that
can be wrapped up rather quickly.

Is there anything important missing which should go into the next
release? Otherwise I think we should also aim for more frequent
releases and just make one again early next year, with all the stuff we
might miss out now.

We took in a patch - as part of OPENNLP-830 - to replace our self-made
hash table with the java.util.HashMap. This change is not backward
compatible for folks who extend AbstractModel.

Should we go with 1.6.1 as the next version, or should we make it 1.7.0 to
reflect that?

Previously we only had backward incompatible changes in versions which
bumped the second number. Maybe that is the better choice. It will
probably break some people's code when they update.

We also have lots of deprecated API still in OpenNLP; should we try to
remove as much of it as possible now?

Jörn


Re: Why can i not serialize a Dictionary ?

2016-10-29 Thread Joern Kottmann
)
> > > at opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(GeneratorFactory.java:661)
> > > at opennlp.tools.util.featuregen.GeneratorFactory$CachedFeatureGeneratorFactory.create(GeneratorFactory.java:171)
> > > at opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(GeneratorFactory.java:661)
> > > at opennlp.tools.util.featuregen.GeneratorFactory$AggregatedFeatureGeneratorFactory.create(GeneratorFactory.java:129)
> > > at opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(GeneratorFactory.java:661)
> > > at opennlp.tools.util.featuregen.GeneratorFactory.create(GeneratorFactory.java:711)
> > > at opennlp.tools.namefind.TokenNameFinderFactory.createFeatureGenerators(TokenNameFinderFactory.java:153)
> > > ... 4 more
> > > 
> > > 2016-10-28 12:55 GMT+02:00 Joern Kottmann <kottm...@gmail.com>:
> > > 
> > > > 
> > > > Try to rename the dictionary key to xyz.dictionary then the
> > > > serializer
> > > > will
> > > > be mapped correctly.
> > > > 
> > > > Jörn
> > > > 
> > > > On Thu, Oct 27, 2016 at 11:14 PM, Damiano Porta <damianoporta@g
> > > > mail.com>
> > > > wrote:
> > > > 
> > > > > 
> > > > > Jorn i add the Dictionary here:
> > > > > https://gist.github.com/anonymous/bc822fb0520c4c42b75748bf4147da34#file-train-java-L15
> > > > > 
> > > > > And unfortunately i only see this error:
> > > > > 
> > > > > java.lang.IllegalStateException: Missing serializer for
> > > > > damiano
> > > > > at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:610)
> > > > > 
> > > > > I do not have other info.
> > > > > Do i have to create a custom Serializer too?
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 2016-10-27 22:04 GMT+02:00 Joern Kottmann <kottm...@gmail.com
> > > > > >:
> > > > > 
> > > > > > 
> > > > > > On Thu, 2016-10-27 at 21:18 +0200, Joern Kottmann wrote:
> > > > > > > 
> > > > > > > On Tue, 2016-10-25 at 18:49 +0200, Damiano Porta wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > i am getting a strange error during the compiling of a
> > > > > > > > NER model.
> > > > > > > > Basically, the end of the build output is:
> > > > > > > > 
> > > > > > > >  98:  ... loglikelihood=-13340.018762351776
> > > > > > > > 0.999005934601099
> > > > > > > >  99:  ... loglikelihood=-13258.358751926637
> > > > > > > > 0.9990120681028991
> > > > > > > > 100:  ... loglikelihood=-13178.039964721707
> > > > > > > > 0.9990177634974279
> > > > > > > > Exception in thread "main" java.lang.IllegalStateException: Missing
> > > > > > > > serializer for *mydictionary*
> > > > > > > > at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:610)
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > Can you please post the full exception stack trace?
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > > And what is the name of the key you used for the
> > > > > > dictionary?
> > > > > > The dictionary serializers are only mapped by extension.
> > > > > > 
> > > > > > Jörn
> > > > > > 
> > > > > 
> > > > 
> > > 
> > > 
> > 


Re: Why can i not serialize a Dictionary ?

2016-10-28 Thread Joern Kottmann
Try to rename the dictionary key to xyz.dictionary then the serializer will
be mapped correctly.
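
As an illustration, a minimal sketch of the resource map passed to the name
finder training code; the variable names here are assumptions, the point is
only that the key needs an extension (like .dictionary) that maps to a known
serializer:

    import java.util.HashMap;
    import java.util.Map;

    // "dictionary" is assumed to be an opennlp.tools.dictionary.Dictionary instance;
    // a bare key like "damiano" cannot be resolved by BaseModel.serialize()
    Map<String, Object> resources = new HashMap<>();
    resources.put("damiano.dictionary", dictionary);  // instead of just "damiano"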

Jörn

On Thu, Oct 27, 2016 at 11:14 PM, Damiano Porta <damianopo...@gmail.com>
wrote:

> Jorn i add the Dictionary here:
> https://gist.github.com/anonymous/bc822fb0520c4c42b75748bf4147da34#file-train-java-L15
>
> And unfortunately i only see this error:
>
> java.lang.IllegalStateException: Missing serializer for damiano
> at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:610)
>
> I do not have other info.
> Do i have to create a custom Serializer too?
>
>
>
>
> 2016-10-27 22:04 GMT+02:00 Joern Kottmann <kottm...@gmail.com>:
>
> > On Thu, 2016-10-27 at 21:18 +0200, Joern Kottmann wrote:
> > > On Tue, 2016-10-25 at 18:49 +0200, Damiano Porta wrote:
> > > >
> > > > i am getting a strange error during the compiling of a NER model.
> > > > Basically, the end of the build output is:
> > > >
> > > >  98:  ... loglikelihood=-13340.018762351776 0.999005934601099
> > > >  99:  ... loglikelihood=-13258.358751926637 0.9990120681028991
> > > > 100:  ... loglikelihood=-13178.039964721707 0.9990177634974279
> > > > Exception in thread "main" java.lang.IllegalStateException: Missing
> > > > serializer for *mydictionary*
> > > > at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:610)
> > >
> > >
> > > Can you please post the full exception stack trace?
> > >
> >
> >
> > And what is the name of the key you used for the dictionary?
> > The dictionary serializers are only mapped by extension.
> >
> > Jörn
> >
>


Re: new tool training

2016-10-27 Thread Joern Kottmann
On Thu, 2016-10-27 at 16:04 +, Russ, Daniel (NIH/CIT) [E] wrote:
> Is it important to calculate the hash of all events?

I missed that question. No, this is included for debugging purposes only;
with the hash it is possible to see if two models have been trained from
exactly the same source with identical feature generation. I have used this
a lot to debug issues between OpenNLP versions.

Jörn


Re: Why can i not serialize a Dictionary ?

2016-10-27 Thread Joern Kottmann
On Thu, 2016-10-27 at 21:18 +0200, Joern Kottmann wrote:
> On Tue, 2016-10-25 at 18:49 +0200, Damiano Porta wrote:
> > 
> > i am getting a strange error during the compiling of a NER model.
> > Basically, the end of the build output is:
> > 
> >  98:  ... loglikelihood=-13340.018762351776 0.999005934601099
> >  99:  ... loglikelihood=-13258.358751926637 0.9990120681028991
> > 100:  ... loglikelihood=-13178.039964721707 0.9990177634974279
> > Exception in thread "main" java.lang.IllegalStateException: Missing
> > serializer for *mydictionary*
> > at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:610)
> 
> 
> Can you please post the full exception stack trace?
> 


And what is the name of the key you used for the dictionary?
The dictionary serializers are only mapped by extension.

Jörn


Re: Why can i not serialize a Dictionary ?

2016-10-27 Thread Joern Kottmann
On Tue, 2016-10-25 at 18:49 +0200, Damiano Porta wrote:
> i am getting a strange error during the compiling of a NER model.
> Basically, the end of the build output is:
> 
>  98:  ... loglikelihood=-13340.018762351776 0.999005934601099
>  99:  ... loglikelihood=-13258.358751926637 0.9990120681028991
> 100:  ... loglikelihood=-13178.039964721707 0.9990177634974279
> Exception in thread "main" java.lang.IllegalStateException: Missing
> serializer for *mydictionary*
> at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:610)


Can you please post the full exception stack trace?

Jörn


Re: new tool training

2016-10-27 Thread Joern Kottmann
On Thu, 2016-10-27 at 16:04 +, Russ, Daniel (NIH/CIT) [E] wrote:
> Hello,
> 
>    Okay, I found why my toy worked.  I call
> AbstractEventTrainer.doTrain(DataIndexer) as opposed to
> AbstractEventTrainer.train(ObjectStream).   The train method
> calls isValid(). That sets the value of threads in QNTrainer.
> 
>    Thank you for making me do this.  I don’t think doTrain() should
> be exposed anymore.  A new method train(DataIndexer) that calls
> isValid and then doTrain(indexer) is probably a better idea.  Is it
> important to calculate the hash of all events?


This should really be in the init method and not in isValid. What do
you think?

The init method was added for the pluggable ML support in 1.6.0, and I
believe I didn't see back then that this should be moved there.

Jörn


Re: new tool training

2016-10-27 Thread Joern Kottmann
On Thu, 2016-10-27 at 15:49 +, Russ, Daniel (NIH/CIT) [E] wrote:
> 
> Comment 2:
> Do you have a preference where the variable should go?  I think
> AbstractTrainer is the appropriate place for PSF variables dealing
> with ALL trainers, so Threads_(P/D) should be there.  I would remove
> them and refactor them out of TrainingParams.

TrainingParameters is the class which parses the passed-in params file.
There it has to know about "Algorithm"; all the others are specific to the
trainer implementation.
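
For example (a sketch using the 1.6.x constant and key names as I understand
them), the generic "Algorithm" key is parsed by TrainingParameters itself,
while a trainer-specific key like the thread count is only interpreted by the
trainer:

    import opennlp.tools.util.TrainingParameters;

    TrainingParameters params = TrainingParameters.defaultParams();
    // generic key, known to TrainingParameters
    params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT_QN");
    // trainer-specific key, interpreted by the QN trainer
    params.put("Threads", "4");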

I think AbstractTrainer is probably a good place for PSF variables
which deal with many/most trainers.


> Comment 3:
> Right I want to change the dataindexer.
> 
> So I have multiple models that classify data (Job descriptions) into
> Occupational Codes.  I know what the codes are a priori, and even if
> they are not in the training data, I need to make sure that there is
> SOME probability for the codes.  More importantly for each job
> description, I need to compare the probabilities returned for each
> output.  By forcing the output indices to have the same values, I can
> quickly compare them without re-mapping the output.
> 
> I tried to extend OnePassDataIndexer, but the indexing occurs during
> object construction, so I cannot set the known outputs before
> indexing occurs.
> 
> Of course I would not need the getDataIndexer() method, but it is
> defined in the Abstract class, why not in the Interface?


The thing is that with the current interface we can support
implementations which don't use the Data Indexer. This can be the case
when an implementation relies on external machine learning libraries.
Since 1.6.0 we have pluggable ML support.

I looked closer now; getDataIndexer is a factory method for the
Data Indexer. Maybe it would make sense to allow specifying a custom
class for data indexing as part of the training parameters? Then the
trainers which use the Data Indexer can just support that mechanism.
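
Purely as a sketch of that idea, something like the following could work; the
"DataIndexer" key and the indexer class name are hypothetical, not an existing
OpenNLP API:

    import opennlp.tools.ml.model.DataIndexer;
    import opennlp.tools.util.TrainingParameters;

    // hypothetical parameter naming the data indexer implementation to use
    TrainingParameters params = TrainingParameters.defaultParams();
    params.put("DataIndexer", "com.example.PreassignedOutcomeDataIndexer");

    // a DataIndexer-based trainer could then instantiate it by reflection
    // (inside a method that declares ReflectiveOperationException)
    DataIndexer indexer = (DataIndexer)
        Class.forName(params.getSettings().get("DataIndexer")).newInstance();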

Jörn


Re: new tool training

2016-10-27 Thread Joern Kottmann
On Thu, Oct 27, 2016 at 4:41 PM, Russ, Daniel (NIH/CIT) [E] <
dr...@mail.nih.gov> wrote:

> Hello,
>
> Background:
>I am developing a tool that uses OpenNLP.  I have a model that extends
> BaseModel, and several AbstractModels.  I allow the user (myself) to
> specify the TrainerType (GIS/QN) for each model by using a list of
> TrainingParameters.
>
> Potential Bugs:
>
> 1)Whenever I use QNTrainer, I get an error (number of threads <1).  I
> think the problem is that the parameters are initialized in the isValid()
> method instead of the init() method.  This works for GIS because in the
> doTrain(DataIndexer) method, the number of threads is a local variable
> taken from the TrainingParameters not a field in GIS.  This leads to
> another question. When it the isValid() method supposed to be called?  I am
> surprised that the TrainerFactory does not call it.
>
>
It should be called from the factory, I think. Currently it is only
called when the training starts in
AbstractEventTrainer.train(ObjectStream). It is always better to
make things fail as early as possible.

Can you share the exception stack trace? I don't really understand yet why
you get this error with the QNTrainer. I would like to investigate that.


>
> 2)The psf (public static final) String variables used by the
> TrainingParameters are all over the place.  The variables
> THEADS_(PARAM/DEFAULT) are defined in both QNTrainer and
> TrainingParameters.  It should be defined in one of the places. I am not
> sure that AbstractTrainer isn’t the best place to put THREADS_(P/D).  It
> isn’t just the variables Threads_(P/D), All the Training psf String
> variables from TrainingParameters are duplicated in AbstractTrainer.
>
>
I agree, the commonly used variables should only be in one place. Some
trainers have specific variables which are not shared. It would be nice to
get this refactored.



>
> 3)Should the Interface EventTrainer have a doTrainDataIndexer and a
> getDataIndexer method?  This is important to me because I extended
> OnePassDataIndexer to pre-assign the outputs.  I know the outputs a priori,
> and I want to quickly combine the results of the multiple models.  Since the
> getEventTrainer returns an EventTrainer instead of an AbstractEventTrainer,
> I cannot call doTrain(DataIndexer).  I cannot use the
> doTrain(ObjectStream); it creates a new OnePassIndexer.
>


I think it would be fine to add a second train method to EventTrainer which
takes a DataIndexer. The current train method should probably be changed a
bit, and not do the init things.
I would like to understand your usage here a bit better.

So you want to have control over the DataIndexer which is used for
training, right?
Another option could be to have a second train(ObjectStream,
DataIndexer) method.
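
To make the two options concrete, a rough sketch of the interface additions
being discussed; this is a proposal, not the current EventTrainer API:

    import java.io.IOException;
    import opennlp.tools.ml.model.DataIndexer;
    import opennlp.tools.ml.model.Event;
    import opennlp.tools.ml.model.MaxentModel;
    import opennlp.tools.util.ObjectStream;

    public interface EventTrainer {

      // existing method
      MaxentModel train(ObjectStream<Event> events) throws IOException;

      // option 1: the caller supplies an already configured DataIndexer
      MaxentModel train(DataIndexer indexer) throws IOException;

      // option 2: the caller supplies both the events and the indexer to use
      MaxentModel train(ObjectStream<Event> events, DataIndexer indexer) throws IOException;
    }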

And why would you need a getDataIndexer method if you can pass in your own
instance?


Jörn


Re: Custom Features Generator example

2016-10-25 Thread Joern Kottmann
We should probably create an example and add it to our documentation.
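
Until that happens, here is a minimal sketch of such a generator; the class
and attribute names are taken from this thread, and the feature emitted in
createFeatures is just a placeholder:

    import java.util.List;
    import java.util.Map;

    import opennlp.tools.util.InvalidFormatException;
    import opennlp.tools.util.featuregen.CustomFeatureGenerator;
    import opennlp.tools.util.featuregen.FeatureGeneratorResourceProvider;

    public class SpanFeatureGenerator extends CustomFeatureGenerator {

      private String prefix;
      private int prevWindowSize;

      // public no-arg constructor so the ExtensionLoader can instantiate it
      public SpanFeatureGenerator() {
      }

      @Override
      public void init(Map<String, String> properties,
          FeatureGeneratorResourceProvider resourceProvider) throws InvalidFormatException {
        // attributes from the XML descriptor arrive here
        prefix = properties.get("prefix");
        prevWindowSize = Integer.parseInt(properties.get("prevWindowSize"));
      }

      @Override
      public void createFeatures(List<String> features, String[] tokens, int index,
          String[] previousOutcomes) {
        // placeholder feature, just to show where features are emitted
        features.add(prefix + "=" + tokens[index].toLowerCase());
      }

      @Override
      public void updateAdaptiveData(String[] tokens, String[] outcomes) {
      }

      @Override
      public void clearAdaptiveData() {
      }
    }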

Jörn

On Tue, Oct 25, 2016 at 1:39 PM, Joern Kottmann <kottm...@gmail.com> wrote:

> You need to use a constructor which is public and has no arguments.
>
> The parameters can be passed in only if you extend CustomFeatureGenerator.
> That one has an init method which gives you the attributes defined in the
> xml descriptor.
>
> HTH,
> Jörn
>
> On Tue, Oct 25, 2016 at 12:43 PM, Damiano Porta <damianopo...@gmail.com>
> wrote:
>
>> Joern,
>> However i also tried with:
>>
>> public SpanFeatureGenerator(Map<String, String> properties,
>> FeatureGeneratorResourceProvider resourceProvider) throws
>> InvalidFormatException {
>>
>> }
>>
>> but i get the same exception.
>> Damiano
>>
>> 2016-10-25 12:30 GMT+02:00 Damiano Porta <damianopo...@gmail.com>:
>>
>> > This at the moment:
>> >
>> > public SpanFeatureGenerator(String prefix, Object finder, int
>> > prevWindowSize,  int nextWindowSize) {
>> >
>> > System.out.println(prefix);
>> > System.out.println((String)finder);
>> > System.out.println(prevWindowSize);
>> > System.out.println(nextWindowSize);
>> > System.exit(1);
>> >
>> > }
>> >
>> > It is obviously a test to understand if my generator is called.
>> >
>> >
>> > 2016-10-25 12:23 GMT+02:00 Joern Kottmann <kottm...@gmail.com>:
>> >
>> >> What is the constructor of the
>> >> com.damiano.parser.generator.SpanFeatureGenerator
>> >> class?
>> >>
>> >> Jörn
>> >>
>> >> On Tue, Oct 25, 2016 at 11:51 AM, Damiano Porta <
>> damianopo...@gmail.com>
>> >> wrote:
>> >>
>> >> > Hello,
>> >> > I have created a custom generator implementing the
>> >> AdaptiveFeatureGenerator
>> >> > interface.
>> >> >
>> >> > I am getting this error:
>> >> >
>> >> > Exception in thread "main"
>> >> > opennlp.tools.util.ext.ExtensionNotLoadedException:
>> >> > java.lang.InstantiationException:
>> >> > com.damiano.parser.generator.SpanFeatureGenerator
>> >> > at
>> >> > opennlp.tools.util.ext.ExtensionLoader.instantiateExtension(
>> >> > ExtensionLoader.java:72)
>> >> > at
>> >> > opennlp.tools.util.featuregen.GeneratorFactory$
>> >> > CustomFeatureGeneratorFactory.create(GeneratorFactory.java:582)
>> >> > at
>> >> > opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
>> >> > GeneratorFactory.java:661)
>> >> > at
>> >> > opennlp.tools.util.featuregen.GeneratorFactory$
>> >> > AggregatedFeatureGeneratorFactory.create(GeneratorFactory.java:129)
>> >> > at
>> >> > opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
>> >> > GeneratorFactory.java:661)
>> >> > at
>> >> > opennlp.tools.util.featuregen.GeneratorFactory$
>> >> > CachedFeatureGeneratorFactory.create(GeneratorFactory.java:171)
>> >> > at
>> >> > opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
>> >> > GeneratorFactory.java:661)
>> >> > at
>> >> > opennlp.tools.util.featuregen.GeneratorFactory$
>> >> > AggregatedFeatureGeneratorFactory.create(GeneratorFactory.java:129)
>> >> > at
>> >> > opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
>> >> > GeneratorFactory.java:661)
>> >> > at
>> >> > opennlp.tools.util.featuregen.GeneratorFactory.create(
>> >> > GeneratorFactory.java:711)
> >> > at opennlp.tools.namefind.TokenNameFinderFactory.createFeatureGenerators(TokenNameFinderFactory.java:153)
> >> > at opennlp.tools.namefind.TokenNameFinderFactory.createContextGenerator(TokenNameFinderFactory.java:118)
>> >> > at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:333)
>> >> > at com.damiano.parser.trainer.NER.compileNER(NER.java:161)
>> >> > at com.damiano.parser.trainer.NER.main(NER.java:136)
>> >> >
>> >> > Caused by: java.lang.InstantiationException:
>> >> > com.damiano.parser.generator.SpanFeatureGenerator
>> >> > at java.lang.Class.newInstance(Class.java:427)
>> >> > at
>> >> > opennlp.tools.util.ext.ExtensionLoader.instantiateExtension(
>> >> > ExtensionLoader.java:70)
>> >> > ... 14 more
>> >> >
>> >> > Caused by: java.lang.NoSuchMethodException:
> >> > com.damiano.parser.generator.SpanFeatureGenerator.<init>()
>> >> > at java.lang.Class.getConstructor0(Class.java:3082)
>> >> > at java.lang.Class.newInstance(Class.java:412)
>> >> > ... 15 more
>> >> >
>> >> > the xml is:
>> >> >
>> >> > 
>> >> >   
>> >> > 
>> >> >   
>> >> > 
>> >> >   
>> >> >   
>> >> > 
>> >> >   
>> >> >   
>> >> >   
>> >> >   
>> >> >   
>> >> >   > >> > prefix="name" finder="blablabla" prevWindowSize="3"
>> nextWindowSize="3"/>
>> >> > 
>> >> >   
>> >> > 
>> >> >
>> >> > What can i do?
>> >> > Thank you!
>> >> >
>> >> > Damiano
>> >> >
>> >>
>> >
>> >
>>
>
>


Re: Custom Features Generator example

2016-10-25 Thread Joern Kottmann
You need to use a constructor which is public and has no arguments.

The parameters can be passed in only if you extend CustomFeatureGenerator.
That one has an init method which gives you the attributes defined in the
xml descriptor.

HTH,
Jörn

On Tue, Oct 25, 2016 at 12:43 PM, Damiano Porta <damianopo...@gmail.com>
wrote:

> Joern,
> However i also tried with:
>
> public SpanFeatureGenerator(Map<String, String> properties,
> FeatureGeneratorResourceProvider resourceProvider) throws
> InvalidFormatException {
>
> }
>
> but i get the same exception.
> Damiano
>
> 2016-10-25 12:30 GMT+02:00 Damiano Porta <damianopo...@gmail.com>:
>
> > This at the moment:
> >
> > public SpanFeatureGenerator(String prefix, Object finder, int
> > prevWindowSize,  int nextWindowSize) {
> >
> > System.out.println(prefix);
> > System.out.println((String)finder);
> > System.out.println(prevWindowSize);
> > System.out.println(nextWindowSize);
> > System.exit(1);
> >
> > }
> >
> > It is obviously a test to understand if my generator is called.
> >
> >
> > 2016-10-25 12:23 GMT+02:00 Joern Kottmann <kottm...@gmail.com>:
> >
> >> What is the constructor of the
> >> com.damiano.parser.generator.SpanFeatureGenerator
> >> class?
> >>
> >> Jörn
> >>
> >> On Tue, Oct 25, 2016 at 11:51 AM, Damiano Porta <damianopo...@gmail.com
> >
> >> wrote:
> >>
> >> > Hello,
> >> > I have created a custom generator implementing the
> >> AdaptiveFeatureGenerator
> >> > interface.
> >> >
> >> > I am getting this error:
> >> >
> >> > Exception in thread "main"
> >> > opennlp.tools.util.ext.ExtensionNotLoadedException:
> >> > java.lang.InstantiationException:
> >> > com.damiano.parser.generator.SpanFeatureGenerator
> >> > at
> >> > opennlp.tools.util.ext.ExtensionLoader.instantiateExtension(
> >> > ExtensionLoader.java:72)
> >> > at
> >> > opennlp.tools.util.featuregen.GeneratorFactory$
> >> > CustomFeatureGeneratorFactory.create(GeneratorFactory.java:582)
> >> > at
> >> > opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
> >> > GeneratorFactory.java:661)
> >> > at
> >> > opennlp.tools.util.featuregen.GeneratorFactory$
> >> > AggregatedFeatureGeneratorFactory.create(GeneratorFactory.java:129)
> >> > at
> >> > opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
> >> > GeneratorFactory.java:661)
> >> > at
> >> > opennlp.tools.util.featuregen.GeneratorFactory$
> >> > CachedFeatureGeneratorFactory.create(GeneratorFactory.java:171)
> >> > at
> >> > opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
> >> > GeneratorFactory.java:661)
> >> > at
> >> > opennlp.tools.util.featuregen.GeneratorFactory$
> >> > AggregatedFeatureGeneratorFactory.create(GeneratorFactory.java:129)
> >> > at
> >> > opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
> >> > GeneratorFactory.java:661)
> >> > at
> >> > opennlp.tools.util.featuregen.GeneratorFactory.create(
> >> > GeneratorFactory.java:711)
> >> > at
> >> > opennlp.tools.namefind.TokenNameFinderFactory.
> createFeatureGenerators(
> >> > TokenNameFinderFactory.java:153)
> >> > at
> >> > opennlp.tools.namefind.TokenNameFinderFactory.createContextGenerator(
> >> > TokenNameFinderFactory.java:118)
> >> > at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:333)
> >> > at com.damiano.parser.trainer.NER.compileNER(NER.java:161)
> >> > at com.damiano.parser.trainer.NER.main(NER.java:136)
> >> >
> >> > Caused by: java.lang.InstantiationException:
> >> > com.damiano.parser.generator.SpanFeatureGenerator
> >> > at java.lang.Class.newInstance(Class.java:427)
> >> > at
> >> > opennlp.tools.util.ext.ExtensionLoader.instantiateExtension(
> >> > ExtensionLoader.java:70)
> >> > ... 14 more
> >> >
> >> > Caused by: java.lang.NoSuchMethodException:
> >> > com.damiano.parser.generator.SpanFeatureGenerator.<init>()
> >> > at java.lang.Class.getConstructor0(Class.java:3082)
> >> > at java.lang.Class.newInstance(Class.java:412)
> >> > ... 15 more
> >> >
> >> > the xml is:
> >> >
> >> > 
> >> >   
> >> > 
> >> >   
> >> > 
> >> >   
> >> >   
> >> > 
> >> >   
> >> >   
> >> >   
> >> >   
> >> >   
> >> >>> > prefix="name" finder="blablabla" prevWindowSize="3"
> nextWindowSize="3"/>
> >> > 
> >> >   
> >> > 
> >> >
> >> > What can i do?
> >> > Thank you!
> >> >
> >> > Damiano
> >> >
> >>
> >
> >
>


Re: Custom Features Generator example

2016-10-25 Thread Joern Kottmann
What is the constructor of the
com.damiano.parser.generator.SpanFeatureGenerator
class?

Jörn

On Tue, Oct 25, 2016 at 11:51 AM, Damiano Porta 
wrote:

> Hello,
> I have created a custom generator implementing the AdaptiveFeatureGenerator
> interface.
>
> I am getting this error:
>
> Exception in thread "main"
> opennlp.tools.util.ext.ExtensionNotLoadedException:
> java.lang.InstantiationException:
> com.damiano.parser.generator.SpanFeatureGenerator
> at
> opennlp.tools.util.ext.ExtensionLoader.instantiateExtension(
> ExtensionLoader.java:72)
> at
> opennlp.tools.util.featuregen.GeneratorFactory$
> CustomFeatureGeneratorFactory.create(GeneratorFactory.java:582)
> at
> opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
> GeneratorFactory.java:661)
> at
> opennlp.tools.util.featuregen.GeneratorFactory$
> AggregatedFeatureGeneratorFactory.create(GeneratorFactory.java:129)
> at
> opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
> GeneratorFactory.java:661)
> at
> opennlp.tools.util.featuregen.GeneratorFactory$
> CachedFeatureGeneratorFactory.create(GeneratorFactory.java:171)
> at
> opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
> GeneratorFactory.java:661)
> at
> opennlp.tools.util.featuregen.GeneratorFactory$
> AggregatedFeatureGeneratorFactory.create(GeneratorFactory.java:129)
> at
> opennlp.tools.util.featuregen.GeneratorFactory.createGenerator(
> GeneratorFactory.java:661)
> at
> opennlp.tools.util.featuregen.GeneratorFactory.create(
> GeneratorFactory.java:711)
> at
> opennlp.tools.namefind.TokenNameFinderFactory.createFeatureGenerators(
> TokenNameFinderFactory.java:153)
> at
> opennlp.tools.namefind.TokenNameFinderFactory.createContextGenerator(
> TokenNameFinderFactory.java:118)
> at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:333)
> at com.damiano.parser.trainer.NER.compileNER(NER.java:161)
> at com.damiano.parser.trainer.NER.main(NER.java:136)
>
> Caused by: java.lang.InstantiationException:
> com.damiano.parser.generator.SpanFeatureGenerator
> at java.lang.Class.newInstance(Class.java:427)
> at
> opennlp.tools.util.ext.ExtensionLoader.instantiateExtension(
> ExtensionLoader.java:70)
> ... 14 more
>
> Caused by: java.lang.NoSuchMethodException:
> com.damiano.parser.generator.SpanFeatureGenerator.<init>()
> at java.lang.Class.getConstructor0(Class.java:3082)
> at java.lang.Class.newInstance(Class.java:412)
> ... 15 more
>
> the xml is:
>
> 
>   
> 
>   
> 
>   
>   
> 
>   
>   
>   
>   
>   
>prefix="name" finder="blablabla" prevWindowSize="3" nextWindowSize="3"/>
> 
>   
> 
>
> What can i do?
> Thank you!
>
> Damiano
>


Re: ContextGenerator

2016-10-24 Thread Joern Kottmann
Hello,

the ContextGenerator is not used much anymore and was replaced with context
generators which are specific to a component.
I think we can safely make it generic, and the change wouldn't break
backward compatibility anyway.
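
A sketch of what OPENNLP-870 asks for could look like this; the exact method
signature is an assumption on my side:

    // hypothetical generified ContextGenerator
    public interface ContextGenerator<T> {
      String[] getContext(T object);
    }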

Jörn

On Fri, Oct 21, 2016 at 3:40 PM, Russ, Daniel (NIH/CIT) [E] <
dr...@mail.nih.gov> wrote:

> Hello,
> Can we please make ContextGenerator a Generic type?  I open a JIRA
> issue (OPENNLP-870).  It is a simple fix.  I can try to send a git diff.
> Daniel
>
> Daniel Russ, Ph.D.
> Staff Scientist, Office of Intramural Research
> Center for Information Technology
> National Institutes of Health
> U.S. Department of Health and Human Services
> 12 South Drive
> Bethesda, MD 20892-5624
>
>


Re: Access to Git

2016-10-21 Thread Joern Kottmann
Infra is aware of the problem and working on it. Currently opennlp.git
doesn't sync with Github and we don't get commit mails.

The website still needs to be updated; currently we don't have SVN write
access, so we can't commit to it.

Jörn

On Mon, Oct 3, 2016 at 9:16 AM, Rodrigo Agerri <rage...@apache.org> wrote:

> The first one
>
> Fri Sep 23 13:59:00 2016 +0200
>
> the last one
>
> Wed Sep 28 14:54:04 2016 +0200
>
> Cheers,
>
> R
>
> On Fri, Sep 30, 2016 at 1:07 PM, Tommaso Teofili
> <tommaso.teof...@gmail.com> wrote:
> > when did you push them ? Another project I'm involved in had the very
> same
> > problem, after contacting infra@ and doing a trivial commit the mirror
> > sync'ed again.
> >
> > Regards,
> > Tommaso
> >
> > On Fri, 30 Sep 2016 at 13:02, Rodrigo Agerri <rage...@apache.org>
> > wrote:
> >
> >> Hello,
> >>
> >> I have committed and push some stuff using the git repo, but it
> >> appears not in the github mirror
> >>
> >> https://github.com/apache/opennlp
> >>
> >> or in the svn repo
> >>
> >> http://svn.apache.org/viewvc/opennlp/trunk/
> >>
> >> it does however appear in the original git repo
> >>
> >> https://git-wip-us.apache.org/repos/asf?p=opennlp.git;a=summary
> >>
> >> Is this intentional?
> >>
> >> Cheers,
> >>
> >> Rodrigo
> >>
> >> On Mon, Sep 19, 2016 at 11:50 PM, Joern Kottmann <kottm...@gmail.com>
> >> wrote:
> >> > The opennlp-addons repo is now also available, and opennlp-sandbox
> will
> >> > be available soon.
> >> >
> >> > Jörn
> >> >
> >> >
> >> > On Thu, 2016-09-15 at 01:12 +0200, Joern Kottmann wrote:
> >> >> Sorry, it took me a little to figure this out.
> >> >>
> >> >> This link explains how it works:
> >> >> https://reference.apache.org/committer/git
> >> >>
> >> >> > The reponame is opennlp, we will soon also have the other repos
> >> > opennlp-addons and opennlp-sandbox.
> >> >>
> >> >> Jörn
> >> >>
> >> >> > > On Fri, Sep 9, 2016 at 10:58 PM, Joern Kottmann <
> kottm...@gmail.com
> >> >
> >> > wrote:
> >> >> > > > Hello, yes you can use it. The add-ons and other things are not
> >> > setup yet as far as I know, have to ping the infra team about it.
> >> >> > Please have a look at the issue I posted to see how to access it.
> >> >> > I will work on this on Monday.
> >> >> > HTH
> >> >> >
> >> >> > Jörn
> >> >> >
> >> >> > > > > > On Sep 9, 2016 19:10, "William Colen" <
> >> william.co...@gmail.com>
> >> > wrote:
> >> >> > > Hello,
> >> >> > >
> >> >> > >
> >> >> > > Is the Git repository ready for use?
> >> >> > >
> >> >> > > Do we need to wait for it to develop new stuff?
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > Thank you,
> >> >> > >
> >> >> > > William
> >> >> > >
> >> >> > >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >>
> >> >>
> >>
>


Re: Moving brat annotator to opennlp.git

2016-10-19 Thread Joern Kottmann
There is a dedicated servlet which implements exactly the protocol brat
requires. We can extend it to make it available for other tools.

Do you know any other annotation tools we might want to support? As far as
I am aware there is brat and not much else.

We should add support for the POS Tagger and Chunker, also in the formats
package. Shouldn't be too much work.

Jörn


On Wed, Oct 19, 2016 at 8:42 PM, William Colen <william.co...@gmail.com>
wrote:

> +1
>
> Do you think later we can expand the annotator server to other tools?
>
>
> 2016-10-19 7:05 GMT-02:00 Madhawa Kasun Gunasekara <madhaw...@gmail.com>:
>
> > +1
> >
> > Madhawa
> >
> > On Wed, Oct 19, 2016 at 2:20 PM, "Shuo Xu" <pzc...@gmail.com> wrote:
> >
> > > +1
> > >
> > >
> > > On Wed, Oct 19, 2016 at 12:46 AM, Joern Kottmann <kottm...@gmail.com>
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > what do you think about including the brat ner annotator in the 1.6.1
> > > > release?
> > > >
> > > > I believe it is important that we include it to allow our users to
> > easier
> > > > run custom annotation projects, as part of the move we need to extend
> > the
> > > > documentation so everyone can easily get it up and running and
> > understand
> > > > how it is supposed to work.
> > > >
> > > > Jörn
> > > >
> > >
> > >
> > >
> > > --
> > > 徐硕 XU Shuo
> > > 中国科学技术信息研究所  Institute of Scientific and Technical
> > Information
> > > of China (ISTIC)
> > > 北京市海淀区复兴路15号  No. 15 Fuxing Rd., Haidian District, Beijing
> > > 100038, P.R. China
> > > 电话:+86-10-58882447(O)  Tel: +86-10-58882447 (O)
> > > BLOG:http://blog.sciencenet.cn/u/xiaohai2008
> > > E-mail: "XU Shuo"
> > >"XU Shuo"
> > >
> >
>


Moving brat annotator to opennlp.git

2016-10-18 Thread Joern Kottmann
Hello all,

what do you think about including the brat ner annotator in the 1.6.1
release?

I believe it is important that we include it to allow our users to more
easily run custom annotation projects; as part of the move we need to extend
the documentation so everyone can easily get it up and running and understand
how it is supposed to work.

Jörn


Re: Morfologik Addon

2016-10-13 Thread Joern Kottmann
We could distribute it with our main release, similar to how we do with
opennlp-uima. I think that would make sense. If people would like to use it
they can add it as an extra dependency.

There are probably also other things we can distribute in a similar fashion
with the next release.

Jörn

On Fri, Jul 15, 2016 at 3:34 PM, William Colen 
wrote:

> Not only licensing, but also I think we try to keep OpenNLP without
> external dependencies. The Morfologik also has some dependencies itself.
>
>
> 2016-07-15 4:55 GMT-03:00 Rodrigo Agerri :
>
> > Great stuff, William.
> >
> > I have been using Morfologik stemming for a long time and when we
> > included it we put it as an addon. I assume that the reason was its
> > license, but reading the Morfologik license it is not clear to me why it
> > is not Apache compatible.
> >
> > If it is, it would be nice to include it directly in OpenNLP.
> >
> > Can anyone shed any light on this?
> >
> > Thanks,
> >
> > R
> >
> > On Fri, Jul 15, 2016 at 12:02 AM, William Colen  >
> > wrote:
> > > Hello,
> > >
> > > A while back we started working on a Morfologik Addon.
> > >
> > > http://svn.apache.org/viewvc/opennlp/addons/
> > >
> > > I checked it out last week and noticed it was outdated, especially
> because
> > it
> > > was not using the latest Morfologik version. Also it was missing
> > > documentation.
> > >
> > > You can find more about Morfologik here:
> > > https://github.com/morfologik/morfologik-stemming
> > >
> > > Morfologik provides tools for finite state automata (FSA) construction
> > and
> > > dictionary-based morphological dictionaries.
> > >
> > > The Morfologik Addon implements some OpenNLP interfaces and extends
> some
> > > classes to make it easier to use of FSA Morfologik dictionaries:
> > >
> > >- opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory
> > >   - Extends: opennlp.tools.postag.POSTaggerFactory
> > >   - Helps creating a POSTagger model with an embedded TagDictionary
> > >   based on FSA
> > >- opennlp.morfologik.tagdict.MorfologikTagDictionary
> > >- Implements: opennlp.tools.postag.TagDictionary
> > >   - A TagDictionary based on FSA is much smaller than the default
> XML
> > >   based, and consumes less memory.
> > >- opennlp.morfologik.lemmatizer.MorfologikLemmatizer
> > >- Implements: opennlp.tools.lemmatizer.DictionaryLemmatizer
> > >   - A dictionary based lemmatizer that uses FSA dictionary.
> > >
> > > It also provides a command line interface that allows:
> > >
> > >- MorfologikDictionaryBuilder
> > >   - builds a binary POS Dictionary using Morfologik
> > >- XMLDictionaryToTable
> > >   - reads an OpenNLP XML tag dictionary and outputs it in a tab
> > >   separated file that can be built into a FSA dictionary
> > >
> > >
> > > In a project I developed it was of great help. The TAG Dictionary for
> POS
> > > Tag was huge (something like 50 MB), requiring a lot of memory.
> > > Migrating it to a FSA dictionary allowed not only a smaller model, but
> > also
> > > I could use the model without the need to increase the JVM memory.
> > >
> > > More here:
> > >
> > https://cwiki.apache.org/confluence/display/OPENNLP/FSA+Dictionary+with+
> morfologik-addon
> > >
> > > Hope it will be helpful.
> > >
> > > William
> >
>


Re: Access to Git

2016-09-19 Thread Joern Kottmann
The opennlp-addons repo is now also available, and opennlp-sandbox will
be available soon.

Jörn


On Thu, 2016-09-15 at 01:12 +0200, Joern Kottmann wrote:
> Sorry, it took me a little to figure this out.
> 
> This link explains how it works:
> https://reference.apache.org/committer/git
> 
> > The reponame is opennlp, we will soon also have the other repos
opennlp-addons and opennlp-sandbox.
> 
> Jörn
> 
> > > On Fri, Sep 9, 2016 at 10:58 PM, Joern Kottmann <kottm...@gmail.com>
wrote:
> > > > Hello, yes you can use it. The add-ons and other things are not
setup yet as far as I know, have to ping the infra team about it.
> > Please have a look at the issue I posted to see how to access it.
> > I will work on this on Monday.
> > HTH 
> > 
> > Jörn 
> > 
> > > > > > On Sep 9, 2016 19:10, "William Colen" <william.co...@gmail.com>
wrote:
> > > Hello,
> > > 
> > > 
> > > Is the Git repository ready for use?
> > > 
> > > Do we need to wait for it to develop new stuff?
> > > 
> > > 
> > > 
> > > Thank you,
> > > 
> > > William
> > > 
> > > 
> > 
> > 
> > 
> > 
> > 
> 
> 


Re: Access to Git

2016-09-14 Thread Joern Kottmann
Sorry, it took me a little to figure this out.

This link explains how it works:
https://reference.apache.org/committer/git

The reponame is opennlp, we will soon also have the other repos
opennlp-addons and opennlp-sandbox.

Jörn

On Fri, Sep 9, 2016 at 10:58 PM, Joern Kottmann <kottm...@gmail.com> wrote:

> Hello, yes you can use it. The add-ons and other things are not setup yet
> as far as I know, have to ping the infra team about it.
>
> Please have a look at the issue I posted to see how to access it.
>
> I will work on this on Monday.
>
> HTH
> Jörn
>
> On Sep 9, 2016 19:10, "William Colen" <william.co...@gmail.com> wrote:
>
>> Hello,
>>
>> Is the Git repository ready for use?
>> Do we need to wait for it to develop new stuff?
>>
>> Thank you,
>> William
>>
>


Re: Access to Git

2016-09-09 Thread Joern Kottmann
Hello, yes you can use it. The add-ons and other things are not set up yet
as far as I know; I have to ping the infra team about it.

Please have a look at the issue I posted to see how to access it.

I will work on this on Monday.

HTH
Jörn

On Sep 9, 2016 19:10, "William Colen"  wrote:

> Hello,
>
> Is the Git repository ready for use?
> Do we need to wait for it to develop new stuff?
>
> Thank you,
> William
>


Re: Is sentence detection process really needed?

2016-08-26 Thread Joern Kottmann
The name finder has the concept of "adaptive data" in the feature
generation. The feature generators can remember things from previous
sentences and use them to generate features. Usually that helps the
recognition rate if you have names that are repeated. You can tweak this to
your data, or just pass in the entire document.
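
For example (a minimal sketch; the model and the documentSentences variables
are assumed to exist already), the adaptive data is carried across find()
calls and reset between documents:

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    NameFinderME nameFinder = new NameFinderME(model);

    // feed all sentences of one document so adaptive features from earlier
    // sentences (e.g. a name seen before) can help in later ones
    for (String[] sentence : documentSentences) {
      Span[] names = nameFinder.find(sentence);
      // ... use the spans
    }

    // reset the adaptive data before starting the next document
    nameFinder.clearAdaptiveData();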

Jörn

On Fri, Aug 26, 2016 at 3:25 PM, Damiano Porta 
wrote:

> Hi!
> Yes I can train a good model (sure, it will take a lot of time), I have 30k
> resumes. So the "data" isn't a problem.
> I thought about many things, i am also creating a custom features
> generator, with dictionary too (for names) and regex for Birthday,  then
> the machine learning will look at their contexts.
> So now i need to separate the sentences to create a custom model.
> At this point i will not try with one per line CV.
>
> On 26 Aug 2016 15:10, "Russ, Daniel (NIH/CIT) [E]" wrote:
>
> Hi Damiano,
>I am not sure that the NameFinder will be effective as-is for you.  Do
> you have training data (and I mean a lot of training data)?  You need to
> consider what feature are useful in your case.  You might consider a
> feature such as line number on the page (since people tend to put their
> name on the top or second line), maybe the font-size.  You can add a
> dictionary of common names and have a feature “inDictionary”. You will have
> to use your domain knowledge to help you here.
>
>   For birthday you may want to consider using regex to pick out dates.
> Then look at the context around the date (words before/after, remove
> graduated, or if another date is just before) or maybe years before the
> present year (if you are looking at resumes, you probably won’t find any
> 5 year olds or 200 year olds).
>
> Daniel Russ, Ph.D.
> Staff Scientist, Office of Intramural Research
> Center for Information Technology
> National Institutes of Health
> U.S. Department of Health and Human Services
> 12 South Drive
> Bethesda,  MD 20892-5624
>
> On Aug 26, 2016, at 5:57 AM, Damiano Porta > wrote:
>
> Hi Daniel!
>
> Thank you so much for your opinion.
> It makes perfect sense. But I am still a bit confused about the length of
> the sentences.
> In a resume there are many names, dates etc etc. So my doubt is regarding
> the structure of the sentences because they follow specific patterns
> sometimes.
>
> For example i need to extract the personal name, (Who wrote the resume) the
> Birthday etc etc.
>
> As you know there are many names and dates inside a resume, so I thought
> about writing the entire resume as one sentence to also train, more or less,
> the "position" of the entities. If I "decompose" the whole resume into
> sentences I will lose this information. No?
>
> Damiano
>
> On 25 Aug 2016 16:26, "Russ, Daniel (NIH/CIT) [E]" wrote:
>
> Hi Damiano,
>
> Everyone can feel free to correct my ignorance but I view the
> name finder as follows.
>
> I look at it as walking down the sentence and classifying words as
> “NOT IN NAME” until I hit the start of a name, then it is “START NAME”,
> followed by “STILL IN NAME” until “NOT IN NAME”.  Take the sentence “Did
> John eat the stew”.  Starting with the first word in the sentence decide
> what are the odds that the first word starts a name (given that it is the
> first word happens to be “Did” in a sentence, with a capital but not all
> caps) starts a person’s name.  Then go to the next word in the sentence.
> If the first word was not in a name, what are the odds that the second word
> starts a name (given that the previous word did not start a name, the word
> starts with a capital (but not all capital), the word is John, and the
> previous word is “Did”).  If it decides that we are starting a name at
> “John”, we are now looking for the end.  What are the odds that “eat” is
> part of the name given that [“Did”: was not part of the name, was
> capitalized] and that [“John”: was the first word in the name, was
> capitalized].   You are essentially classifying [Did <- OTHER] [John
> <-START] [eat<-OTHER] [the<-OTHER] [stew<-OTHER].  If it was “Did John
> Smith eat the stew”.  You would have [Did <- OTHER] [John
> <-START][Smith<-IN] [eat<-OTHER] [the<-OTHER] [stew<-OTHER].  There are
> other features other than just word, previous word, and the shape (first
> letter capitalized, all letters capitalized).  I think the name finder uses
> part of speech also.
>
>
>So you see that it is not a name lookup table, but dependent on the
> previous classification of words earlier in the sentence.  Therefore, you
> must have sentences. Does that help?
> Daniel
>
>
> Daniel Russ, Ph.D.
> Staff Scientist, Office of Intramural Research
> Center for Information Technology
> National Institutes of Health
> U.S. Department of Health and Human Services
> 12 South Drive
> Bethesda,  MD 20892-5624
>
> On Aug 25, 

Re: Migrate to Git?

2016-08-19 Thread Joern Kottmann
I don't see the advantage of having multiple repositories, because that
makes it harder to check everything out and to move things around without
losing history (git mv).

Why do you think it is better?

Jörn

On Thu, Aug 18, 2016 at 4:33 PM, Chris Mattmann <mattm...@apache.org> wrote:

> Fantastic, Joern! I have some SentimentAnalysis stuff to hopefully commit
> and
> get refactored. Hopefully after that’s done we can ship a release soon and
> publish to Central.
>
>
>
> On 8/18/16, 5:50 AM, "Joern Kottmann" <kottm...@gmail.com> wrote:
>
> We made some progress here, the repository is now switched to git.
>
> Please have a look here:
> https://issues.apache.org/jira/browse/INFRA-12209
>
> And there are couple of things we have to do now:
> https://issues.apache.org/jira/browse/OPENNLP-860
>
> The new repository currently only contains the trunk and not the other
> stuff like addons, site and sandbox,
> I already commented on the infra issue, we might want to change the
> layout
> of our repository a bit.
> Any thoughts on it?
>
> The old layout is:
> addons
> trunk
> sandbox
> site
>
> BR,
> Jörn
>
> On Tue, Jul 5, 2016 at 3:11 AM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
> > Hi Jörn,
> >
> > #3 is a mirror on Github of our writeable Git repo from #1. Users
> > can submit PRs to it, and then it will flow through to dev list in
> > the form of an email that links to information that we can use to
> > easily merge into our writeable ASF repo. Once merged, it will sync
> > out to Github and close the PR.
> >
> > HTH!
> >
> > Cheers,
> > Chris
> >
> > ++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++
> > Director, Information Retrieval and Data Science Group (IRDS)
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > WWW: http://irds.usc.edu/
> > ++
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On 7/4/16, 1:23 PM, "Joern Kottmann" <kottm...@gmail.com> wrote:
> >
> > >Can you explain 3, is that a writable mirror at Github?
> > >
> > >Jörn
> > >
> > >On Mon, 2016-07-04 at 15:35 +, Mattmann, Chris A (3980) wrote:
> > >> My +1 as well..I would suggest, specifically:
> > >>
> > >> 1. Use git-wp
> > >> 2. Borrow and adapt this guide which suggests how to do it
> > >> (i’m happy to adapt)
> > >> http://wiki.apache.org/tika/UsingGit
> > >> 3. Turn on writeable git wp mirror’ing to apache/opennlp
> > >>
> > >> Cheers,
> > >> Chris
> > >>
> > >> 
> ++
> > >> Chris Mattmann, Ph.D.
> > >> Chief Architect
> > >> Instrument Software and Science Data Systems Section (398)
> > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > >> Office: 168-519, Mailstop: 168-527
> > >> Email: chris.a.mattm...@nasa.gov
> > >> WWW:  http://sunset.usc.edu/~mattmann/
> > >> 
> ++
> > >> Director, Information Retrieval and Data Science Group (IRDS)
> > >> Adjunct Associate Professor, Computer Science Department
> > >> University of Southern California, Los Angeles, CA 90089 USA
> > >> WWW: http://irds.usc.edu/
> > >> 
> ++
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > &

Re: Migrate to Git?

2016-08-18 Thread Joern Kottmann
We made some progress here, the repository is now switched to git.

Please have a look here:
https://issues.apache.org/jira/browse/INFRA-12209

And there are couple of things we have to do now:
https://issues.apache.org/jira/browse/OPENNLP-860

The new repository currently only contains the trunk and not the other
stuff like addons, site and sandbox.
I already commented on the infra issue; we might want to change the layout
of our repository a bit.
Any thoughts on it?

The old layout is:
addons
trunk
sandbox
site

BR,
Jörn

On Tue, Jul 5, 2016 at 3:11 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Jörn,
>
> #3 is a mirror on Github of our writeable Git repo from #1. Users
> can submit PRs to it, and then it will flow through to dev list in
> the form of an email that links to information that we can use to
> easily merge into our writeable ASF repo. Once merged, it will sync
> out to Github and close the PR.
>
> HTH!
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++
>
>
>
>
>
>
>
>
>
> On 7/4/16, 1:23 PM, "Joern Kottmann" <kottm...@gmail.com> wrote:
>
> >Can you explain 3, is that a writable mirror at Github?
> >
> >Jörn
> >
> >On Mon, 2016-07-04 at 15:35 +, Mattmann, Chris A (3980) wrote:
> >> My +1 as well..I would suggest, specifically:
> >>
> >> 1. Use git-wp
> >> 2. Borrow and adapt this guide which suggests how to do it
> >> (i’m happy to adapt)
> >> http://wiki.apache.org/tika/UsingGit
> >> 3. Turn on writeable git wp mirror’ing to apache/opennlp
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++
> >> Chris Mattmann, Ph.D.
> >> Chief Architect
> >> Instrument Software and Science Data Systems Section (398)
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 168-519, Mailstop: 168-527
> >> Email: chris.a.mattm...@nasa.gov
> >> WWW:  http://sunset.usc.edu/~mattmann/
> >> ++
> >> Director, Information Retrieval and Data Science Group (IRDS)
> >> Adjunct Associate Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> WWW: http://irds.usc.edu/
> >> ++
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 7/4/16, 7:36 AM, "Joern Kottmann" <kottm...@gmail.com> wrote:
> >>
> >> > Hello all,
> >> >
> >> > do we still want to do this? Has been a while since we discussed
> >> > it.
> >> > I am happy to get it done if we reach consensus on it again.
> >> >
> >> > My +1 again.
> >> >
> >> > Jörn
> >> >
> >> > On Thu, Dec 20, 2012 at 4:40 PM, Tommaso Teofili <tommaso.teofili@g
> >> > mail.com>
> >> > wrote:
> >> >
> >> > > in my opinion that would be good, +1
> >> > > Tommaso
> >> > >
> >> > >
> >> > > 2012/12/19 Jörn Kottmann <kottm...@gmail.com>
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > I heard at ApacheCon Europe that it should be possible to
> >> > > > migrate from
> >> > > > Subversion to Git.
> >> > > >
> >> > > > Is there any interest in doing that? If we decide to do it I
> >> > > > suggest to
> >> > > > wait until the
> >> > > > 1.5.3 release is done so we have a bit time to also migrate our
> >> > > > build
> >> > > > process.
> >> > > >
> >> > > > Do all committers have experience with Git?
> >> > > >
> >> > > > Jörn
> >> > > >
> >> > >
>


Re: Migrate to Git?

2016-07-04 Thread Joern Kottmann
Thanks for your advice; if there are no concerns I will follow Chris's
suggestion.

The first step is to get us set up on git-wp. I will file an issue with
infra to do this for us.

Jörn

On Mon, 2016-07-04 at 15:35 +, Mattmann, Chris A (3980) wrote:
> My +1 as well..I would suggest, specifically:
> 
> 1. Use git-wp
> 2. Borrow and adapt this guide which suggests how to do it
> (i’m happy to adapt)
> http://wiki.apache.org/tika/UsingGit
> 3. Turn on writeable git wp mirror’ing to apache/opennlp
> 
> Cheers,
> Chris
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On 7/4/16, 7:36 AM, "Joern Kottmann" <kottm...@gmail.com> wrote:
> 
> > Hello all,
> > 
> > do we still want to do this? Has been a while since we discussed it.
> > I am happy to get it done if we reach consensus on it again.
> > 
> > My +1 again.
> > 
> > Jörn
> > 
> > On Thu, Dec 20, 2012 at 4:40 PM, Tommaso Teofili <tommaso.teof...@gmail.com>
> > wrote:
> > 
> > > in my opinion that would be good, +1
> > > Tommaso
> > > 
> > > 
> > > 2012/12/19 Jörn Kottmann <kottm...@gmail.com>
> > > 
> > > > Hi all,
> > > > 
> > > > I heard at ApacheCon Europe that it should be possible to migrate from
> > > > Subversion to Git.
> > > > 
> > > > Is there any interest in doing that? If we decide to do it I suggest to
> > > > wait until the
> > > > 1.5.3 release is done so we have a bit time to also migrate our build
> > > > process.
> > > > 
> > > > Do have all committers experience with git?
> > > > 
> > > > Jörn
> > > > 
> > > 


Re: Migrate to Git?

2016-07-04 Thread Joern Kottmann
Hello all,

do we still want to do this? Has been a while since we discussed it.
I am happy to get it done if we reach consensus on it again.

My +1 again.

Jörn

On Thu, Dec 20, 2012 at 4:40 PM, Tommaso Teofili 
wrote:

> in my opinion that would be good, +1
> Tommaso
>
>
> 2012/12/19 Jörn Kottmann 
>
> > Hi all,
> >
> > I heard at ApacheCon Europe that it should be possible to migrate from
> > Subverion to Git.
> >
> > Is there any interest in doing that? If we decide to do it I suggest to
> > wait until the
> > 1.5.3 release is done so we have a bit time to also migrate our build
> > process.
> >
> > Do have all committers experience with git?
> >
> > Jörn
> >
>


Re: Model to detect the gender

2016-07-04 Thread Joern Kottmann
Hello,

there are also other interesting properties e.g. person title (e.g.
professor, doctor), job title/position,
company legal form. And much more for other entity types.

Maybe it would be worth it to build a dedicated component to extract
properties from entities.

Jörn

On Fri, Jul 1, 2016 at 3:05 PM, Mondher Bouazizi  wrote:

> Hi,
>
> Sorry for my late reply. I didn't fully understand your last email, but here
> is what I meant:
>
> Given a simple dictionary you have that has the following columns:
>
> Name   Type   Gender
> Agatha First   F
> John   First   M
> Smith  Both   B
>
> where:
> - "First" refers to first name, "Last" (not in the example) refers to last
> name, and Both means it can be both.
> - "F" refers to female, "M" refers to males, and "B" refers to both
> genders.
>
> and given the following two sentences:
>
> 1. "It was nice meeting you John. I hope we meet again soon."
>
> 2. "Yes, I met Mrs. Smith. I asked her her opinion about the case and felt
> she knows something"
>
> In the first example, when you check in the dictionary, the name "John" is
> a male name, so no need to go any further.
> However, in the second example, the name "Smith", which is a family name in
> our case, can be fit for both, males and females. Therefore, we need to
> extract features from the surrounding context and perform a classification
> task.
> Here are some of the features I think they would be interesting to use:
>
> . Presence of a male title (e.g., "Mr.") before the word {True, False}
> . Presence of a female title (e.g., "Mrs.") before the word {True, False}
>
> . Gender of the first personal pronoun (subject or object form) to the
>   right of the name                     Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the first personal pronoun to the right
>   (in words)                            Values=NUMERIC
> . Gender of the second personal pronoun to the right of the name
>                                         Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the second personal pronoun to the right
>   (in words)                            Values=NUMERIC
> . Gender of the third personal pronoun to the right of the name
>                                         Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the third personal pronoun to the right
>   (in words)                            Values=NUMERIC
>
> . Gender of the first personal pronoun (subject or object form) to the
>   left of the name                      Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the first personal pronoun to the left
>   (in words)                            Values=NUMERIC
> . Gender of the second personal pronoun to the left of the name
>                                         Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the second personal pronoun to the left
>   (in words)                            Values=NUMERIC
> . Gender of the third personal pronoun to the left of the name
>                                         Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the third personal pronoun to the left
>   (in words)                            Values=NUMERIC
>
> In the second example here are the values you have for your features
>
> F1 = False
> F2 = True
> F3 = UNCERTAIN
> F4 = 1
> F5 = FEMALE
> F6 = 3
> F7 = FEMALE
> F8 = 4
> F9 = UNCERTAIN
> F10 = 2
> F11 = EMPTY
> F12 = 0
> F13 = EMPTY
> F14 = 0
>
> Of course the choice of features depends on the type of data, and the
> features themselves might not work well for some texts such as ones
> collected from twitter for example.
>
> I hope this helps you.
>
> Best regards
>
> Mondher
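
For illustration, a minimal Java sketch of how a few of the features above
could be extracted around a name occurrence. It is deliberately simplified
(only the title check and the first gendered pronoun to the right, naive
whitespace tokenization), and the word lists and class name are placeholders,
not something taken from this thread.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GenderFeatureSketch {

  private static final Set<String> MALE_TITLES =
      new HashSet<>(Arrays.asList("mr.", "mr"));
  private static final Set<String> FEMALE_TITLES =
      new HashSet<>(Arrays.asList("mrs.", "ms.", "miss"));
  private static final Set<String> MALE_PRONOUNS =
      new HashSet<>(Arrays.asList("he", "him", "his"));
  private static final Set<String> FEMALE_PRONOUNS =
      new HashSet<>(Arrays.asList("she", "her", "hers"));

  /** Builds a small feature vector for the name at position nameIndex. */
  static List<String> extract(String[] tokens, int nameIndex) {
    List<String> features = new ArrayList<>();

    // title (male/female) immediately before the name
    String prev = nameIndex > 0 ? tokens[nameIndex - 1].toLowerCase() : "";
    features.add("maleTitleBefore=" + MALE_TITLES.contains(prev));
    features.add("femaleTitleBefore=" + FEMALE_TITLES.contains(prev));

    // gender of and distance to the first gendered pronoun right of the name
    String gender = "EMPTY";
    int distance = 0;
    for (int i = nameIndex + 1; i < tokens.length; i++) {
      String t = tokens[i].toLowerCase();
      if (MALE_PRONOUNS.contains(t) || FEMALE_PRONOUNS.contains(t)) {
        gender = MALE_PRONOUNS.contains(t) ? "MALE" : "FEMALE";
        distance = i - nameIndex;
        break;
      }
    }
    features.add("firstRightPronoun=" + gender);
    features.add("firstRightPronounDist=" + distance);
    return features;
  }

  public static void main(String[] args) {
    String[] sent = "Yes , I met Mrs. Smith . I asked her her opinion".split(" ");
    System.out.println(extract(sent, 5)); // "Smith" is at index 5
  }
}

A real extractor would also need the UNCERTAIN value for pronouns such as "I"
or "they" and the full left/right windows listed above.
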
>
>
> On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta 
> wrote:
>
> > Hi Mondher,
> > could you give me a raw example to understand how i should train the
> > classifier model?
> >
> > Thank you in advance!
> > Damiano
> >
> >
> > 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi :
> >
> > > Hi,
> > >
> > > I would recommend a hybrid approach where, in a first step, you use a
> > plain
> > > dictionary and then perform the classification if needed.
> > >
> > > It's straightforward, but I think it would present better performances
> > than
> > > just performing a classification task.
> > >
> > > In the first step you use a dictionary of names along with an attribute
> > > specifying whether the name fits for males, females or both. In case
> the
> > > name fits for males or females exclusively, then no need to go any
> > further.
> > >
> > > If the name fits for both genders, or is a family name etc., a second
> > step
> > > is needed where you extract features from the context (surrounding
> words,
> > > etc.) and perform a classification task using any machine learning
> > > algorithm.
> > >
> > > Another way would be using the information itself (whether the name
> fits
> > > for males, females or both) as a feature when you perform the
> > > classification.
> > 

Re: Performances of OpenNLP tools

2016-07-04 Thread Joern Kottmann
You should get a copy of OntoNotes (it is for free) and OpenNLP already has
support to train models on it.
So the entry barrier to get started with this corpus is very low.

Jörn
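
As a rough sketch of how such per-component numbers can be produced, the name
finder already ships with an evaluator; something along the following lines
should work, assuming a model file and a test file in OpenNLP's native name
finder format (both file names are placeholders, and the calls follow the
1.6-era opennlp-tools API):

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderEvaluator;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class NameFinderEvalSketch {

  public static void main(String[] args) throws Exception {
    // placeholder model and held-out data in the native name finder format
    TokenNameFinderModel model =
        new TokenNameFinderModel(new File("en-ner-person.bin"));

    ObjectStream<NameSample> samples = new NameSampleDataStream(
        new PlainTextByLineStream(
            new MarkableFileInputStreamFactory(new File("ner-test.txt")),
            StandardCharsets.UTF_8));

    TokenNameFinderEvaluator evaluator =
        new TokenNameFinderEvaluator(new NameFinderME(model));
    evaluator.evaluate(samples);

    // precision, recall and F1 on the held-out data
    System.out.println(evaluator.getFMeasure());
  }
}
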

On Wed, Jun 29, 2016 at 11:22 AM, Anthony Beylerian <
anthony.beyler...@gmail.com> wrote:

> How about we keep track of the sets used for performance evaluation and
> results in this doc for now:
>
>
> https://docs.google.com/spreadsheets/d/15c0-u61HNWfQxiDSGjk49M1uBknIfb-LkbP4BDWTB5w/edit?usp=sharing
>
> Will try to take a better look at OntoNotes and what to use from it.
> Otherwise, if anyone would like to suggest proper data-sets for testing
> each component that would be really helpful
>
> Anthony
>
> On Thu, Jun 23, 2016 at 12:18 AM, Joern Kottmann <kottm...@gmail.com>
> wrote:
>
> > It would be nice to get MASC support into the OpenNLP formats package.
> >
> > Jörn
> >
> > On Tue, Jun 21, 2016 at 6:18 PM, Jason Baldridge <
> jasonbaldri...@gmail.com
> > >
> > wrote:
> >
> > > Jörn is absolutely right about that. Another good source of training
> data
> > > is MASC. I've got some instructions for training models with MASC here:
> > >
> > > https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
> > >
> > > Chalk (now defunct) provided a Scala wrapper around OpenNLP
> > functionality,
> > > so the instructions there should make it fairly straightforward to
> adapt
> > > MASC data to OpenNLP.
> > >
> > > -Jason
> > >
> > > On Tue, 21 Jun 2016 at 10:46 Joern Kottmann <kottm...@gmail.com>
> wrote:
> > >
> > > > There are some research papers which study and compare the
> performance
> > of
> > > > NLP toolkits, but be careful often they don't train the NLP tools on
> > the
> > > > same data and the training data makes a big difference on the
> > > performance.
> > > >
> > > > Jörn
> > > >
> > > > On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann <kottm...@gmail.com>
> > > > wrote:
> > > >
> > > > > Just don't use the very old existing models, to get good results
> you
> > > have
> > > > > to train on your own data, especially if the domain of the data
> used
> > > for
> > > > > training and the data which should be processed doesn't match. The
> > old
> > > > > models are trained on 90s news, those don't work well on today's
> news
> > > and
> > > > > probably much worse on tweets.
> > > > >
> > > > > OntoNotes is a good place to start if the goal is to process news.
> > > OpenNLP
> > > > > comes with built-in support to train models from OntoNotes.
> > > > >
> > > > > Jörn
> > > > >
> > > > > On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <
> > > > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > > > >
> > > > >> This sounds like a fantastic idea.
> > > > >>
> > > > >> ++
> > > > >> Chris Mattmann, Ph.D.
> > > > >> Chief Architect
> > > > >> Instrument Software and Science Data Systems Section (398)
> > > > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > >> Office: 168-519, Mailstop: 168-527
> > > > >> Email: chris.a.mattm...@nasa.gov
> > > > >> WWW:  http://sunset.usc.edu/~mattmann/
> > > > >> ++
> > > > >> Director, Information Retrieval and Data Science Group (IRDS)
> > > > >> Adjunct Associate Professor, Computer Science Department
> > > > >> University of Southern California, Los Angeles, CA 90089 USA
> > > > >> WWW: http://irds.usc.edu/
> > > > >> ++
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 6/21/16, 12:13 AM, "Anthony Beylerian" <
> > > anthonybeyler...@hotmail.com
> > > > >
> > > > >> wrote:
> > > > >>
> &

Re: DeepLearning4J as a ML for OpenNLP

2016-07-01 Thread Joern Kottmann
Hello,

the people from deeplearning4j are rather nice and I discussed with them
for a while how
it can be used for OpenNLP. The state back then was that they don't
properly support the
sparse feature vectors we use in OpenNLP today. Instead we would need to
use word embeddings.
In the end I never tried it out but I think it might not be very difficult
to get everything wired together,
the most difficult part is probably to find a deep learning model setup
which works well.

Jörn
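
To make the sparse-versus-dense point concrete, here is a hedged sketch of the
kind of bridge that would be needed: collapsing OpenNLP's string feature
context into a dense vector by averaging word embeddings. The embedding table
is a hypothetical in-memory map, and no deeplearning4j API is touched here.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class DenseContextSketch {

  // hypothetical embedding table; in practice loaded from pre-trained vectors
  private final Map<String, float[]> embeddings = new HashMap<>();
  private final int dim;

  DenseContextSketch(int dim) {
    this.dim = dim;
  }

  /** Averages the embeddings of the context strings; unknown strings are skipped. */
  float[] toDenseVector(String[] context) {
    float[] avg = new float[dim];
    int found = 0;
    for (String feature : context) {
      float[] vec = embeddings.get(feature);
      if (vec == null) {
        continue;
      }
      found++;
      for (int i = 0; i < dim; i++) {
        avg[i] += vec[i];
      }
    }
    if (found > 0) {
      for (int i = 0; i < dim; i++) {
        avg[i] /= found;
      }
    }
    return avg; // this dense vector is what a neural toolkit would consume
  }

  public static void main(String[] args) {
    DenseContextSketch sketch = new DenseContextSketch(3);
    sketch.embeddings.put("w=bank", new float[] {0.1f, 0.2f, 0.3f});
    System.out.println(Arrays.toString(
        sketch.toDenseVector(new String[] {"w=bank", "w=unknown"})));
  }
}
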

On Tue, Jun 28, 2016 at 11:23 PM, William Colen 
wrote:

> Hi,
>
> Do you think it would be possible to implement a ML based on DL4J?
>
> http://deeplearning4j.org/
>
> Thank you
> William
>


Re: SentimentAnalysisParser updates

2016-07-01 Thread Joern Kottmann
Hello,

would be nice to get a pull request for the work you did.

Thanks,
Jörn

On Wed, Jun 29, 2016 at 8:08 PM, Anastasija Mensikova <
mensikova.anastas...@gmail.com> wrote:

> Hi everyone,
>
> Some updates on our SentimentAnalysisParser.
>
> For the past week I worked on making a pull request to Tika and on looking
> for the right categorical open datasets to enhance my
> SentimentAnalysisParser and make it categorical. Thanks to your help and
> some research, we have decided on using SentiWordNet and Stanford
> Sentiment Treebank to create Facebook reaction-like categories for
> sentiment analysis.
>
> My next steps will include: creating a pull request to OpenNLP, work on
> making my parser categorical and implement AbstractEvaluatorTool and
> AbstractCrossValidatorTool to yield some results that can be used on our
> GH-page in the form of D3 graphs.
>
> Thank you for all of your help and have a great rest of the week!
>
> Thank you,
> Anastasija
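
For the categorical sentiment work described above, one possible starting
point is OpenNLP's document categorizer. A minimal train-and-classify sketch
could look like the following, assuming a training file with one document per
line and the sentiment category as the first token (file name and example
categories are placeholders; calls follow the 1.6-era doccat API):

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentimentDoccatSketch {

  public static void main(String[] args) throws Exception {
    // placeholder training file, e.g. "angry I can not believe how bad this was"
    ObjectStream<String> lines = new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("sentiment.train")),
        StandardCharsets.UTF_8);
    ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

    DoccatModel model = DocumentCategorizerME.train("en", samples,
        TrainingParameters.defaultParams(), new DoccatFactory());

    DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
    double[] outcomes = categorizer.categorize(
        new String[] {"what", "a", "wonderful", "surprise"});
    System.out.println(categorizer.getBestCategory(outcomes));
  }
}
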
>


Re: Performances of OpenNLP tools

2016-06-22 Thread Joern Kottmann
It would be nice to get MASC support into the OpenNLP formats package.

Jörn

On Tue, Jun 21, 2016 at 6:18 PM, Jason Baldridge <jasonbaldri...@gmail.com>
wrote:

> Jörn is absolutely right about that. Another good source of training data
> is MASC. I've got some instructions for training models with MASC here:
>
> https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
>
> Chalk (now defunct) provided a Scala wrapper around OpenNLP functionality,
> so the instructions there should make it fairly straightforward to adapt
> MASC data to OpenNLP.
>
> -Jason
>
> On Tue, 21 Jun 2016 at 10:46 Joern Kottmann <kottm...@gmail.com> wrote:
>
> > There are some research papers which study and compare the performance of
> > NLP toolkits, but be careful often they don't train the NLP tools on the
> > same data and the training data makes a big difference on the
> performance.
> >
> > Jörn
> >
> > On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann <kottm...@gmail.com>
> > wrote:
> >
> > > Just don't use the very old existing models, to get good results you
> have
> > > to train on your own data, especially if the domain of the data used
> for
> > > training and the data which should be processed doesn't match. The old
> > > models are trained on 90s news, those don't work well on today's news
> and
> > > probably much worse on tweets.
> > >
> > > OntoNotes is a good place to start if the goal is to process news.
> OpenNLP
> > > comes with built-in support to train models from OntoNotes.
> > >
> > > Jörn
> > >
> > > On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <
> > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > >
> > >> This sounds like a fantastic idea.
> > >>
> > >> ++
> > >> Chris Mattmann, Ph.D.
> > >> Chief Architect
> > >> Instrument Software and Science Data Systems Section (398)
> > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > >> Office: 168-519, Mailstop: 168-527
> > >> Email: chris.a.mattm...@nasa.gov
> > >> WWW:  http://sunset.usc.edu/~mattmann/
> > >> ++
> > >> Director, Information Retrieval and Data Science Group (IRDS)
> > >> Adjunct Associate Professor, Computer Science Department
> > >> University of Southern California, Los Angeles, CA 90089 USA
> > >> WWW: http://irds.usc.edu/
> > >> ++
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On 6/21/16, 12:13 AM, "Anthony Beylerian" <
> anthonybeyler...@hotmail.com
> > >
> > >> wrote:
> > >>
> > >> >+1
> > >> >
> > >> >Maybe we could put the results of the evaluator tests for each
> > component
> > >> somewhere on a webpage and on every release update them.
> > >> >This is of course provided there are reasonable data sets for testing
> > >> each component.
> > >> >What do you think?
> > >> >
> > >> >Anthony
> > >> >
> > >> >> From: mondher.bouaz...@gmail.com
> > >> >> Date: Tue, 21 Jun 2016 15:59:47 +0900
> > >> >> Subject: Re: Performances of OpenNLP tools
> > >> >> To: dev@opennlp.apache.org
> > >> >>
> > >> >> Hi,
> > >> >>
> > >> >> Thank you for your replies.
> > >> >>
> > >> >> Please Jeffrey accept once more my apologies for receiving the
> email
> > >> twice.
> > >> >>
> > >> >> I also think it would be great to have such studies on the
> > >> performances of
> > >> >> OpenNLP.
> > >> >>
> > >> >> I have been looking for this information and checked in many
> places,
> > >> >> including obviously google scholar, and I haven't found any serious
> > >> studies
> > >> >> or reliable results. Most of the existing ones report the
> > performances
> > >> of
> > >> >> outdated releases of OpenNLP, and focus more on the execution time
> or
> > >> >&

Re: Performances of OpenNLP tools

2016-06-21 Thread Joern Kottmann
There are some research papers which study and compare the performance of
NLP toolkits, but be careful often they don't train the NLP tools on the
same data and the training data makes a big difference on the performance.

Jörn

On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann <kottm...@gmail.com> wrote:

> Just don't use the very old existing models, to get good results you have
> to train on your own data, especially if the domain of the data used for
> training and the data which should be processed doesn't match. The old
> models are trained on 90s news, those don't work well on today's news and
> probably much worse on tweets.
>
> OntoNotes is a good place to start if the goal is to process news. OpenNLP
> comes with built-in support to train models from OntoNotes.
>
> Jörn
>
> On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> This sounds like a fantastic idea.
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 6/21/16, 12:13 AM, "Anthony Beylerian" <anthonybeyler...@hotmail.com>
>> wrote:
>>
>> >+1
>> >
>> >Maybe we could put the results of the evaluator tests for each component
>> somewhere on a webpage and on every release update them.
>> >This is of course provided there are reasonable data sets for testing
>> each component.
>> >What do you think?
>> >
>> >Anthony
>> >
>> >> From: mondher.bouaz...@gmail.com
>> >> Date: Tue, 21 Jun 2016 15:59:47 +0900
>> >> Subject: Re: Performances of OpenNLP tools
>> >> To: dev@opennlp.apache.org
>> >>
>> >> Hi,
>> >>
>> >> Thank you for your replies.
>> >>
>> >> Please Jeffrey accept once more my apologies for receiving the email
>> twice.
>> >>
>> >> I also think it would be great to have such studies on the
>> performances of
>> >> OpenNLP.
>> >>
>> >> I have been looking for this information and checked in many places,
>> >> including obviously google scholar, and I haven't found any serious
>> studies
>> >> or reliable results. Most of the existing ones report the performances
>> of
>> >> outdated releases of OpenNLP, and focus more on the execution time or
>> >> CPU/RAM consumption, etc.
>> >>
>> >> I think such a comparison will help not only evaluate the overall
>> accuracy,
>> >> but also highlight the issues with the existing models (as a matter of
>> >> fact, the existing models fail to recognize many of the hashtags in
>> tweets:
>> >> the tokenizer splits them into the "#" symbol and a word that the PoS
>> >> tagger also fails to recognize).
>> >>
>> >> Therefore, building Twitter-based models would also be useful, since
>> many
>> >> of the works in academia / industry are focusing on Twitter data.
>> >>
>> >> Best regards,
>> >>
>> >> Mondher
>> >>
>> >>
>> >>
>> >> On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <
>> jasonbaldri...@gmail.com>
>> >> wrote:
>> >>
>> >> > It would be fantastic to have these numbers. This is an example of
>> >> > something that would be a great contribution by someone trying to
>> >> > contribute to open source and who is maybe just getting into machine
>> >> > learning and natural language processing.
>> >> >
>> >> > For Twitter-ish text, it'd be great to look at models trained and
>> evaluated
>> >> > on the Tweet NLP resources:
>> >> >
>> >> > http://www.cs.cmu.edu/~ark/TweetNLP/
>> >> >

Re: Performances of OpenNLP tools

2016-06-21 Thread Joern Kottmann
Just don't use the very old existing models, to get good results you have
to train on your own data, especially if the domain of the data used for
training and the data which should be processed doesn't match. The old
models are trained on 90s news, those don't work well on today's news and
probably much worse on tweets.

OntoNotes is a good place to start if the goal is to process news. OpenNLP
comes with built-in support to train models from OntoNotes.

Jörn

On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> This sounds like a fantastic idea.
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
>
>
>
>
>
>
>
>
> On 6/21/16, 12:13 AM, "Anthony Beylerian" 
> wrote:
>
> >+1
> >
> >Maybe we could put the results of the evaluator tests for each component
> somewhere on a webpage and on every release update them.
> >This is of course provided there are reasonable data sets for testing
> each component.
> >What do you think?
> >
> >Anthony
> >
> >> From: mondher.bouaz...@gmail.com
> >> Date: Tue, 21 Jun 2016 15:59:47 +0900
> >> Subject: Re: Performances of OpenNLP tools
> >> To: dev@opennlp.apache.org
> >>
> >> Hi,
> >>
> >> Thank you for your replies.
> >>
> >> Please Jeffrey accept once more my apologies for receiving the email
> twice.
> >>
> >> I also think it would be great to have such studies on the performances
> of
> >> OpenNLP.
> >>
> >> I have been looking for this information and checked in many places,
> >> including obviously google scholar, and I haven't found any serious
> studies
> >> or reliable results. Most of the existing ones report the performances
> of
> >> outdated releases of OpenNLP, and focus more on the execution time or
> >> CPU/RAM consumption, etc.
> >>
> >> I think such a comparison will help not only evaluate the overall
> accuracy,
> >> but also highlight the issues with the existing models (as a matter of
> >> fact, the existing models fail to recognize many of the hashtags in
> tweets:
> >> the tokenizer splits them into the "#" symbol and a word that the PoS
> >> tagger also fails to recognize).
> >>
> >> Therefore, building Twitter-based models would also be useful, since
> many
> >> of the works in academia / industry are focusing on Twitter data.
> >>
> >> Best regards,
> >>
> >> Mondher
> >>
> >>
> >>
> >> On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <
> jasonbaldri...@gmail.com>
> >> wrote:
> >>
> >> > It would be fantastic to have these numbers. This is an example of
> >> > something that would be a great contribution by someone trying to
> >> > contribute to open source and who is maybe just getting into machine
> >> > learning and natural language processing.
> >> >
> >> > For Twitter-ish text, it'd be great to look at models trained and
> evaluated
> >> > on the Tweet NLP resources:
> >> >
> >> > http://www.cs.cmu.edu/~ark/TweetNLP/
> >> >
> >> > And comparing to how their models performed, etc. Also, it's worth
> looking
> >> > at spaCy (Python NLP modules) for further comparisons.
> >> >
> >> > https://spacy.io/
> >> >
> >> > -Jason
> >> >
> >> > On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick 
> >> > wrote:
> >> >
> >> > > I saw the same question on the users list on June 17. At least I
> thought
> >> > it
> >> > > was the same question -- sorry if it wasn't.
> >> > >
> >> > > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <
> >> > > chris.a.mattm...@jpl.nasa.gov> wrote:
> >> > >
> >> > > > Well, hold on. He sent that mail (as of the time of this mail) 4
> >> > > > mins previously. Maybe some folks need some time to reply ^_^
> >> > > >
> >> > > > ++
> >> > > > Chris Mattmann, Ph.D.
> >> > > > Chief Architect
> >> > > > Instrument Software and Science Data Systems Section (398)
> >> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> > > > Office: 168-519, Mailstop: 168-527
> >> > > > Email: chris.a.mattm...@nasa.gov
> >> > > > WWW:  http://sunset.usc.edu/~mattmann/
> >> > > > ++
> >> > > > Director, Information Retrieval and Data Science Group (IRDS)
> >> > > > Adjunct Associate Professor, Computer Science Department
> >> > > > University of Southern California, Los Angeles, 

Re: svn commit: r1731145 - in /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools: lemmatizer/ util/

2016-04-26 Thread Joern Kottmann
Hello Rodrigo,

you are adding a couple of java files in this commit, and I think more
in other commits for the lemmatizer.

All new Java files must have the AL header. Could you please add the
header to the files where it is missing?

Thanks,
Jörn 


On Thu, 2016-02-18 at 21:02 +, rage...@apache.org wrote:
> Author: ragerri
> Date: Thu Feb 18 21:02:34 2016
> New Revision: 1731145
> 
> URL: http://svn.apache.org/viewvc?rev=1731145&view=rev
> Log:
> OPENNLP-760 adding factory and string utils to induce lemma classes
> 
> Added:
> opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmaSampleStream.java
> opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/Lemmatizer.java
> opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmatizerEvaluationMoni
> tor.java
> opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmatizerEvaluator.java
> opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmatizerFactory.java
> Modified:
> opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/util/StringUtil.java
> 
> Added: opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmaSampleStream.java
> URL: http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/mai
> n/java/opennlp/tools/lemmatizer/LemmaSampleStream.java?rev=1731145&view=auto
> =
> =
> --- opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmaSampleStream.java
> (added)
> +++ opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmaSampleStream.java
> Thu Feb 18 21:02:34 2016
> @@ -0,0 +1,49 @@
> +package opennlp.tools.lemmatizer;
> +
> +import java.io.IOException;
> +import java.util.ArrayList;
> +import java.util.List;
> +
> +import opennlp.tools.util.FilterObjectStream;
> +import opennlp.tools.util.ObjectStream;
> +import opennlp.tools.util.StringUtil;
> +
> +
> +/**
> + * Reads data for training and testing. The format consists of:
> + * word\tabpostag\tablemma.
> + * @version 2016-02-16
> + */
> +public class LemmaSampleStream extends FilterObjectStream<String, LemmaSample> {
> +
> +  public LemmaSampleStream(ObjectStream<String> samples) {
> +super(samples);
> +  }
> +
> +  public LemmaSample read() throws IOException {
> +
> +List<String> toks = new ArrayList<String>();
> +List<String> tags = new ArrayList<String>();
> +List<String> preds = new ArrayList<String>();
> +
> +for (String line = samples.read(); line != null &&
> !line.equals(""); line = samples.read()) {
> +  String[] parts = line.split("\t");
> +  if (parts.length != 3) {
> +System.err.println("Skipping corrupt line: " + line);
> +  }
> +  else {
> +toks.add(parts[0]);
> +tags.add(parts[1]);
> +String ses = StringUtil.getShortestEditScript(parts[0],
> parts[2]);
> +preds.add(ses);
> +  }
> +}
> +if (toks.size() > 0) {
> +  LemmaSample lemmaSample = new LemmaSample(toks.toArray(new
> String[toks.size()]), tags.toArray(new String[tags.size()]),
> preds.toArray(new String[preds.size()]));
> +  return lemmaSample;
> +}
> +else {
> +  return null;
> +}
> +  }
> +}
> 
> Added: opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/Lemmatizer.java
> URL: http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/mai
> n/java/opennlp/tools/lemmatizer/Lemmatizer.java?rev=1731145&view=auto
> =
> =
> --- opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/Lemmatizer.java (added)
> +++ opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/Lemmatizer.java Thu Feb
> 18 21:02:34 2016
> @@ -0,0 +1,18 @@
> +package opennlp.tools.lemmatizer;
> +
> +/**
> + * The interface for lemmatizers.
> + */
> +public interface Lemmatizer {
> +
> +  /**
> +   * Generates lemma tags for the word and postag returning the
> result in an array.
> +   *
> +   * @param toks an array of the tokens
> +   * @param tags an array of the pos tags
> +   *
> +   * @return an array of lemma classes for each token in the
> sequence.
> +   */
> +  public String[] lemmatize(String[] toks, String tags[]);
> +
> +}
> 
> Added: opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmatizerEvaluationMoni
> tor.java
> URL: http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/mai
> n/java/opennlp/tools/lemmatizer/LemmatizerEvaluationMonitor.java?rev=
> 1731145&view=auto
> =
> =
> --- opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmatizerEvaluationMoni
> tor.java (added)
> +++ opennlp/trunk/opennlp-
> tools/src/main/java/opennlp/tools/lemmatizer/LemmatizerEvaluationMoni
> tor.java Thu Feb 18 21:02:34 2016
> @@ -0,0 +1,12 @@
> +package opennlp.tools.lemmatizer;
> +
> +import 
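
As a sketch of how the lemmatizer package in the patch above is meant to be
used once the statistical implementation is in place, assuming the
LemmatizerME and LemmatizerModel classes that complete it and a model trained
from the word/postag/lemma format described in the Javadoc (the model file
name and example tags are placeholders):

import java.io.File;

import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;

public class LemmatizerSketch {

  public static void main(String[] args) throws Exception {
    // hypothetical model file trained from word\tpostag\tlemma data
    LemmatizerModel model = new LemmatizerModel(new File("en-lemmatizer.bin"));
    LemmatizerME lemmatizer = new LemmatizerME(model);

    String[] tokens = {"The", "dogs", "were", "barking"};
    String[] tags = {"DT", "NNS", "VBD", "VBG"};

    // one lemma per token, predicted from the token and its POS tag
    String[] lemmas = lemmatizer.lemmatize(tokens, tags);
    for (int i = 0; i < tokens.length; i++) {
      System.out.println(tokens[i] + "\t" + tags[i] + "\t" + lemmas[i]);
    }
  }
}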

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-26 Thread Joern Kottmann
The Large Movie Review Dataset might be interesting for this as well:
http://ai.stanford.edu/~amaas/data/sentiment/

Jörn

On Tue, Apr 26, 2016 at 4:26 PM, Anthony Beylerian <
anthony.beyler...@gmail.com> wrote:

> sentiment analysis discussion doc :
>
>
> https://docs.google.com/document/d/1Gi59YqtisY4NLaVY3B7CNLMTgCRZm9JEk17kmBmWXqQ/edit?usp=sharing
>
> On Tue, Apr 26, 2016 at 10:56 PM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
> > Hi,
> >
> > Sure here is the link:
> >
> > https://hangouts.google.com/call/a2w5cgdtirf6jgfb4ww5l2l64ee
> >
> > Sorry for the delay.
> >
> > Cheers,
> > Chris
> >
> > ++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++
> > Director, Information Retrieval and Data Science Group (IRDS)
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > WWW: http://irds.usc.edu/
> > ++
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On 4/26/16, 6:48 AM, "Anastasija Mensikova" <
> > mensikova.anastas...@gmail.com> wrote:
> >
> > >Hi everyone,
> > >
> > >
> > >Is the 9:40 ET hangout still happening? I just have to leave soon to go
> > to class.
> > >
> > >
> > >Thank you,
> > >Anastasija
> > >
> > >
> > >On 25 April 2016 at 23:39, Anastasija Mensikova
> > > wrote:
> > >
> > >Hi Chris,
> > >
> > >
> > >Yes, that's perfect. I'll be ready by 9:40am.
> > >
> > >
> > >Thank you,
> > >Anastasija
> > >
> > >
> > >On 25 April 2016 at 23:28, Mattmann, Chris A (3980)
> > > wrote:
> > >
> > >Hey Anastasija,
> > >
> > >To be honest 9am EST is a little aggressive, I will likely be able
> > >to do 6:40 am PT (am traveling back from DC as I type this) which
> > >is 9:40am ET.
> > >
> > >My GChat handle is chris.mattm...@gmail.com. I will create a hangout
> > >and send to the list please contact me at 6:40am PT.
> > >
> > >Cheers,
> > >Chris
> > >
> > >++
> > >Chris Mattmann, Ph.D.
> > >Chief Architect
> > >Instrument Software and Science Data Systems Section (398)
> > >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > >Office: 168-519, Mailstop: 168-527
> > >Email: chris.a.mattm...@nasa.gov
> > >WWW:
> > >http://sunset.usc.edu/~mattmann/ 
> > >++
> > >Director, Information Retrieval and Data Science Group (IRDS)
> > >Adjunct Associate Professor, Computer Science Department
> > >University of Southern California, Los Angeles, CA 90089 USA
> > >WWW: http://irds.usc.edu/
> > >++
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >On 4/25/16, 11:07 PM, "Anastasija Mensikova" <
> > mensikova.anastas...@gmail.com> wrote:
> > >
> > >>Hi everyone,
> > >>
> > >>
> > >>So is the hangout session tomorrow (Tuesday) at 6:30pm IST (9am EST)
> > confirmed or not?
> > >>
> > >>
> > >>Thank you,
> > >>Anastasija
> > >>
> > >>
> > >>On 25 April 2016 at 15:23, Madhawa Kasun Gunasekara
> > >> wrote:
> > >>
> > >>Hi all,
> > >>
> > >>
> > >>Shall we have the hangout session tomorrow (Tuesday) about 18:30 IST ?
> > >>
> > >>
> > >>Thanks,
> > >>
> > >>Madhawa
> > >>
> > >>
> > >>
> > >>
> > >>Madhawa
> > >>
> > >>
> > >>
> > >>
> > >>On Sun, Apr 24, 2016 at 10:33 PM, Mondher Bouazizi
> > >> wrote:
> > >>
> > >>Hi,
> > >>
> > >>I am sorry for my late reply.
> > >>
> > >>Given the time difference between Japan and USA, I think I won't be
> > >>available on weekdays. I will be available only on Friday/Saturday
> > morning
> > >>(9-10am EST).
> > >>
> > >>I am not sure if Chris is OK with that, we had our previous meetings on
> > >>Saturday mornings.
> > >>
> > >>Otherwise, please go ahead. I will join as soon as I can.
> > >>
> > >>Thanks.
> > >>
> > >>@Chris: my github ID is mondher-bouazizi
> > >>
> > >>Best regards,
> > >>
> > >>Mondher
> > >>
> > >>On Mon, Apr 25, 2016 at 1:44 AM, Anastasija Mensikova <
> > >>mensikova.anastas...@gmail.com> wrote:
> > >>
> > >>> Hi Anthony,
> > >>>
> > >>> I can make it by Madhawa's proposal too, after 6pm IST on Tuesday
> > (after
> > >>> 8:30am EST). Let me know when exactly!
> > >>>
> > >>> Thank you,
> > >>> Anastasija
> > >>>
> > >>> On 24 April 2016 at 03:02, Anthony Beylerian <
> > anthony.beyler...@gmail.com>
> > >>> wrote:
> > >>>
> >  Hi Anastasija,
> > 
> >  I'm not available by those times (00-07 JST).  I 

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-26 Thread Joern Kottmann
I will be able to join as well.

Jörn

On Tue, Apr 26, 2016 at 5:28 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hey Anastasija,
>
> To be honest 9am EST is a little aggressive, I will likely be able
> to do 6:40 am PT (am traveling back from DC as I type this) which
> is 9:40am ET.
>
> My GChat handle is chris.mattm...@gmail.com. I will create a hangout
> and send to the list please contact me at 6:40am PT.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
>
>
>
>
>
>
>
> On 4/25/16, 11:07 PM, "Anastasija Mensikova" <
> mensikova.anastas...@gmail.com> wrote:
>
> >Hi everyone,
> >
> >
> >So is the hangout session tomorrow (Tuesday) at 6:30pm IST (9am EST)
> confirmed or not?
> >
> >
> >Thank you,
> >Anastasija
> >
> >
> >On 25 April 2016 at 15:23, Madhawa Kasun Gunasekara
> > wrote:
> >
> >Hi all,
> >
> >
> >Shall we have the hangout session tomorrow (Tuesday) about 18:30 IST ?
> >
> >
> >Thanks,
> >
> >Madhawa
> >
> >
> >
> >
> >Madhawa
> >
> >
> >
> >
> >On Sun, Apr 24, 2016 at 10:33 PM, Mondher Bouazizi
> > wrote:
> >
> >Hi,
> >
> >I am sorry for my late reply.
> >
> >Given the time difference between Japan and USA, I think I won't be
> >available on weekdays. I will be available only on Friday/Saturday morning
> >(9-10am EST).
> >
> >I am not sure if Chris is OK with that, we had our previous meetings on
> >Saturday mornings.
> >
> >Otherwise, please go ahead. I will join as soon as I can.
> >
> >Thanks.
> >
> >@Chris: my github ID is mondher-bouazizi
> >
> >Best regards,
> >
> >Mondher
> >
> >On Mon, Apr 25, 2016 at 1:44 AM, Anastasija Mensikova <
> >mensikova.anastas...@gmail.com> wrote:
> >
> >> Hi Anthony,
> >>
> >> I can make it by Madhawa's proposal too, after 6pm IST on Tuesday (after
> >> 8:30am EST). Let me know when exactly!
> >>
> >> Thank you,
> >> Anastasija
> >>
> >> On 24 April 2016 at 03:02, Anthony Beylerian <
> anthony.beyler...@gmail.com>
> >> wrote:
> >>
> >>> Hi Anastasija,
> >>>
> >>> I'm not available by those times (00-07 JST).  I could make it by
> >>> Madhawa's proposal, but otherwise please go ahead, we may discuss some
> >>> other time.
> >>>
> >>> @Chris: github ID : beylerian
> >>>
> >>> Best,
> >>>
> >>> Anthony
> >>>
> >>>
> >>> Please find my github profile
> >https://github.com/madhawa-gunasekara <
> https://github.com/madhawa-gunasekara>
> >>>
> >>> Madhawa
> >>>
> >>> On Sun, Apr 24, 2016 at 12:13 AM, Madhawa Kasun Gunasekara <
> >>> madhaw...@gmail.com> wrote:
> >>>
> >>> > Hi Chris,
> >>> >
> >>> > I'm available on Tuesday & Wednesday after 6.00 pm IST.
> >>> >
> >>> > Thanks,
> >>> > Madhawa
> >>> >
> >>> > Madhawa
> >>> >
> >>> > On Sat, Apr 23, 2016 at 11:38 PM, Anastasija Mensikova <
> >>> > mensikova.anastas...@gmail.com> wrote:
> >>> >
> >>> >> Hi Chris,
> >>> >>
> >>> >> Thank you very much for your email. I'm so excited to work with you!
> >>> >>
> >>> >> My Github name is amensiko.
> >>> >>
> >>> >> And yes, next week sounds good! I'm available on: Tuesday at 4:20pm
> >>> EST,
> >>> >> Thursday 11am - 2:30pm and 4:20 - 6pm EST, Friday 11am - 3pm EST.
> >>> >>
> >>> >> Thank you,
> >>> >> Anastasija
> >>> >>
> >>> >> On 23 April 2016 at 10:21, Mattmann, Chris A (3980) <
> >>> >> chris.a.mattm...@jpl.nasa.gov> wrote:
> >>> >>
> >>> >>> Hi Anastasija,
> >>> >>>
> >>> >>> Hope you are well. It’s now time to get started on the project.
> >>> >>> Monder, Anthony, Madhawa and I have been discussing ideas about
> >>> >>> how to proceed with the project and even developing a task list.
> >>> >>> Let’s get your tasks input into that list, and also coordinate.
> >>> >>>
> >>> >>> I also have an action to share some Spanish/English data to try
> >>> >>> and do cross lingual sentiment analysis.
> >>> >>>
> >>> >>> Are you available to chat this week?
> >>> >>>
> >>> >>> Cheers,
> >>> >>> Chris
> >>> >>>
> >>> >>> ++
> >>> >>> Chris Mattmann, Ph.D.
> >>> >>> Chief Architect
> >>> >>> Instrument Software and Science Data Systems Section (398)
> >>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>> >>> Office: 168-519, Mailstop: 168-527
> >>> >>> Email: chris.a.mattm...@nasa.gov
> >>> >>> WWW:
> >http://sunset.usc.edu/~mattmann/ 

Re: Question about deprecated NameFinderME constructors

2016-03-08 Thread Joern Kottmann
There is a custom xml element that can load a user-defined class
for feature generation.

So you would add an element like this (the class name here is just a placeholder):

<custom class="com.example.SuffixFeatureGenerator"/>

https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen

I think we should remove the deprecated training methods so it is no longer
possible to train models which can't be loaded.

Jörn
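
For the programmatic case raised below, a user-defined generator only needs to
implement AdaptiveFeatureGenerator and can then be referenced from the
descriptor through the custom element. A hedged sketch, with an illustrative
package and class name and simple suffix features standing in for the
morphological analysis:

package com.example;

import java.util.List;

import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public class SuffixFeatureGenerator implements AdaptiveFeatureGenerator {

  private static final int MAX_SUFFIX_LENGTH = 4;

  @Override
  public void createFeatures(List<String> features, String[] tokens, int index,
      String[] previousOutcomes) {
    // emit suffixes of length 1..4 of the current token as features
    String token = tokens[index];
    for (int len = 1; len <= MAX_SUFFIX_LENGTH && len <= token.length(); len++) {
      features.add("suf" + len + "=" + token.substring(token.length() - len));
    }
  }

  @Override
  public void updateAdaptiveData(String[] tokens, String[] outcomes) {
    // no adaptive state in this sketch
  }

  @Override
  public void clearAdaptiveData() {
    // no adaptive state in this sketch
  }
}

The descriptor entry would then point at this class by its fully qualified
name, as in the placeholder custom element above.
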

On Mon, Mar 7, 2016 at 6:45 PM, Cohan Sujay Carlos  wrote:

> Dear Rodrigo,
>
> Thank you for the informative reply.
>
> I just wanted to say I feel there is a use-case that the new constructor
> still does not support.  Let me explain with an example.
>
> Let's first take the example of brown-feature.xml, which is defined as ...
>
> 
>   
> 
>   
> 
>   
>   
> 
>   
> 
>   
> 
>
> ... In this feature generator, I believe "window" maps to the
> WindowFeatureGenerator
> <
> https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html
> >
> and "token" maps to TokenFeatureGenerator
> <
> https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/TokenFeatureGenerator.html
> >
> .
>
> It's clear that we can create new feature generators that are combinations
> of existing feature generators.
>
> However, let's say I have a task / language where none of the existing
> feature generators or combinations work very well.
>
> Say, for example, that I want to create a new feature generator that pulls
> out morphemes from agglutinative South Indian languages ... let's call it
> "AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator".
>
> It's not clear how one could create XML tags for this feature generator
> using the new constructor.
>
> The same thing is easy to do programmatically using the old constructors ->
> I would just extend the AdaptiveFeatureGenerator
> <
> https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html
> >
> .
>
> So, I was wondering ... are we giving up some API flexibility and
> simplicity by removing the constructors that enable me to use subclasses of
> AdaptiveFeatureGenerator
> <
> https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html
> >
> while
> there is no easy way to create something like a
> AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator and use
> it as a feature generator in the NameFinderME using the new constructor's
> XML specification.
>
> Cohan Sujay Carlos
> Aiaioo Labs, +91-77605-80015, http://www.aiaioo.com
>
> On Mon, Mar 7, 2016 at 4:37 PM, Rodrigo Agerri  wrote:
>
> > Hi,
> >
> > You can do all those tasks by using the create method in the
> > TokenNameFinderFactory:
> >
> >
> >
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/namefind/TokenNameFinderFactory.java?revision=1712553=markup#l100
> >
> > For that you need to:
> >
> > 1. Provide the name of the factory class you are using, it could be
> > the same factory class: TokenNameFinderFactory.class.getName()
> > 2. Create an XML descriptor and pass it as a byte[] array
> > 3. Load the resources (e.g., clusters) in a resources map consisting
> > of the id of the resource and the serializer.
> > 4. The sequenceCodec: BIO or BILOU.
> >
> > There Namefinder documentation was already updated:
> >
> >
> >
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml?view=markup
> >
> > There is sample code to do that in the CLI class:
> >
> >
> >
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderTrainerTool.java?revision=1674262=markup
> >
> > and to run it from the CLI:
> >
> > 1. Create an XML feature descriptor, e.g., brown-feature.xml
> >
> > 
> >   
> > 
> >   
> > 
> >   
> >   
> > 
> >   
> > 
> >   
> > 
> >
> > 2. Put your clustering lexicon(s) in a directory, .e.g, clusters
> > 3. Train: bin/opennlp TokenNameFinderTrainer -featuregen brown.xml
> > -resources clusters/ -params lang/ml/PerceptronTrainerParams.txt -lang
> > en -model brown.bin -data
> > ~/experiments/nerc/opennlp/en/conll03/en-testb.opennlp -encoding UTF-8
> >
> > If you open the brown.bin model you will see the cluster lexicon
> > seralized inside the model.
> >
> > You can now use it like any other model, the TokenNameFinderFactory
> > will read again all the required resources when loading the model in
> > the TokenNameFinderME class.
> >
> > HTH,
> >
> > R
> >
> >
> >
> >
> >
> >
> > On Mon, Feb 15, 2016 at 7:59 AM, Cohan Sujay Carlos 
> > wrote:
> > > Hi,
> > >
> > > I noticed that in the OpenNLP SVM 'trunk', the formerly deprecated
> > > constructors for the class *NameFinderME*:
> > >
> > 

Re: Language Model contribution

2016-02-17 Thread Joern Kottmann
Oops, I confused the language model you were working on with language
detection.
I think the interface is good as it is.

Jörn

On Wed, Feb 17, 2016 at 10:00 AM, Joern Kottmann <kottm...@gmail.com> wrote:

> Hello,
>
> I saw the language model commit. Thanks for contributing that!
>
> Would it be possible to get a short introduction to it?
>
> The interface is supposed to take a StringList. Wouldn't it be better if a
> user can just pass in a String instead? Otherwise he has to worry about
> tokenizing a string in a language he doesn't know. I think that should be
> the task of the language detector.
>
> Can we come up with another name for the package? Maybe langid/langdetect
> or something similar? Any opinions?
>
> The Model in LanguageModel we usually use to refer to machine learning
> models, maybe we could rename this interface to LanguageDetector.
>
> Jörn
>


Language Model contribution

2016-02-17 Thread Joern Kottmann
Hello,

I saw the language model commit. Thanks for contributing that!

Would it be possible to get a short introduction to it?

The interface is supposed to take a StringList. Wouldn't it be better if a
user can just pass in a String instead? Otherwise he has to worry about
tokenizing a string in a language he doesn't know. I think that should be
the task of the language detector.

Can we come up with another name for the package? Maybe langid/langdetect
or something similar? Any opinions?

The Model in LanguageModel we usually use to refer to machine learning
models, maybe we could rename this interface to LanguageDetector.

Jörn
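
A quick usage sketch for the contributed n-gram model, assuming the
NGramLanguageModel implementation and StringList inputs from this package (the
trigram size, toy sentences and way of adding n-grams are illustrative only):

import opennlp.tools.languagemodel.NGramLanguageModel;
import opennlp.tools.util.StringList;

public class LanguageModelSketch {

  public static void main(String[] args) {
    NGramLanguageModel lm = new NGramLanguageModel(3);

    // "train" by adding tokenized sentences as n-grams of size 1..3
    lm.add(new StringList("the", "cat", "sat", "on", "the", "mat"), 1, 3);
    lm.add(new StringList("the", "dog", "sat", "on", "the", "rug"), 1, 3);

    // score a token sequence
    double p = lm.calculateProbability(new StringList("the", "cat", "sat"));
    System.out.println("p = " + p);
  }
}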


Re: Question about OpenNLP and comparison to e.g., NTLK, Stanford NER, etc.

2015-11-12 Thread Joern Kottmann
On Thu, 2015-11-12 at 15:43 +, Russ, Daniel (NIH/CIT) [E] wrote:
> 1) I use the old sourceforge models.  I find that the sources of error
> in my analysis are usually not due to mistakes in sentence detection or
> POS tagging.  I don’t have the annotated data or the time/money to
> build custom models.  Yes, the text I analyze is quite different than
> the (WSJ? or what corpus was used to build the models), but it is good
> enough. 

That is interesting, I wasn't aware that those are still useful.

It really depends on the component as well, I was mostly thinking about
the name finder models when I wrote that.

Do you only use the Sentence Detector, Tokenizer and POS tagger?

You could use OntoNotes (almost for free) to train models. Maybe we
should look into distributing models trained on OntoNotes.

Jörn





Re: Question about OpenNLP and comparison to e.g., NTLK, Stanford NER, etc.

2015-11-12 Thread Joern Kottmann
On Thu, 2015-11-12 at 19:50 +, Jason Baldridge wrote:
> Having said that, there is a lot of activity in the deep learning
> space,
> where old techniques (neural nets) are now viable in ways they weren't
> previously, and they are outperforming linear classifiers in task
> after
> task. I'm currently looking at Deeplearning4J, and it would be great
> to
> have OpenNLP or a project like it make solid NLP models available
> based on
> deep learning methods, especially LSTMs and Convolutional Neural Nets.
> Deeplearning4J is Java/Scala friendly and it is ASL, so that's at
> least
> setting off on the right foot.
> 
> http://deeplearning4j.org/


I hope I can find a bit of time to write an integration for it.
Thanks for sharing!

Jörn





Re: mallet addon

2015-10-20 Thread Joern Kottmann
Hello,

I updated the code and afterwards spent some time evaluating it again. The
maxent training is very close to our maxent classifier. I also checked the
training code again and it looks good to me, but it would be nice if you
can review it.

There are a couple of other classifiers in mallet, it should be trivial to
expose them all to OpenNLP.

Jörn

On Tue, Oct 20, 2015 at 9:12 AM, Rodrigo Agerri <rodrigo.age...@ehu.eus>
wrote:

> Hello,
>
> Thanks. I thought I had an idea for CRF not obtaining good results
> with OpenNLP default features, e.g.,
>
> http://lingpipe-blog.com/2006/11/22/why-do-you-hate-crfs/
>
> but if results are also worse in Maxent, that is intriguing. I will
> look at the Mallet implementation to see if I find out something.
>
> R
>
>
>
> On Mon, Oct 12, 2015 at 4:07 PM, Joern Kottmann <kottm...@gmail.com>
> wrote:
> > Hello,
> >
> > fixed up the code a bit. The performance is not really good. Do you have
> > any idea why that could be?
> >
> > Neither the maxent or crf get good evaluation numbers on NER.
> >
> > I will push the changes and then you can experiment with it too.
> >
> > Jörn
> >
> >
> > On Mon, Oct 5, 2015 at 4:45 PM, Rodrigo Agerri <rage...@apache.org>
> wrote:
> >
> >> Hi,
> >>
> >> On Tue, Sep 29, 2015 at 3:41 PM, Joern Kottmann <kottm...@gmail.com>
> >> wrote:
> >> > We can also move
> >> > it to the sandbox, releasing it at Apache might be more difficult
> since
> >> > mallet pulls in incompatible licensed dependencies. But maybe that
> >> changed,
> >> > we can check.
> >>
> >> Mallet is released under Common Public License
> >>
> >> http://opensource.org/licenses/cpl1.0.php
> >>
> >> but as you have mentioned, it pulls several dependencies that are
> >> LGPL. These are the dependencies:
> >>
> >>   
> >>   org.beanshell
> >>   bsh
> >>   2.0b4
> >> 
> >>
> >> This version is LGPL, however, later versions are APL 2.0
> >>
> >> https://github.com/beanshell/beanshell
> >>
> >> 
> >>   jgrapht
> >>   jgrapht
> >>   0.6.0
> >> 
> >>
> >> that version was also LGPL, but it has now been dual-licensed with EPL
> 1.0
> >>
> >> https://github.com/jgrapht/jgrapht/wiki/Relicensing
> >>
> >> which could be included also in APL 2.0 projects
> >>
> >> http://www.apache.org/legal/resolved.html
> >>
> >>  
> >>   net.sf.jwordnet
> >>   jwnl
> >>   1.4_rc3
> >> 
> >>
> >> BSD license, but this library has already been discussed here.
> >>
> >>  
> >>   net.sf.trove4j
> >>   trove4j
> >>   2.0.2
> >> 
> >>
> >> LGPL-ed.
> >>
> >> 
> >>   com.googlecode.matrix-toolkits-java
> >>   mtj
> >>   0.9.14
> >> 
> >>
> >> also LGPL
> >>
> >> Rodrigo
> >>
>


Re: Out of Bounds Exception in BioCodec.class

2015-10-07 Thread Joern Kottmann
Hello,

I can't see the exception. Can you post it just as text please.

Thanks,
Jörn

On Wed, 2015-10-07 at 10:56 -0400, Blizzard, Zach wrote:
> Hey Dev team,
> 
>  
> 
> I have a quick question about the BioCodec class: I’m trying to create
> my own model to train the OpenNLP program, but I’m running into an
> “array index out of bounds” exception in the BioCodec class everytime.
> 
>  
> 
> Below is a screen shot of the exception:
> 
>  
> 
> 
> 
> This happens multiple times during the iteration process; only after
> the third iteration does the program stop. I’m not sure if there’s
> anything I can do to resolve this issue. Any input would help.
> 
>  
> 
> Thanks,
> 
> Zach Blizzard
> 
>  
> 
> 





Re: mallet addon

2015-09-29 Thread Joern Kottmann
Hello,

this doesn't work with the 1.6.0 release, I built it for testing one of
the first drafts of the machine learning rewrite work we did for 1.6.0.
There have been a few changes afterwards.
Anyway, if you have a need for it I am happy to fix it up. We can also move
it to the sandbox, releasing it at Apache might be more difficult since
mallet pulls in incompatible licensed dependencies. But maybe that changed,
we can check.

What do you think?

Jörn



On Tue, Sep 29, 2015 at 2:34 PM, Rodrigo Agerri  wrote:

> Hello,
>
> I have seen that there is a mallet addon here
>
> https://github.com/kottmann/opennlp-mallet-addon
>
> is this currently being used or integrated in opennlp? I have not seen it
> with the rest of the addons.
>
> Cheers,
>
> Rodrigo
>


Re: svn commit: r1681259 - in /opennlp/trunk: opennlp-distr/pom.xml opennlp-docs/pom.xml opennlp-tools/pom.xml opennlp-uima/pom.xml pom.xml

2015-09-03 Thread Joern Kottmann
Hello,

yes the github apache/opennlp repository is always synchronized with our
subversion repository here at Apache.
If you have a look you will see recent changes in there.

Jörn

On Tue, May 26, 2015 at 6:07 AM, Ethan Wang  wrote:

> Hey folks,
>
> is g...@github.com:apache/opennlp.git still an official place for this
> project? If so is there anyone doing sync between svn and that?
>
> Thanks,
>
> Ethan
>
>
>
> > On May 22, 2015, at 9:19 PM, co...@apache.org wrote:
> >
> > Author: colen
> > Date: Sat May 23 02:19:41 2015
> > New Revision: 1681259
> >
> > URL: http://svn.apache.org/r1681259
> > Log:
> > [maven-release-plugin] prepare for next development iteration
> >
> > Modified:
> >opennlp/trunk/opennlp-distr/pom.xml
> >opennlp/trunk/opennlp-docs/pom.xml
> >opennlp/trunk/opennlp-tools/pom.xml
> >opennlp/trunk/opennlp-uima/pom.xml
> >opennlp/trunk/pom.xml
> >
> > Modified: opennlp/trunk/opennlp-distr/pom.xml
> > URL:
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-distr/pom.xml?rev=1681259=1681258=1681259=diff
> >
> ==
> > --- opennlp/trunk/opennlp-distr/pom.xml (original)
> > +++ opennlp/trunk/opennlp-distr/pom.xml Sat May 23 02:19:41 2015
> > @@ -24,7 +24,7 @@
> >   
> >   org.apache.opennlp
> >   opennlp
> > - 1.6.0
> > + 1.6.1-SNAPSHOT
> >   ../pom.xml
> >   
> >
> > @@ -37,12 +37,12 @@
> >   
> >   org.apache.opennlp
> >   opennlp-tools
> > - 1.6.0
> > + 1.6.1-SNAPSHOT
> >   
> >   
> >   org.apache.opennlp
> >   opennlp-uima
> > - 1.6.0
> > + 1.6.1-SNAPSHOT
> >   
> >   
> >
> >
> > Modified: opennlp/trunk/opennlp-docs/pom.xml
> > URL:
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/pom.xml?rev=1681259=1681258=1681259=diff
> >
> ==
> > --- opennlp/trunk/opennlp-docs/pom.xml (original)
> > +++ opennlp/trunk/opennlp-docs/pom.xml Sat May 23 02:19:41 2015
> > @@ -24,7 +24,7 @@
> >   
> >   org.apache.opennlp
> >   opennlp
> > - 1.6.0
> > + 1.6.1-SNAPSHOT
> > ../pom.xml
> >   
> >
> >
> > Modified: opennlp/trunk/opennlp-tools/pom.xml
> > URL:
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/pom.xml?rev=1681259=1681258=1681259=diff
> >
> ==
> > --- opennlp/trunk/opennlp-tools/pom.xml (original)
> > +++ opennlp/trunk/opennlp-tools/pom.xml Sat May 23 02:19:41 2015
> > @@ -25,7 +25,7 @@
> >   
> > org.apache.opennlp
> > opennlp
> > -1.6.0
> > +1.6.1-SNAPSHOT
> > ../pom.xml
> >   
> >
> >
> > Modified: opennlp/trunk/opennlp-uima/pom.xml
> > URL:
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-uima/pom.xml?rev=1681259=1681258=1681259=diff
> >
> ==
> > --- opennlp/trunk/opennlp-uima/pom.xml (original)
> > +++ opennlp/trunk/opennlp-uima/pom.xml Sat May 23 02:19:41 2015
> > @@ -25,7 +25,7 @@
> >   
> >   org.apache.opennlp
> >   opennlp
> > - 1.6.0
> > + 1.6.1-SNAPSHOT
> >   ../pom.xml
> > 
> >
> > @@ -46,7 +46,7 @@
> >   
> >   org.apache.opennlp
> >   opennlp-tools
> > - 1.6.0
> > + 1.6.1-SNAPSHOT
> >   
> >
> >   
> >
> > Modified: opennlp/trunk/pom.xml
> > URL:
> http://svn.apache.org/viewvc/opennlp/trunk/pom.xml?rev=1681259=1681258=1681259=diff
> >
> ==
> > Binary files - no diff available.
> >
> >
>
>


Re: Word Sense Disambiguator

2015-07-24 Thread Joern Kottmann
It would be nice if you could share instructions on how to run it.
I also would like to give it a try.

Jörn

On Fri, Jul 24, 2015 at 4:54 AM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Hello,
 Yes for the moment we are only using WordNet for sense definitions.The
 plan is to complete the package by mid to late August, but if you like you
 can follow up on the progress from the sandbox.
 Best regards,
 Anthony
  Date: Thu, 23 Jul 2015 15:36:57 +0300
  Subject: Word Sense Disambiguator
  From: cristian.petro...@gmail.com
  To: dev@opennlp.apache.org
 
  Hi,
 
  I saw that there are people actively working on a Word Sense
 Disambiguator.
  DO you guys know when will the module be ready to use? Also I assume that
  wordnet is used to define the disambiguated word meaning?
 
  Thanks,
  Cristian




Re: GSoC 2015 - WSD Module

2015-06-30 Thread Joern Kottmann
Can you please open some jira issues so we can better keep track of what
has to be done.

Jörn
On Jun 28, 2015 10:23 PM, Joern Kottmann kottm...@gmail.com wrote:

 Yes, the performance testing has to be there, otherwise it is hard to
 tell if it works or not.

 Jörn

 On Mon, 2015-06-29 at 02:02 +0900, Anthony Beylerian wrote:
  Dear Jörn,
 
  As a first milestone, for now we have the main interface with two
 implementations (one unsupervised, one supervised), maybe we can add an
 evaluator for performance tests and comparison with the test data we
 currently have (SemEval, SensEval test sets).
 
  Best,
 
  Anthony
 
   Subject: Re: GSoC 2015 - WSD Module
   From: kottm...@gmail.com
   To: dev@opennlp.apache.org
   Date: Thu, 25 Jun 2015 21:47:22 +0200
  
   On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
Hi,
   
I attached an initial patch to OPENNLP-758.
However, we are currently modifying things a bit since many
 approaches need to be supported, but would like your recommendations.
Here are some notes :
   
1 - We used extJWNL
2- [WSDisambiguator] is the main interface
3- [Loader] loads the resources required
4- Please check [FeaturesExtractor] for the mentioned methods by
 Rodrigo.
5- [Lesk] has many variants, we already implemented some, but
 wondering on the preferred way to switch from one to the other:
As of now we use one of them as default, but we thought of either
 making a parameter list to fill or make separate classes for each, or
 otherwise following your preference.
6- The other classes are for convenience.
   
We will try to patch frequently on the separate issues, following
 the feedback.
  
  
   Sounds good, I reviewed it and think what we have is quite ok.
  
   Most important now is to fix the smaller issues (see the jira issue)
 and
   explain to us how it can be run.
  
   The midterm evaluation is coming up next week as well.
  
   How are we standing with the milstone we set?
  
   Jörn
  
 




Re: GSoC 2015 - WSD Module

2015-06-28 Thread Joern Kottmann
Yes, the performance testing has to be there, otherwise it is hard to
tell if it works or not.

Jörn

On Mon, 2015-06-29 at 02:02 +0900, Anthony Beylerian wrote:
 Dear Jörn,
 
 As a first milestone, for now we have the main interface with two 
 implementations (one unsupervised, one supervised), maybe we can add an 
 evaluator for performance tests and comparison with the test data we 
 currently have (SemEval, SensEval test sets).  
 
 Best,
 
 Anthony
 
  Subject: Re: GSoC 2015 - WSD Module
  From: kottm...@gmail.com
  To: dev@opennlp.apache.org
  Date: Thu, 25 Jun 2015 21:47:22 +0200
  
  On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
   Hi,
   
   I attached an initial patch to OPENNLP-758.
   However, we are currently modifying things a bit since many approaches 
   need to be supported, but would like your recommendations.
   Here are some notes : 
   
   1 - We used extJWNL
   2- [WSDisambiguator] is the main interface
   3- [Loader] loads the resources required
   4- Please check [FeaturesExtractor] for the mentioned methods by Rodrigo.
   5- [Lesk] has many variants, we already implemented some, but wondering 
   on the preferred way to switch from one to the other:
   As of now we use one of them as default, but we thought of either making 
   a parameter list to fill or make separate classes for each, or otherwise 
   following your preference.
   6- The other classes are for convenience.
   
   We will try to patch frequently on the separate issues, following the 
   feedback.
  
  
  Sounds good, I reviewed it and think what we have is quite ok.
  
  Most important now is to fix the smaller issues (see the jira issue) and
  explain to us how it can be run.
  
  The midterm evaluation is coming up next week as well.
  
  How are we standing with the milestone we set?
  
  Jörn
  
 





Re: GSoC 2015 - WSD Module

2015-06-25 Thread Joern Kottmann
On Mon, 2015-06-22 at 00:55 +0900, Anthony Beylerian wrote:
 Dear Jörn,
 Thank you for that.
 
 After further surveying, I was thinking of beginning the implementation of an 
 approach based on context clustering as a next step.
 Maybe similar to the one in [1] which relies on a public (CC-A licensed) 
 dataset [2]. Since clustering is usually done using K-means, which could take
 some time with large data, this was already done previously and the results 
 were made publicly available in [3] with up to 20 closest clusters per 
 phrase.
 The authors in [1] propose to subsequently apply a Naive Bayes classifier as 
 described in their paper. I believe this is straight-forward enough to
 implement as another unsupervised approach for the proposed time-frame.
 Would like your opinion.

Sounds good to me. I will read the paper now, and come back here if I
have any questions.

Jörn




Re: GSoC 2015 - WSD Module

2015-06-25 Thread Joern Kottmann
On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
 Hi,
 
 I attached an initial patch to OPENNLP-758.
 However, we are currently modifying things a bit since many approaches need 
 to be supported, but would like your recommendations.
 Here are some notes : 
 
 1 - We used extJWNL
 2- [WSDisambiguator] is the main interface
 3- [Loader] loads the resources required
 4- Please check [FeaturesExtractor] for the mentioned methods by Rodrigo.
 5- [Lesk] has many variants, we already implemented some, but wondering on 
 the preferred way to switch from one to the other:
 As of now we use one of them as default, but we thought of either making a 
 parameter list to fill or make separate classes for each, or otherwise 
 following your preference.
 6- The other classes are for convenience.
 
 We will try to patch frequently on the separate issues, following the 
 feedback.


Sounds good, I reviewed it and think what we have is quite ok.

Most important now is to fix the smaller issues (see the jira issue) and
explain to us how it can be run.

The midterm evaluation is coming up next week as well.

How are we standing with the milestone we set?

Jörn





Re: GSoC 2015 - WSD Module

2015-06-25 Thread Joern Kottmann
On Mon, 2015-06-22 at 00:55 +0900, Anthony Beylerian wrote:
 Dear Jörn,
 Thank you for that.
 
 After further surveying, I was thinking of beginning the implementation of an 
 approach based on context clustering as a next step.
 Maybe similar to the one in [1] which relies on a public (CC-A licensed) 
 dataset [2]. Since clustering is usually done using K-means, which could take
 some time with large data, this was already done previously and the results 
 were made publicly available in [3] with up to 20 closest clusters per 
 phrase.
 The authors in [1] propose to subsequently apply a Naive Bayes classifier as 
 described in their paper. I believe this is straight-forward enough to
 implement as another unsupervised approach for the proposed time-frame.
 Would like your opinion.

Your users can just download the dataset and do the clustering themselves.
It should be possible to do that anyway. All the code necessary to
do that should be available as part of your contribution.

Jörn




Re: GSoC 2015 - WSD Module

2015-06-19 Thread Joern Kottmann
Hello,

I will dedicate time tonight to get this pulled in the sandbox and will
then also provide some feedback.
We can then create new patches against the sandbox to fix further issues.

Jörn

On Fri, Jun 19, 2015 at 11:02 AM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Thank you for the reply, I am guessing for now we will use the other
 sources.

 By the way, I  have uploaded a newer patch on the same issue [1].
 Would like to know if the approach to set parameters is acceptable.

 Also, we are referencing some model files locally, like tokenizer,
 tagger, etc., because we need them for the preprocessing chain. For example:

 ++
 private static String modelsDir =
     "src\\test\\resources\\opennlp\\tools\\disambiguator\\";

 TokenizerModel tokenizerModel = new TokenizerModel(
     new FileInputStream(modelsDir + "en-token.bin"));
 tokenizer = new TokenizerME(tokenizerModel);
 ++

 Thought of adding these files (.bin) in the test folder, but could anyone
 recommend a more elegant way  to do this ?
 Thanks !

 Anthony

 [1] : https://issues.apache.org/jira/browse/OPENNLP-758


  From: rage...@apache.org
  Date: Fri, 19 Jun 2015 10:18:12 +0200
  Subject: Re: GSoC 2015 - WSD Module
  To: dev@opennlp.apache.org
 
  Thanks for the update and the updated patch.
 
  With respect to the licensing of BabelNet, I do not think we can
  redistribute CC BY-NC-SA resources here, but others in this project
  and Apache in general will probably know better than me.
 
  Best,
 
  Rodrigo
 
  On Sun, Jun 14, 2015 at 2:47 PM, Anthony Beylerian
  anthonybeyler...@hotmail.com wrote:
   Hi,
   Concerning this point, I would like to ask about BabelNet [1].The
 advantages of [1] is that it integrates WordNet, Wikipedia, Wiktionary,
 OmegaWiki, Wikidata, and Open Multi-WordNet.
   Also, the newest SemEval task (which results are just out [2]) relies
 on it.
  
   However, the 2.5.1 version, which can be used locally, follows a CC
 BY-NC-SA 3.0 license [3]. I read in [4] that CC-A (Attribution) licenses are
 acceptable, however I am not completely sure if the NC-SA
 (Non-commercial/ShareAlike) terms would be prohibitive since it was
 mentioned that :
   Many of these licenses have specific attribution terms that need to
 be adhered to, for example CC-A, often by adding them to the NOTICE file.
 Ensure you are doing this when including these works. Note, this list is
 colloquially known as the Category A list.
   Would like your thoughts on the matter.
   Thanks !
   Anthony
   [1] : http://babelnet.org/download
   [2] : http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval049.pdf
   [3] : https://creativecommons.org/licenses/by-nc-sa/3.0/
   [4] : http://www.apache.org/legal/resolved.html#category-a
  
   Date: Fri, 5 Jun 2015 15:09:24 +0200
   Subject: Re: GSoC 2015 - WSD Module
   From: kottm...@gmail.com
   To: dev@opennlp.apache.org
  
   Hello,
  
   yes, wordnet is fine, we already depend on it. I just think that remote
   resources are particularly problematic.
  
   For local resources it boils down to their license.
  
   Here is the wordnet one:
   http://wordnet.princeton.edu/wordnet/license/
  
   We might even be able to redistribute this here at Apache, which is
 really
   nice. To do that we have to check
   with the legal list if they give a green light for it.
  
   You can get more information about licenses and dependencies for
 Apache
   projects here:
   http://www.apache.org/legal/resolved.html#category-a
   http://www.apache.org/legal/resolved.html#category-b
   http://www.apache.org/legal/resolved.html#category-x
  
   Are the things you have to clean up of the nature that you couldn't
 do that
   after you send in a patch?
   This could be removal of code which can be released under ASL.
  
   We would like to get you integrated into the way we work here as
 quickly as
   possible.
  
   That includes:
   - Tasks are planned/tracked via jira (this allows other people to
   comment/follow)
   - We would like to be able to review your code and maybe give some
 advice
   (commit often, break things down in tasks)
   - Changes or new features are usually discussed on the dev list
 (e.g. a
   short write up about the approaches you implemented
 or better plan to implement)
  
   Jörn
  
  




Re: GSoC 2015 - WSD Module

2015-06-10 Thread Joern Kottmann
You can attach the patch to one of the issues, or you can create a new issue.
In the end it doesn't matter much, but important is that we make progress
here and get the initial code into our repository. Subsequent changes can
then be done in a patch series.

Please try to submit the patch as quickly as possible.

Jörn

On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri rage...@apache.org wrote:

 Hello,

 On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
 mondher.bouaz...@gmail.com wrote:
  Dear Rodrigo,
 
  As Anthony mentioned in his previous email, I already started the
  implementation of the IMS approach. The pre-processing and the extraction
  of features have already been finished. Regarding the approach itself, it
  shows some potential according to the author though the features proposed
  are not so many, and are basic.

 Hi, yes, the features are not that complex, but it is good to have a
 working system and then if needed the feature set can be
 improved/enriched. As stated in the paper, the IMS approach leverages
 parallel data to obtain state of the art results in both lexical
 sample and all words for senseval 3 and semeval 2007 datasets.

 I think it will be nice to have a working system with this algorithm
 as part of the WSD component in OpenNLP (following the API discussion
 previous in this thread) and perform some evaluations to know where
 the system is with respect to state of the art results in those
 datasets. Once this is operative, I think it will be a good moment to
 start discussing additional/better features.

  I think the approach itself might be
  enhanced if we add more context specific features from some other
  approaches... (To do that, I need to run many experiments using different
  combinations of features, however, that should not be a problem).

 Speaking about the feature sets, in the API google doc I have not seen
 anything about the implementation of the feature extractors, could you
 perhaps provide some extra info (in that same document, for example)
 about that?

  But the approach itself requires a linear SVM classifier, and as far as I
  know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
 libsvm
  ?

 I think you can try with a MaxEnt to start with and in the meantime,
 @Jörn has commented sometimes that there is a plugin component in
 OpenNLP to use third-party ML libraries and that he tested it with
 Mallet. Perhaps he could comment on this to use that functionality to
 use SVMs.

 
  Regarding the training data, I started collecting some from different
  sources. Most of the existing rich corpora are licensed (Including the
 ones
  mentioned in the paper). The free ones I got for now are from the
 Senseval
  and Semeval websites. However, these are used just to evaluate the
 proposed
  methods in the workshops. Therefore, the words to disambiguate are few in
  number though the training data for each word are rich enough.
 
  In any case, the first tests with Senseval and Semeval collected should
 be
  finished soon. However, I am not sure if there is a rich enough Dataset
 we
  can use to make our model for the WSD module in the OpenNLP library.
  If you have any recommendation, I would be grateful if you can help me on
  this point.

 Well, as I said in my previous email, research around word senses is
 moving from WSD towards Supersense tagging where there are recent
 papers and freely available tweet datasets, for example. In any case,
 we can look more into it but in the meantime the Semcor for training
 and senseval/semeval2007 datasets for evaluation should be enough to
 compare your system with the literature.

 
  As Jörn mentioned sending an initial patch, should we separate our codes
  and upload two different patches to the two issues we created on the Jira
  (however, this means a lot of redundancy in the code), or shall we keep
  them in one project and upload it? If we opt for the latter case, which
  issue should we upload the patch to ?

 In my opinion, it should be the same patch and same Component with
 different algorithm implementations within it. Any other opinions?

 Cheers,

 Rodrigo



Re: GSoC 2015 - WSD Module

2015-06-05 Thread Joern Kottmann
Hello,

yes, wordnet is fine, we already depend on it. I just think that remote
resources are particularly problematic.

For local resources it boils down to their license.

Here is the wordnet one:
http://wordnet.princeton.edu/wordnet/license/

We might even be able to redistribute this here at Apache, which is really
nice. To do that we have to check
with the legal list if they give a green light for it.

You can get more information about licenses and dependencies for Apache
projects here:
http://www.apache.org/legal/resolved.html#category-a
http://www.apache.org/legal/resolved.html#category-b
http://www.apache.org/legal/resolved.html#category-x

Are the things you have to clean up of the nature that you couldn't do that
after you send in a patch?
This could be removal of code which can be released under ASL.

We would like to get you integrated into the way we work here as quickly as
possible.

That includes:
- Tasks are planned/tracked via jira (this allows other people to
comment/follow)
- We would like to be able to review your code and maybe give some advice
(commit often, break things down in tasks)
- Changes or new features are usually discussed on the dev list (e.g. a
short write up about the approaches you implemented
  or better plan to implement)

Jörn




On Fri, Jun 5, 2015 at 2:24 PM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Hi,

 We understand the issues.

 So just to make sure, we are currently relying on JWNL to access WordNet
 as a resource. Is that fine for now ?

 In case we need to avoid such dependencies,  would it be ok to create a
 resource file that includes what we need extracted from it or also from
 other resources combined (sense inventory, word relationships and so on) ?
 We'd like your recommendation.

 Also we are currently cleaning up the project and will upload a patch.
 To sum up, we have already implemented the Lesk approach, as well as parts
 of the supervised IMS approach (preprocessing, feature extraction).
 Next, we will implement the baseline techniques and collect the training
 data that will be used by supervised approaches.
 Files will be collected from different sources and will be unified in a
 single model file.
 Best regards,

 Anthony, Mondher


  Date: Wed, 3 Jun 2015 16:47:50 +0200
  Subject: Re: GSoC 2015 - WSD Module
  From: kottm...@gmail.com
  To: dev@opennlp.apache.org
 
  We should not use remote resources. A remote service adds severe limits
 to
  the WSD component. A remote resource will be slow to query (compared to
  disk or memory), queries might be expensive (pay per request), the
 license
  might not allow usage in a way the ASL promises to our users. Another
 issue
  is that calling a remote service might leak the document text itself to
  that remote service.
 
  Please attach a patch to the jira issue, and then we can pull it into the
  sandbox.
 
  Jörn
 
 
 
 
 
  On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian 
  anthonybeyler...@hotmail.com wrote:
 
   Dear Jörn,
  
   Thank you for the reply.===
   Yes in the draft WSDisambiguator is the main interface.
   ===
   Yes for the disambiguate method the input is expected to be tokenized,
 it
   should be an input array.
   The second argument is for the token index.  We can also make it into
 an
   index array to support multiple words.
   ===
   Concerning the resources, we expect two types of resources : local and
   remote resources.
  
   + For local resources, we have two main types :
   1- training models for supervised techniques.
   2- knowledge resources
  
   It could be best to make the packaging using similar OpenNLP models
 for #1.
   As for #2, it will depend on what we want to use,  since the type of
   information depends on the specific technique.
  
   + As for remote resources ex: [BabelNet], [WordsAPI], etc. we might
 need
   to have some REST support, for example to retrieve a sense inventory
 for a
   certain word.Actually, the newest semeval task [Semeval15] will use
   [BabelNet] for WSD and EL (Entity Linking).[BabelNet] has an offline
   version, but the newest one is only available through REST.Also, in
 case it
   is needed to use a remote resource, AND it typically requires a
 license, we
   need to use a license key or just use the free quota with no key.
  
   Therefore, we thought of having a [ResourceProvider] as mentioned in
 the
   [draft].
   Are there any plans to add an external API connector of the sort or is
   this functionality already possible for extension ?
   (I noticed there is a [wikinews_importer] in the sandbox)
  
   But in any case we can always start working only locally as a first
 step,
   what do you think ?
   ===
   It would be more straightforward to use the algorithm names, so ok why
 not.
   ===
   Yes we have already started working !
   What do we 

Re: GSoC 2015 - WSD Module

2015-06-03 Thread Joern Kottmann
We should not use remote resources. A remote service adds severe limits to
the WSD component. A remote resource will be slow to query (compared to
disk or memory), queries might be expensive (pay per request), the license
might not allow usage in a way the ASL promises to our users. Another issue
is that calling a remote service might leak the document text itself to
that remote service.

Please attach a patch to the jira issue, and then we can pull it into the
sandbox.

Jörn





On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Dear Jörn,

 Thank you for the reply.===
 Yes in the draft WSDisambiguator is the main interface.
 ===
 Yes for the disambiguate method the input is expected to be tokenized, it
 should be an input array.
 The second argument is for the token index.  We can also make it into an
 index array to support multiple words.
 ===
 Concerning the resources, we expect two types of resources : local and
 remote resources.

 + For local resources, we have two main types :
 1- training models for supervised techniques.
 2- knowledge resources

 It could be best to make the packaging using similar OpenNLP models for #1.
 As for #2, it will depend on what we want to use,  since the type of
 information depends on the specific technique.

 + As for remote resources ex: [BabelNet], [WordsAPI], etc. we might need
 to have some REST support, for example to retrieve a sense inventory for a
 certain word. Actually, the newest semeval task [Semeval15] will use
 [BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline
 version, but the newest one is only available through REST. Also, in case it
 is needed to use a remote resource, AND it typically requires a license, we
 need to use a license key or just use the free quota with no key.

 Therefore, we thought of having a [ResourceProvider] as mentioned in the
 [draft].
 Are there any plans to add an external API connector of the sort or is
 this functionality already possible for extension ?
 (I noticed there is a [wikinews_importer] in the sandbox)

 But in any case we can always start working only locally as a first step,
 what do you think ?
 ===
 It would be more straightforward to use the algorithm names, so ok why not.
 ===
 Yes we have already started working !
 What do we need to push to the sandbox ?
 ===

 Thanks !

 Anthony

 [BabelNet] : http://babelnet.org/download
 [WordsAPI] : https://www.wordsapi.com/
 [Semeval15] : http://alt.qcri.org/semeval2015/task13/
 [draft] :
 https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1


  Subject: Re: GSoC 2015 - WSD Module
  From: kottm...@gmail.com
  To: dev@opennlp.apache.org
  Date: Mon, 1 Jun 2015 20:30:08 +0200
 
  Hello,
 
  I had a look at your APIs.
 
   Let's start with the WSDisambiguator. Should that be an interface?
 
  // returns the senses ordered by their score (best one first or only 1
  in supervised case)
  String[] disambiguate(String inputText,int inputWordposition);
 
  Shouldn't we have a tokenized input? Or is the inputText a token?
 
  If you have resources you could package those into OpenNLP models and
  use the existing serialization support. Would that work for you?
 
  I think we should have different implementing classes for different
  algorithms rather than grouping that in the Supervised and Unsupervised
  classes. And also use the algorithm / approach name as part of the class
  name.
 
   As far as I understand you already started to work on this. Should we do an
   initial code drop into the sandbox, and then work out things from there?
   We strongly prefer to have as much source code editing history as possible
   in our version control system.
 
  Jörn




Re: GSoC 2015 - WSD Module

2015-06-01 Thread Joern Kottmann
Hello,

I had a look at your APIs.

Let's start with the WSDisambiguator. Should that be an interface?

// returns the senses ordered by their score (best one first or only 1
in supervised case)
String[] disambiguate(String inputText,int inputWordposition);

Shouldn't we have a tokenized input? Or is the inputText a token?
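
For instance, a tokenized variant could look roughly like this (only a sketch
to make the question concrete, not a final signature):

  // tokens: the tokenized sentence, tokenIndex: the token to disambiguate;
  // returns the senses ordered by their score, best one first
  String[] disambiguate(String[] tokens, int tokenIndex);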

If you have resources you could package those into OpenNLP models and
use the existing serialization support. Would that work for you?

I think we should have different implementing classes for different
algorithms rather than grouping that in the Supervised and Unsupervised
classes. And also use the algorithm / approach name as part of the class
name.

As far as I understand you already started to work on this. Should we do an
initial code drop into the sandbox, and then work out things from there?
We strongly prefer to have as much source code editing history as possible
in our version control system.

Jörn 

On Sat, 2015-05-23 at 01:44 +0900, Anthony Beylerian wrote:
 Hello,
 
 Thank you for the feedback.
 
 Please use this link to access a quick draft of the interface :
 https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1
 
 I believe the previously mentioned link was not allowing for document updates.
 
 As for the common interface, since supervised methods rely on classifiers 
 they will need to load/save the training models, so we will need to separate 
 the two, maybe as in the draft.
 However we could keep a parent class with a common [disambiguate] method that 
 can be used for evaluation tasks and others.
 
 Thanks !
 
 Anthony
 
 
 
  Date: Fri, 22 May 2015 09:18:39 +0200
  Subject: Re: GSoC 2015 - WSD Module
  From: kottm...@gmail.com
  To: dev@opennlp.apache.org
  
  Hello,
  
  one of the tasks we should start is to define the interface for the WSD
  component.
  
  Please have a look at the other components in OpenNLP and try to propose an
  interface in a similar style.
  Can we use one interface for all the different implementations?
  
  Jörn
  
  
  On Mon, May 18, 2015 at 3:27 PM, Mondher Bouazizi 
  mondher.bouaz...@gmail.com wrote:
  
   Dear all,
  
   Sorry if you received multiple copies of this email (The links were
   embedded). Here are the actual links:
  
   *Figure:*
  
   https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
   *Semeval/senseval results summary:*
  
   https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
   *Literature survey of WSD techniques:*
  
   https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing
  
   Yours faithfully
  
   On Mon, May 18, 2015 at 10:17 PM, Anthony Beylerian 
   anthonybeyler...@hotmail.com wrote:
  
Please excuse the duplicate email, we could not attach the mentioned
figure.
Kindly find it here.
Thank you.
   
From: anthonybeyler...@hotmail.com
To: dev@opennlp.apache.org
Subject: GSoC 2015 - WSD Module
Date: Mon, 18 May 2015 22:14:43 +0900
   
   
   
   
Dear all,
In the context of building a Word Sense Disambiguation (WSD) module,
   after
doing a survey on WSD techniques, we realized the following points :
- WSD techniques can be split into three sets (supervised,
unsupervised/knowledge based, hybrid) - WSD is used for different
   directly
related objectives such as all-words disambiguation, lexical sample
disambiguation, multi/cross-lingual approaches etc.- Senseval/Semeval
   seem
to be good references to compare different techniques for WSD since many
   of
them were tested on the same data (but different one each event).- For
   the
sake of making a first solution, we propose to start with supporting the
lexical sample type of disambiguation, meaning to disambiguate
single/limited word(s) from an input text.
Therefore, we have decided to collect information about the different
techniques in the literature (such as  references, performance,
   parameters
etc.) in this spreadsheet here.Otherwise we have also collected the
   results
of all the senseval/semeval exercises here.(Note that each document has
many sheets)The collected results, could help decide on which techniques
   to
start with as main models for each set of techniques
(supervised/unsupervised).
We also propose a general approach for the package in the figure
attached.The main components are as follows :
1- The different resources publicly available : WordNet, BabelNet,
Wikipedia, etc.However, we would also like to allow the users to use
   their
own local resources, by maybe defining a type of connector to the
   resource
interface.
2- The resource interface will have the role to provide both a sense
inventory that the user can query and a knowledge base (such as semantic
   or
syntactic info. etc.) that might be used depending on the 

Re: OpenNLP 1.6.0 RC 4 ready for testing

2015-05-28 Thread Joern Kottmann
The chunker and parser tests are fine now.

Do you know what's the deal with the sentence detector?

The compatibility test is marked as failed. Can we leave it like that or do
we have to fix some bugs?

Jörn
On May 23, 2015 5:35 AM, William Colen co...@apache.org wrote:

 Our fourth release candidate is ready for testing. RC 3 failed in the
 compatibility, regression and performance tests, which are fixed in RC 4.

 The RC 4 can be downloaded from here:
 http://people.apache.org/~colen/releases/opennlp-1.6.0/rc4/

 To use it in a maven build set the version for opennlp-tools or
 opennlp-uima to 1.6.0 and add the following URL to your settings.xml file:
 https://repository.apache.org/content/repositories/orgapacheopennlp-1003

 The current test plan can be found here:
 https://cwiki.apache.org/confluence/display/OPENNLP/TestPlan1.6.0

 Please sign up for tasks in the test plan.

 The release plan can be found here:

 https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.6.0

 The release contains quite some changes, please refer to the contained
 issue list for details.

 For your convenience, a copy of the issue list, as well as the release
 notes and the readme, can be found in the following link:


 http://people.apache.org/~colen/releases/opennlp-1.6.0/rc4/RELEASE_NOTES.html


 Thank you,
 William



Re: GSoC 2015 - WSD Module

2015-05-22 Thread Joern Kottmann
Hello,

one of the tasks we should start is to define the interface for the WSD
component.

Please have a look at the other components in OpenNLP and try to propose an
interface in a similar style.
Can we use one interface for all the different implementations?

Jörn


On Mon, May 18, 2015 at 3:27 PM, Mondher Bouazizi 
mondher.bouaz...@gmail.com wrote:

 Dear all,

 Sorry if you received multiple copies of this email (The links were
 embedded). Here are the actual links:

 *Figure:*

 https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
 *Semeval/senseval results summary:*

 https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
 *Literature survey of WSD techniques:*

 https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing

 Yours faithfully

 On Mon, May 18, 2015 at 10:17 PM, Anthony Beylerian 
 anthonybeyler...@hotmail.com wrote:

  Please excuse the duplicate email, we could not attach the mentioned
  figure.
  Kindly find it here.
  Thank you.
 
  From: anthonybeyler...@hotmail.com
  To: dev@opennlp.apache.org
  Subject: GSoC 2015 - WSD Module
  Date: Mon, 18 May 2015 22:14:43 +0900
 
 
 
 
  Dear all,
  In the context of building a Word Sense Disambiguation (WSD) module,
 after
  doing a survey on WSD techniques, we realized the following points :
  - WSD techniques can be split into three sets (supervised,
  unsupervised/knowledge based, hybrid) - WSD is used for different
 directly
  related objectives such as all-words disambiguation, lexical sample
  disambiguation, multi/cross-lingual approaches etc.- Senseval/Semeval
 seem
  to be good references to compare different techniques for WSD since many
 of
  them were tested on the same data (but different one each event).- For
 the
  sake of making a first solution, we propose to start with supporting the
  lexical sample type of disambiguation, meaning to disambiguate
  single/limited word(s) from an input text.
  Therefore, we have decided to collect information about the different
  techniques in the literature (such as  references, performance,
 parameters
  etc.) in this spreadsheet here.Otherwise we have also collected the
 results
  of all the senseval/semeval exercises here.(Note that each document has
  many sheets)The collected results, could help decide on which techniques
 to
  start with as main models for each set of techniques
  (supervised/unsupervised).
  We also propose a general approach for the package in the figure
  attached.The main components are as follows :
  1- The different resources publicly available : WordNet, BabelNet,
  Wikipedia, etc.However, we would also like to allow the users to use
 their
  own local resources, by maybe defining a type of connector to the
 resource
  interface.
  2- The resource interface will have the role to provide both a sense
  inventory that the user can query and a knowledge base (such as semantic
 or
  syntactic info. etc.) that might be used depending on the technique.We
  might even later consider building a local cache for remote services.
  3- The WSD algorithms/techniques themselves that will make use of the
  resource interface to access the resources required.These techniques will
  be split into two main packages as in the left side of the figure :
  Supervised/Unsupervised.The utils package includes common tools used in
  both types of techniques.The details mentioned in each package should be
  common to all implementations of these abstract models.
  4- I/O could be processed in different formats (XML/JSON etc) or a
 simpler
  structure following your recommendations.
  If you have any suggestions or recommendations, we would really
 appreciate
  discussing them and would like your guidance to iterate on this tool-set.
  Best regards,
 
  Anthony Beylerian, Mondher Bouazizi
 



W2VClassesDictionary class

2015-05-22 Thread Joern Kottmann
Hello,

looks like this class was renamed into WordClusterDictionary.

Can the class W2VClassesDictionary be removed?
We shouldn't include it in RC4 when it is not necessary.

Thanks,
Jörn


OpenNLP RC4

2015-05-22 Thread Joern Kottmann
Hello,

we should now be in a good state to do RC4. We finally solved
the performance problems with the parser, and a couple
of very minor things were fixed as well (e.g. NOTICE file update).

A major addition since RC3 is the automated evaluation tests
to speed up our release process. I hope this will significantly reduce
the amount of time required to ensure RC4 is working properly.

Jörn


Re: How to start contributing to OpenNLP

2015-05-12 Thread Joern Kottmann
Hello,

the best way to start is to find something you feel comfortable doing.
That could be fixing a bug or implementing a certain feature.

Yes, have a look at JIRA there are many issues.

Is there some component you would prefer working on?

HTH,
Jörn


On Tue, May 12, 2015 at 5:34 PM, Haider Ali alihaider...@gmail.com wrote:

 Hello Everyone,

 I am new to OpenNLP group, i want to contribute to OpenNLP. Kindly guide me
 where should i start ? Should i be looking straight at JIRA ?

 Thank You

 --
 Haider Ali



Re: Automated testing with public data

2015-04-29 Thread Joern Kottmann
Or we just make a download script which bootstraps the user's corpus folder.

Could be a couple of wget lines or so ...


Jörn

On Wed, Apr 29, 2015 at 6:17 AM, William Colen william.co...@gmail.com
wrote:

 Automating the download would be fine as long as we cache it, as Richard
 suggested. Maybe it could be done by a script to prepare the environment,
 and not be part of the unit test itself.
 Anyway, it would be a good idea to save the data somewhere because we never
 know if some of the websites will become unavailable in the future.


 2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho 
 richard.eck...@gmail.com:

  On 15.04.2015, at 10:23, Joern Kottmann kottm...@gmail.com wrote:
 
   With publicly accessible data I mean a corpus you can somehow acquire,
   opposed to the data you create on your own for a project.
  
   All the corpora we support in the formats package are publicly
  accessible.
   Maybe
   some you have to buy and for others you just have to sign some
 agreement.
  
   A very interesting corpus for testing (and training models on) is
  OntoNotes.
  
   Here is a link to the LDC entry:
   https://catalog.ldc.upenn.edu/LDC2011T03
  
   You can get it for free (or for a small distribution fee) but you can't
   just download it.
   It would be great if the ASF could acquire this data set so we can
 share
  it
   among the committers.
  
   Is that what you mean with proprietary data?
 
  Yes, that is what I mean.
 
  E.g. the TIGER corpus requires clicking through some pages and forms to
  reach a download page, but in principle, it appears as if the corpus was
  simply downloadable by a deep-link URL. The license terms state, that the
  corpus must not be redistributed.
 
  Some tools are also publicly accessible and downloadable but not
  redistributable. For example anybody can download TreeTagger and its
  models, but only from the original homepage. It is not permitted to
  redistribute it, i.e. to publish it to a repository or offer it on an
  alternative homepage.
 
  So there is a (small) class of resources between being redistributable
 and
  proprietary (for fee), namely being in principle publicly accessible (for
  free) but not redistributable.
 
  Cheers,
 
  -- Richard



Automated testing with public data

2015-04-14 Thread Joern Kottmann
Hi all,

this time the progress with the testing for 1.6.0 is rather slow. Most
tests are done now and I believe we are in good shape to build RC3.
Anyway, it would have been better to be at that stage a month ago.

To improve the situation in the future I would like to propose to automate
all tests which can be run against data which is publicly available. These
tests are all set up following the same pattern: they train a component on
a corpus and afterwards evaluate against it. If the results match the
result of the previous release, we hope the code doesn't contain any
regressions. In some cases we have changes which influence the performance
(e.g. bug fixes); in that case we adjust the expected performance score and
carefully test that a particular change caused it.

We sometimes have changes which shouldn't influence the performance of a
component but still do due to some mistakes. These we need to identify
during testing.

The big issue we have with testing against public data is that we usually
can't include the data in the OpenNLP release because of their license. And
today we just do all the work manually by training on a corpus and
afterwards running the built in evaluation against the model.

I suggest we write JUnit tests which do this in case the user has
the right corpus for the test. Those tests will be disabled by default and
can be run by providing the -Dtest property and the location of the data
directory.

For example.
mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data

The tests will do all the work and fail if the expected results don't match.
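
As a rough sketch (class, corpus and score names below are just placeholders),
such a test could look like this:

  import java.io.File;

  import org.junit.Assume;
  import org.junit.Before;
  import org.junit.Test;

  public class Conll06Eval {

    private File corpusDir;

    @Before
    public void setUp() {
      String dir = System.getProperty("OPENNLP_CORPUS_DIR");
      // skip (instead of fail) when no corpus directory was provided
      Assume.assumeNotNull(dir);
      corpusDir = new File(dir);
      Assume.assumeTrue(corpusDir.isDirectory());
    }

    @Test
    public void evaluate() throws Exception {
      // train on the corpus found in corpusDir, run the built-in evaluator
      // and compare against the score of the previous release, e.g.
      // Assert.assertEquals(expectedAccuracy, evaluator.getWordAccuracy(), 0.0001);
    }
  }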

Automating those tests has the great advantage that we can run them much
more frequently during the development phase and hopefully identify bugs
before we even start with the release process.
Additionally, we might be able to run that on our build server.

Any opinions?

Jörn


Re: svn commit: r1670574 - /opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/namefind/NameFinder.java

2015-04-01 Thread Joern Kottmann
The adaptive data is cleared in the documentDone method. The statement in
the issue that it is not cleared is not true afaik.

Jörn

On Wed, Apr 1, 2015 at 9:47 AM, tomm...@apache.org wrote:

 Author: tommaso
 Date: Wed Apr  1 07:47:41 2015
 New Revision: 1670574

 URL: http://svn.apache.org/r1670574
 Log:
 OPENNLP-764 - applied patch from Pablo Duboue, clearing adaptive data
 after doc processing

 Modified:

 opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/namefind/NameFinder.java

 Modified:
 opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/namefind/NameFinder.java
 URL:
 http://svn.apache.org/viewvc/opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/namefind/NameFinder.java?rev=1670574r1=1670573r2=1670574view=diff

 ==
 ---
 opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/namefind/NameFinder.java
 (original)
 +++
 opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/namefind/NameFinder.java
 Wed Apr  1 07:47:41 2015
 @@ -169,6 +169,8 @@ public final class NameFinder extends Ab
documentConfidence.add(prob);
  }

 +mNameFinder.clearAdaptiveData();
 +
  return names;
}

 @@ -210,4 +212,4 @@ public final class NameFinder extends Ab
public void destroy() {
  mNameFinder = null;
}
 -}
 \ No newline at end of file
 +}





Re: Regarding performance of opennlp entity extraction modals

2015-03-16 Thread Joern Kottmann
Hello,

I don't have any numbers for you. The performance depends highly on the
model you are using, the configured feature generation and the number of
features in your training data.

To get a good number you probably have to run a test on your machines.
All modern CPUs have multiple cores these days, so you can run the same
process once per core.

Other things which might limit your throughput are the way you read the
text data and store the results.

Jörn

On Mon, 2015-03-16 at 19:04 +0530, Anuj Chopra wrote:
 hi,
 i wanted some information regarding the performance of opennlp entity
 extraction modals in documents/seconds and Mb/seconds.
 Currently I am using person, location, organisation and money extraction
 modals.
 If possible, please tell the speeds when combination of modals is used too.
 Thank you
 -anuj chopra





Re: Student looking to contribute toward OpenNLP

2015-03-16 Thread Joern Kottmann
Hello,

thanks for your interest in OpenNLP. We already have a lot of candidates
for those GSOC issues.

You are welcome to suggest something you would like to work on here on
the dev list, create an issue for it and contribute some code to solve
it.

The best way to get started is probably to look for an existing issue
which sounds like you can tackle it and send us a patch for it.

A good way to get started is probably to add support for a new corpus to
OpenNLP. This teaches you many basics about how to train the
components.

HTH,
Jörn

On Mon, 2015-03-16 at 09:34 +0530, Rohit Shinde wrote:
 Hello everyone,
 
 I still haven't got a reply to my previous email and I would really
 appreciate a reply to that.
 
 I would like to contribute as soon as possible.
 
 Thank you.





Re: Parser performance bug

2015-03-09 Thread Joern Kottmann
On Fri, 2015-03-06 at 21:07 +0100, Joern Kottmann wrote:
 The parser still uses the old style of setting the beam size via the
 constructor. Due to the changes to move that to the training time it
 doesn't work anymore. The parser has to be changed to set the beam
 size
 during training time instead.


I committed a fix for this under OPENNLP-763. The parser should work now
again like it did in 1.5.3.

Jörn




Re: Parser performance bug

2015-03-06 Thread Joern Kottmann
Hello,

made some progress with this. The problem is caused by the handling of
the beam size for the POS Tagger.

One way to set the beam size is to include it in the training params.
This method is the only way which works properly with the redesign of
the ml package.
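
For example (sketch; the beam size value is arbitrary):

  import opennlp.tools.ml.BeamSearch;
  import opennlp.tools.util.TrainingParameters;

  TrainingParameters params = TrainingParameters.defaultParams();
  // BEAM_SIZE_PARAMETER is the "BeamSize" key in the training params
  params.put(BeamSearch.BEAM_SIZE_PARAMETER, "10");
  // then hand params to the component's train(...) method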

In 1.6.0 it is possible to specify a user implemented classifier and not
all classifiers are using BeamSearch. The parameter doesn't make sense
without BeamSearch. Therefore all the constructors where the beam size
param can be specified should be deprecated/removed.

Anyway, this way of setting the beam size doesn't work due to various
smaller issues in the code. I fixed that in OPENNLP-762.

The parser still uses the old style of setting the beam size via the
constructor. Due to the changes to move that to the training time it
doesn't work anymore. The parser has to be changed to set the beam size
during training time instead.

Jörn


On Sat, 2015-02-21 at 02:13 -0200, William Colen wrote:
 I might be totally wrong, but I have a feeling that the change is
 in ChunkerModel.java, because I also notice a change in the Chunker tool
 results. It could be somehow related to the changes in the parameters in
 that file. We can't discard the possibility that there was a bug that was
 fixed with the changes.
 
 
 Regards,
 William
 
 2015-02-16 12:17 GMT-02:00 Joern Kottmann kottm...@gmail.com:
 
  Hi all,
 
  the performance of the parser changed a bit. The output of the current
  version in 1.6.0 RC2 is different from the output of the 1.5.3 release.
  Even though there shouldn't be any difference as far as I can see.
 
  The question of what caused that difference came up and I started to
  bisect it.
 
  Here are my results so far:
  1655561 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (head)
  1591889 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (5/2/14)
  1576093 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f  (3/10/14)
  1574819 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (3/6/14)
  1574524 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (3/5/14)
  1574505 - 93c912e100932384465ec740d144a94656f214d3 (3/5/14)
  1573000 - 93c912e100932384465ec740d144a94656f214d3 (2/28/14)
  1569434 - 93c912e100932384465ec740d144a94656f214d3 (2/18/14)
  1569285 - 93c912e100932384465ec740d144a94656f214d3 (2/18/14)
  1554795 - 93c912e100932384465ec740d144a94656f214d3 (1/2/14)
  1463979 - 93c912e100932384465ec740d144a94656f214d3 (1.5.3)
 
  The first column is the svn revision, the second column the hash of the
  output data and in the parenthesis is the date of the revision or the
  version.
 
  The change in the code which caused the difference happened in 1574524.
  I had a quick look there and couldn't see within a few minutes what
  caused the issue. I will probably again use a more systematic approach
  to find the exact change in that commit that causes the difference.
 
  Jörn
 
 
 





Re: [GSoC2015] OPENNLP-758

2015-03-05 Thread Joern Kottmann
Hello,

we got already two students for those two GSOC WSD tasks. They contacted
us a while ago (see the WSD thread on this list) and set up the tasks so
they can apply for it.

I am not sure if it makes much sense to break the WSD tasks further
down.

Do you have something else in mind you could work on? I hope it is still
possible to set up new GSOC tasks. Let me check that. And we would also
need more mentors.

HTH, 
Jörn

On Wed, 2015-03-04 at 10:41 +0530, Vidura Mudalige wrote:
 Hi all,
 
 I am Vidura, a third year Computer Science and Engineering undergraduate
 from University of Moratuwa. I'm very much interested in working with
 Apache OpenNLP project in GSoC 2015.
 
 I have worked in some open source projects. Also I have used Apache OpenNLP
 and Apache UIMA for some of my previous projects. Nowadays I am working in
 a open source project called WSO2 User Engagement Server.[1]
 
 I would like to resolve the issue OPENNLP-758.[2]. I cloned and
 successfully built the apache/opennlp.git.[3] I would like to know more
 details about the issue and expected deliverables.
 
 Thanks you.
 
 [1].https://github.com/wso2/product-ues/tree/dashboards-2.0
 [2].https://issues.apache.org/jira/browse/OPENNLP-758
 [3].https://github.com/apache/opennlp





Re: Word Sense Disambiguation

2015-02-16 Thread Joern Kottmann
On Mon, 2015-02-16 at 16:29 +0100, Aliaksandr Autayeu wrote:
 Jörn, to avoid ambiguity in case you addressed me to propose a WSD
 interface. I'd prefer Anthony to come up with a proposal, because he is
 closer to the multiple WSD algorithms that would be nice to include in the
 analysis.

Sorry, for being unclear, yes I addressed Anthony. But everybody who has
an opinion is very welcome to join the discussion or propose something.

Jörn



Parser performance bug

2015-02-16 Thread Joern Kottmann
Hi all,

the performance of the parser changed a bit. The output of the current
version in 1.6.0 RC2 is different from the output of the 1.5.3 release.
Even though there shouldn't be any difference as far as I can see.

The question of what caused that difference came up and I started to
bisect it.

Here are my results so far:
1655561 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (head)
1591889 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (5/2/14)
1576093 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f  (3/10/14)
1574819 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (3/6/14)
1574524 - 1fe53c0aeaae1eb978dbb83f34b13944f2692b1f (3/5/14)
1574505 - 93c912e100932384465ec740d144a94656f214d3 (3/5/14)
1573000 - 93c912e100932384465ec740d144a94656f214d3 (2/28/14)
1569434 - 93c912e100932384465ec740d144a94656f214d3 (2/18/14)
1569285 - 93c912e100932384465ec740d144a94656f214d3 (2/18/14)
1554795 - 93c912e100932384465ec740d144a94656f214d3 (1/2/14)
1463979 - 93c912e100932384465ec740d144a94656f214d3 (1.5.3)

The first column is the svn revision, the second column the hash of the
output data and in the parenthesis is the date of the revision or the
version.

The change in the code which caused the difference happened in 1574524.
I had a quick look there and couldn't see within a few minutes what
caused the issue. I will probably again use a more systematic approach
to find the exact change in that commit that causes the difference.

Jörn




Re: svn commit: r1655546 - in /opennlp/trunk/opennlp-tools: pom.xml src/test/java/opennlp/tools/ngram/ src/test/java/opennlp/tools/ngram/NGramModelTest.java src/test/resources/opennlp/tools/ngram/ src

2015-01-29 Thread Joern Kottmann
Or if that is a problem for the test, you could also tell RAT to ignore
it.

On my machine the test fails. The two strings don't match.

Jörn

On Thu, 2015-01-29 at 09:59 +0100, Tommaso Teofili wrote:
 right, thanks I'll fix both.
 
 Tommaso
 
 2015-01-29 9:54 GMT+01:00 Joern Kottmann kottm...@gmail.com:
 
  This file should have an AL header.
 
  Jörn
 
  On Thu, 2015-01-29 at 08:02 +, tomm...@apache.org wrote:
   Added:
  
  opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml
   URL:
  
  http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml?rev=1655546view=auto
  
  ==
   ---
  
  opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml
  (added)
   +++
  
  opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml
  Thu Jan 29 08:02:31 2015
   @@ -0,0 +1,58 @@
   +<?xml version="1.0" encoding="UTF-8"?>
   +<dictionary case_sensitive="false">
   +<entry count="1">
   +<token>brown</token>
   +<token>fox</token>
   +</entry>
 
 
 




Re: svn commit: r1655546 - in /opennlp/trunk/opennlp-tools: pom.xml src/test/java/opennlp/tools/ngram/ src/test/java/opennlp/tools/ngram/NGramModelTest.java src/test/resources/opennlp/tools/ngram/ src

2015-01-29 Thread Joern Kottmann
On Thu, 2015-01-29 at 08:02 +, tomm...@apache.org wrote:
 +String modelString = IOUtils.toString(nGramModelStream);
 +String outputString =
 out.toString(Charset.defaultCharset().name());

The XML serialization writes it in UTF-8. Shouldn't you use UTF-8 for
this test too instead of the default encoding?
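
Something like this (just a sketch) would make the comparison independent of
the platform encoding:

  String outputString = out.toString("UTF-8");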

Jörn



Re: svn commit: r1655546 - in /opennlp/trunk/opennlp-tools: pom.xml src/test/java/opennlp/tools/ngram/ src/test/java/opennlp/tools/ngram/NGramModelTest.java src/test/resources/opennlp/tools/ngram/ src

2015-01-29 Thread Joern Kottmann
It still fails in the assert. I didn't check but I guess the build
server has the same problem.

Jörn

On Thu, 2015-01-29 at 10:25 +0100, Tommaso Teofili wrote:
 even after my latest commit? If so I'll rearrange the test a bit.
 
 Tommaso
 
 2015-01-29 10:21 GMT+01:00 Joern Kottmann kottm...@gmail.com:
 
  Or if that is a problem for the test, you could also tell RAT to ignore
  it.
 
  On my machine the test fails. The two strings don't match.
 
  Jörn
 
  On Thu, 2015-01-29 at 09:59 +0100, Tommaso Teofili wrote:
   right, thanks I'll fix both.
  
   Tommaso
  
   2015-01-29 9:54 GMT+01:00 Joern Kottmann kottm...@gmail.com:
  
This file should have an AL header.
   
Jörn
   
On Thu, 2015-01-29 at 08:02 +, tomm...@apache.org wrote:
 Added:

   
  opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml
 URL:

   
  http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml?rev=1655546view=auto

   
  ==
 ---

   
  opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml
(added)
 +++

   
  opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml
Thu Jan 29 08:02:31 2015
 @@ -0,0 +1,58 @@
 +<?xml version="1.0" encoding="UTF-8"?>
 +<dictionary case_sensitive="false">
 +<entry count="1">
 +<token>brown</token>
 +<token>fox</token>
 +</entry>
   
   
   
 
 
 




Re: svn commit: r1655546 - in /opennlp/trunk/opennlp-tools: pom.xml src/test/java/opennlp/tools/ngram/ src/test/java/opennlp/tools/ngram/NGramModelTest.java src/test/resources/opennlp/tools/ngram/ src

2015-01-29 Thread Joern Kottmann
In those serialization tests I usually write the Object into a byte
buffer, create it again from the byte buffer and then compare the two
objects, instead of the binary representation.
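
Roughly like this (a sketch; it assumes NGramModel's equals() compares the
contained ngrams and counts):

  NGramModel model = new NGramModel();
  model.add(new StringList("brown", "fox"));

  ByteArrayOutputStream bytes = new ByteArrayOutputStream();
  model.serialize(bytes);

  NGramModel restored = new NGramModel(new ByteArrayInputStream(bytes.toByteArray()));
  Assert.assertEquals(model, restored);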

Could that solve the problem we have in this test?

Jörn 

On Thu, 2015-01-29 at 12:11 +0100, Tommaso Teofili wrote:
 I've just disabled that test, I'll fix it and re-enable it when done.
 
 Regards,
 Tommaso
 
 2015-01-29 10:51 GMT+01:00 Joern Kottmann kottm...@gmail.com:
 
  It still fails in the assert. I didn't check but I guess the build
  server has the same problem.
 
  Jörn
 
  On Thu, 2015-01-29 at 10:25 +0100, Tommaso Teofili wrote:
   even after my latest commit? If so I'll rearrange the test a bit.
  
   Tommaso
  
   2015-01-29 10:21 GMT+01:00 Joern Kottmann kottm...@gmail.com:
  
Or if that is a problem for the test, you could also tell RAT to ignore
it.
   
On my machine the test fails. The two strings don't match.
   
Jörn
   
On Thu, 2015-01-29 at 09:59 +0100, Tommaso Teofili wrote:
 right, thanks I'll fix both.

 Tommaso

 2015-01-29 9:54 GMT+01:00 Joern Kottmann kottm...@gmail.com:

  This file should have an AL header.
 
  Jörn
 
  On Thu, 2015-01-29 at 08:02 +, tomm...@apache.org wrote:
   Added:
  
 
   
  opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml
   URL:
  
 
   
  http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml?rev=1655546view=auto
  
 
   
  ==
   ---
  
 
   
  opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml
  (added)
   +++
  
 
   
  opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/ngram/ngram-model.xml
  Thu Jan 29 08:02:31 2015
   @@ -0,0 +1,58 @@
   +<?xml version="1.0" encoding="UTF-8"?>
   +<dictionary case_sensitive="false">
   +<entry count="1">
   +<token>brown</token>
   +<token>fox</token>
   +</entry>
 
 
 
   
   
   
 
 
 




Re: svn commit: r1655238 - /opennlp/trunk/

2015-01-28 Thread Joern Kottmann
You didn't remove any entries in your recent commit to them.

We moved the main pom.xml from the opennlp folder to the root of the
project. Now using eclipse with m2e creates the project files there and
I thought it would be nice to have them in svn ignore.

Maybe it is possible to consolidate the various svn ignore files into
one at the root level.

Jörn

On Wed, 2015-01-28 at 09:53 +0100, Tommaso Teofili wrote:
 I guess I by error removed the previous ignore entries for the Eclipse
 files, sorry for the inconvenience.
 
 Tommaso
 
 2015-01-28 9:48 GMT+01:00 jo...@apache.org:
 
  Author: joern
  Date: Wed Jan 28 08:48:25 2015
  New Revision: 1655238
 
  URL: http://svn.apache.org/r1655238
  Log:
  Added eclipse files to svn:ignore.
 
  Modified:
  opennlp/trunk/   (props changed)
 
  Propchange: opennlp/trunk/
 
  --
  --- svn:ignore (original)
  +++ svn:ignore Wed Jan 28 08:48:25 2015
  @@ -1,2 +1,6 @@
   *.iml
   .idea
  +
  +.settings
  +
  +.project
 
 
 




Re: Word Sense Disambiguation

2015-01-19 Thread Joern Kottmann
Hello,

+1 from me to just go ahead and implement the proposed approach. One
goal of this implementation will be to figure out the interface we want
to have in OpenNLP for WSD.

We can later extend OpenNLP with more implementations which are taking
different approaches.

Jörn

On Thu, 2015-01-15 at 16:50 +0900, Anthony Beylerian wrote:
 Hello, 
 
 I'm new here, I previously mentioned to Jörn about my colleagues and myself 
 being interested in helping to implement this component, we were thinking of 
 starting with simple knowledge based approaches, although they do not yield 
 high accuracy, but as a first step they are relatively simple, would like 
 your opinion.
 
 Pei also mentioned cTAKES 
 (http://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-wsd/ currently very 
 exploratory stages here) and YTEX 
 (https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08) is also 
 just exploring WSD for the healthcare domain. It's also currently 
 knowledge/ontology base for now... It would be great to see if OpenNLP 
 supports a general domain WSD
 
 Best, 
 
 Anthony
 




Build changed opennlp/pom.xml moved to root directory

2014-11-20 Thread Joern Kottmann
Hello everybody,

we changed the structure of the project slightly. The main pom.xml used
to be located in opennlp/pom.xml. This was done because an Eclipse
workspace can't have files at the root level. The Maven convention is to
have the file at the root level. I think it is time to move this file to
the root directory to not anymore confuse Maven users (and maybe some
tools) which expect the file in the root directory.

Please let me know if there are any objections to this.

To build OpenNLP from now on just go to the trunk directory and type mvn
install.

Jörn



Re: 1.6.0 maven repo

2014-11-19 Thread Joern Kottmann
Hello,

yes, that should be the current state.

Can you please elaborate on the issue you have.
Do you get an old version?

We should try to make a release of 1.6.0. I think most issues
are already solved, and the remaining bugs will be uncovered during the
manual testing phase. 

Jörn

On Wed, 2014-11-19 at 21:20 +0100, Rodrigo Agerri wrote:
 Hi
 
 Any chance of publishing snapshot artifacts to Maven Central, or to an
 Apache snapshots repo?
 
 It would make using the current trunk via the API much easier.
 
 Cheers
 
 Rodrigo
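
For reference, a hedged sketch of how a consumer pom.xml could pull such a
snapshot once it is deployed, assuming the artifacts would go to the Apache
snapshots repository and carry a 1.6.0-SNAPSHOT version (both are
assumptions, not published coordinates):

    <!-- enable the Apache snapshots repository (assumed URL) -->
    <repositories>
      <repository>
        <id>apache.snapshots</id>
        <url>https://repository.apache.org/content/repositories/snapshots/</url>
        <snapshots>
          <enabled>true</enabled>
        </snapshots>
      </repository>
    </repositories>

    <!-- depend on the assumed snapshot version of opennlp-tools -->
    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-tools</artifactId>
      <version>1.6.0-SNAPSHOT</version>
    </dependency>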




Re: Need to speed up the model creation process of OpenNLP

2014-11-19 Thread Joern Kottmann
The training runtime scales almost linearly with the number of cores your
CPU has. If you have a 4-core CPU you might come down
from 3 hours to 1 hour.

To enable it you need to train with the -params argument and provide
a config file for the learner. There are samples shipped with OpenNLP.

Jörn
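
A sketch of what this could look like, assuming the usual training
parameters file keys and the TokenNameFinderTrainer CLI flags (check the
samples shipped with your OpenNLP version; train.params and the data/model
file names below are just placeholders):

    # train.params - learner configuration enabling multi-threaded training
    Algorithm=MAXENT
    Iterations=100
    Cutoff=5
    Threads=4

    # pass it to the trainer via -params
    bin/opennlp TokenNameFinderTrainer -params train.params \
        -lang en -data train.txt -model en-ner-custom.bin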

On Wed, 2014-11-19 at 20:19 +, nikhil jain wrote:
 Hi Rodrigo,
 No, I am not using multi-threading; it's a simple Java program based on the 
 OpenNLP documentation. Worth mentioning here: because the corpus contains 
 4 million records, my Java program running in Eclipse frequently hit a Java 
 heap space issue (out of memory). I investigated a bit and found that the 
 process was taking around 10 GB of memory to build the model, so I increased 
 the heap to 10 GB using the -Xmx parameter. It then worked properly but took 
 3 hours.
 Thanks, Nikhil
   From: Rodrigo Agerri rage...@apache.org
  To: dev@opennlp.apache.org dev@opennlp.apache.org; nikhil jain 
 nikhil_jain1...@yahoo.com 
 Cc: us...@opennlp.apache.org us...@opennlp.apache.org 
  Sent: Wednesday, November 19, 2014 2:17 AM
  Subject: Re: Need to speed up the model creation process of OpenNLP

 Hi,
 
 Are you using multithreading, lots of threads, RAM memory?
 
 R
 
 
 
 
 On Tue, Nov 18, 2014 at 5:46 PM, nikhil jain
 nikhil_jain1...@yahoo.com.invalid wrote:
  Hi,
 I asked the question below yesterday; did anyone get a chance to look at it?
 I am new to OpenNLP and really need some help. Please provide a clue, a 
 link, or an example.
 Thanks, Nikhil
   From: nikhil jain nikhil_jain1...@yahoo.com.INVALID
   To: us...@opennlp.apache.org us...@opennlp.apache.org; Dev at Opennlp 
  Apache dev@opennlp.apache.org
   Sent: Tuesday, November 18, 2014 12:02 AM
   Subject: Need to speed up the model creation process of OpenNLP
 
  Hi,
 I am using the OpenNLP Token Name Finder for parsing unstructured data. I 
 have created a corpus of about 4 million records. When I create a model 
 from the training set using the OpenNLP APIs in Eclipse with the default 
 settings (cutoff 5 and 100 iterations), the process takes a good amount of 
 time, around 2-3 hours.
 Can someone suggest how I can reduce the time? I want to experiment with 
 different iteration counts, but because model creation takes so long I am 
 not able to. This is really a time consuming process.
 Please provide some feedback.
 Thanks in advance. Nikhil Jain
 
 
 
   




Re: Jenkins build is back to normal : OpenNLP_java8 #2

2014-10-29 Thread Joern Kottmann
Hello,

I added an OpenNLP Java 8 build to the build server.
This will hopefully inform us about problems with Java 8 in the future.

Jörn

On Wed, 2014-10-29 at 20:25 +, Apache Jenkins Server wrote:
 See https://builds.apache.org/job/OpenNLP_java8/2/
 




What should we do with the SF models?

2014-10-28 Thread Joern Kottmann
Hi all,

OpenNLP always came with a couple of trained models which were ready to
use for a few languages. The performance a user encounters with those
models heavily depends on their input text.

Especially the English name finder models which were trained on MUC 6/7
data perform very poorly these days if run on current news articles and
even worse on data which is not in the news domain.

Anyway, we often get judged on how well OpenNLP works just based on the
performance of those models (or maybe people who compare their NLP
systems against OpenNLP just love to have OpenNLP perform badly).

I think we are now at a point with those models where it is questionable
whether having them is still an advantage for OpenNLP. The SourceForge page
is often blocked due to traffic limitations. We definitely have to act
somehow.

The old models have definitely some historic value and are used for
testing the release.

What should we do?

We could take them offline and advise our users to train their own
models on one of the various corpora we support. We could also do both:
place a prominent link to our corpora documentation on the download
page and, in a less visible place, a link to the historic SF models.

Jörn



Re: Build failed in Jenkins: OpenNLP #476

2014-10-27 Thread Joern Kottmann
On Mon, 2014-10-27 at 19:15 +, Rodrigo Agerri wrote:
 Hi,
 
 This is not caused by my latest commit, is it not?

Your last commit just triggered the build.
The build itself was successful. It failed afterwards when it tried to
deploy the artifacts to the snapshot repo with: 503 Service Temporarily
Unavailable

It probably works if we trigger it.

Jörn



Re: TokenNameFinder and Span probs

2014-05-07 Thread Joern Kottmann
Hello Mark,

+1 for your second solution. I believe that is much more intuitive than
calling a method afterwards to retrieve the prob for a Span.
It is easier to use because the prob is delivered as part of the result, and
no user action is required to obtain it.

We could use this solution everywhere where a span gets returned.

Jörn



On Wed, May 7, 2014 at 2:18 AM, Mark G giaconiam...@gmail.com wrote:

 I am currently working on a project in which we are using NER to pass
 toponyms into the GeoEntityLinker addon for geotagging, and I am passing the
 locations, entities, and other info on to SOLR for indexing. Over the
 years I have noticed that the TokenNameFinder interface does not include
 all the probs() methods that the NameFinderME has, and furthermore the Span
 object does not have a double field for storing a prob for itself.  Also
 the sentenceDetector has a method called getSentenceProbabilities rather
 than probs().
 When I pass the Spans into the GeoEntityLinker/EntityLinker I can't get the
 probs anymore because they are not in the Span objects. I can always extend
 Span and add the field, or keep a 2D array of the probs for each sentence,
 but wanted to see what everyone thinks about
 1. adding the probs methods to the TokenNameFinder interface
 2. adding a prob field to Span (a double)
 3. Having the NameFinder return the prob with each Span so it doesn't have
 to be set after the call to find() using the double[] of probs
 4. Have the sentencedetectorME return its spans with a prob, add probs()
 method to the SentenceDetector interface, and deprecate the
 getSentenceProbabilities...

 Thoughts?
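
To make the options concrete, a minimal sketch of option 2/3 (a hypothetical
class, not the existing OpenNLP API), where the prob travels with the span
returned by find():

    // Hypothetical: a Span variant that carries its own probability, so no
    // separate probs() call is needed after find().
    public class ProbSpan {

        private final int start;
        private final int end;
        private final String type;
        private final double prob; // set by the name finder when the span is created

        public ProbSpan(int start, int end, String type, double prob) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.prob = prob;
        }

        public int getStart() { return start; }
        public int getEnd() { return end; }
        public String getType() { return type; }

        // The prob is part of the result; no user action is required to obtain it.
        public double getProb() { return prob; }
    }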



Re: Please review the 1.5.3 release announcement

2013-04-17 Thread Joern Kottmann
Yes, we are ready, everything is done. Let's send the announcement.

Jörn


On Wed, Apr 17, 2013 at 2:44 PM, William Colen william.co...@gmail.comwrote:

 Jörn, thank you for updating the web site. I already added a news item. Now
 are we ready to send the announcement?



 On Mon, Apr 15, 2013 at 6:52 PM, Jörn Kottmann kottm...@gmail.com wrote:

  +1, let's wait until we have updated the website, the distributables are
  mirrored and the Maven artifacts are
  available.
 
  I already promoted the maven repo and pushed the release to the dist
 area,
  but it might take a bit until everything
  is available, the mirrors might need 24 hours to mirror the
 distributables.
 
  Jörn
 
 
  On 04/15/2013 09:46 PM, William Colen wrote:
 
Hello,
 
  Please review the release announcement for the OpenNLP version 1.5.3.
 
  https://cwiki.apache.org/confluence/display/OPENNLP/ReleasePlanAndTasks1.5.3
 
 
  The announce mails will be sent to the users and announce@apache lists.
 
  Thank you,
  William
 
 
 


