Re: new tool training

2016-11-09 Thread Cohan Sujay Carlos
Joern,

Regarding the deprecation of GIS, I noticed that OpenNLP had been
refactored between 1.5.2 and 1.6.0 to move the constant MAXENT_VALUE
(which is used to select a MaxEnt classifier) from TrainUtil
(opennlp.model.TrainUtil) to the GIS class.

Since MAXENT_VALUE had been moved to GIS, I put the corresponding constant
for the Naive Bayes classifier (NAIVE_BAYES_VALUE) into the
NaiveBayesTrainer class.

So, if you're moving constants back to something like the TrainUtil, you
may want to move the NAIVE_BAYES_VALUE over too.
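
For reference, here is a minimal sketch of how these constants end up
being used to select the classifier (written from memory against the
1.6.0 API, so treat it as a sketch; the cutoff value is just a
placeholder):

  // assumed imports: opennlp.tools.util.TrainingParameters,
  // opennlp.tools.ml.TrainerFactory, opennlp.tools.ml.EventTrainer,
  // opennlp.tools.ml.naivebayes.NaiveBayesTrainer
  TrainingParameters params = new TrainingParameters();
  params.put(TrainingParameters.ALGORITHM_PARAM,
      NaiveBayesTrainer.NAIVE_BAYES_VALUE); // or GIS.MAXENT_VALUE
  params.put(TrainingParameters.CUTOFF_PARAM, "5");
  EventTrainer trainer =
      TrainerFactory.getEventTrainer(params.getSettings(), null);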

Cohan Sujay Carlos

On Mon, Nov 7, 2016 at 2:44 PM, Joern Kottmann  wrote:

> I also opened a Jira now for the refactoring effort:
> https://issues.apache.org/jira/browse/OPENNLP-880
>
> Jörn
>
> On Sun, Nov 6, 2016 at 4:41 PM, Joern Kottmann  wrote:
>
> > On Fri, 2016-11-04 at 12:58 +, Russ, Daniel (NIH/CIT) [E] wrote:
> > > Hi Jörn,
> > >Currently, GIS is a wrapper for the GISTrainer, which ironically
> > > is NOT an EventTrainer.  There are several ways to handle this.
> > >
> > > First, don’t touch GIS.  This is the class that people know best.  It
> > > is easy to use if you use the parameters hard-coded in the class.  We
> > > can make GISTrainer an EventTrainer and refactor it appropriately to
> > > handle parameters in the map.
> > >
> > > Second, refactor GIS, but keep the static methods in the code. This
> > > allows users to keep their code working.
> > >
> > > Third, refactor GIS and deprecate the static methods.
> > >
> > > I don’t see any value to the GIS class other than as an Adapter.  My
> > > vote would be to fix the original problem of GISTrainer NOT being an
> > > EventTrainer.
> >
> >
> > I think we are lucky that GISTrainer is not a public class, so that
> > allows us to modify it as we need. I agree with you, we should make it
> > an EventTrainer.
> >
> > The GIS class should probably just be deprecated and then removed
> > later.
> >
> > Regards,
> > Jörn
> >
>


Re: new tool training

2016-10-31 Thread Russ, Daniel (NIH/CIT) [E]
Ok, I will send you a patch with a refactored GIS (can you find the jira number?). 
If we refactor DataIndexer, I have no problem using a factory. I am happy to 
extend OnePassDataIndexer; I won’t have to copy code.  I will have to look at 
DataIndexer again before I comment on specifics.
Daniel


On 10/29/16, 8:45 AM, "Joern Kottmann"  wrote:

On Fri, 2016-10-28 at 14:16 +, Russ, Daniel (NIH/CIT) [E] wrote:
> Hi Jörn,
> 1) I agree that the field values should be set in the init method for
> the QNTrainer.  Other minor changes I would make include adding a
> getNumberOfThreads() method to AbstractTrainer, and a default of
> 1.  I would modify GIS to use the value set in the
> TrainingParameters.  I also think this needs to be documented more
> clearly.  I can take a stab at some text and send it out to the
> community.

Sounds good.

I had a look at the code and it looks like it was never refactored to
fit properly into the new structure we created. I think the multi-
threaded GIS training is actually working, but I couldn't see how the
parameter is hooked up to the training code.

Suggestion: we open a jira issue to refactor the GIS training to make
use of the init method, have all variables externalized (e.g. smoothing
factor, printing debug output to the console, etc.), and do whatever
other cleanup we think makes sense.

The problem with GIS is that people take it as an example of how things
should be done when they implement a new trainer.

Another issue is that we should have some validation of parameters,
e.g. threads can only be one for perceptron.

> 2) The One/TwoPassDataIndexers are very rigid classes because they do
> the indexing in the constructor.  The trainers deal with the rigidity
> by overloading methods that take a cutoff/sort/#Threads.  In my
> opinion, that was not the best way to do it.  There should have been
> a constructor and an index method.  At this point it may be a large
> refactoring effort for little gain.

Yes, I agree with the first part. I don't think it will be so much
effort to add a new constructor and the index method. The old
constructors can stay and be deprecated until we remove them.
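
Roughly this shape, as a sketch (hypothetical signatures, not the
current API):

  // today: the expensive indexing work happens inside the constructor
  DataIndexer indexer = new OnePassDataIndexer(events, cutoff, sort);

  // proposed: a cheap constructor plus an explicit index step, so a
  // subclass can adjust its state before the indexing actually runs
  OnePassDataIndexer indexer2 = new OnePassDataIndexer(cutoff, sort);
  indexer2.index(events); // does the work the constructor used to do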

Should we open a jira issue for this task?

> As of 1.6.0, I see that there are 4 EventTrainers (GIS,
> NaiveBayesTrainer, PerceptronTrainer, QNTrainer).  All of these are
> AbstractEventTrainers.  So if we add a public MaxentModel
> train(DataIndexer di) to both the interface and the abstract class,
> nothing should break.  Now, if you are concerned: the
> train(ObjectStream<Event>) method calls doTrain(DataIndexer), so the
> behavior is exactly the same for the two methods; I just call
> doTrain(DataIndexer).  This would not be used in the 1.5.x versions.
> 
> I don’t like the idea of using a factory because the DataIndexer
> requires you to pass parameters into the constructor.  It may be
> possible to use reflection, but why make life so difficult.  Let the
> client give the trainer the DataIndexer.

We use this kind of factory in various places already; I think it would
be a good solution, because then people can switch the Data Indexer
without writing training code, and it will work with all components we
have out of the box. I agree it would only work properly if we do the
refactoring.

In OPENNLP-830 someone said the TwoPassDataIndexer could be implemented
with better performance these days using non-blocking IO. I don't think
it makes sense to change the current implementation, but it would be
nice to add an improved version as a third option. If we had pluggable
Data Indexer support, this idea could easily be explored.

This could be done exactly like it is done for the trainer in
TrainerFactory.getEventTrainer.
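
As a sketch (the "DataIndexer" parameter name and the reflection-based
lookup are assumptions, and the no-arg constructor only exists after the
refactoring discussed above):

  // assumed imports: java.util.Map, opennlp.tools.ml.model.DataIndexer,
  // opennlp.tools.ml.model.OnePassDataIndexer
  public static DataIndexer getDataIndexer(Map<String, String> trainParams)
      throws ReflectiveOperationException {
    String className = trainParams.get("DataIndexer"); // hypothetical key
    if (className == null) {
      // fall back to the built-in default when nothing is configured
      return new OnePassDataIndexer();
    }
    return (DataIndexer) Class.forName(className)
        .getDeclaredConstructor().newInstance();
  }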

And we also need a jira for this.

> In my not-so-expert opinion, I think adding a train(DataIndexer)
> method to EventTrainer is the best way forward.

We might have the case that certain trainers don't support the
Data Indexer, for example deep learning libraries, but for them we
probably have to add a new trainer type anyway.

Would this also work when we refactor the Data Indexer? Then a user
would create a Data Indexer instance, call the index method, and pass
it in. Should be fine.
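
To make the proposal concrete, a sketch of the addition (the validation
is modeled on what the ObjectStream-based train method already does;
none of this is committed API yet):

  // added to the EventTrainer interface:
  MaxentModel train(DataIndexer indexer) throws IOException;

  // implementation in AbstractEventTrainer, validating before training
  // just like train(ObjectStream<Event>) does today:
  public MaxentModel train(DataIndexer indexer) throws IOException {
    if (!isValid()) {
      throw new IllegalArgumentException("trainParams are not valid!");
    }
    return doTrain(indexer);
  }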

I have time I can dedicate to help out with refactoring things.

Jörn




Re: new tool training

2016-10-27 Thread Joern Kottmann
On Thu, 2016-10-27 at 16:04 +, Russ, Daniel (NIH/CIT) [E] wrote:
> Is it important to calculate the hash of all events?

I missed that question. No, this is included for debugging purposes only;
with the hash it is possible to see if two models have been trained from
exactly the same source with identical feature generation. I used this
a lot to debug issues between OpenNLP versions.
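
The idea is roughly the following (just a sketch; the actual
implementation in HashSumEventStream may differ in the details):

  // given an ObjectStream<Event> events; assumed imports:
  // java.math.BigInteger, java.nio.charset.StandardCharsets,
  // java.security.MessageDigest
  MessageDigest digest = MessageDigest.getInstance("MD5");
  Event event;
  while ((event = events.read()) != null) {
    // identical training data plus identical feature generation
    // produces an identical digest across runs
    digest.update(event.toString().getBytes(StandardCharsets.UTF_8));
  }
  BigInteger hashSum = new BigInteger(1, digest.digest());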

Jörn


Re: new tool training

2016-10-27 Thread Joern Kottmann
On Thu, 2016-10-27 at 16:04 +, Russ, Daniel (NIH/CIT) [E] wrote:
> Hello,
> 
>    Okay, I found why my toy worked.  I call
> AbstractEventTrainer.doTrain(DataIndexer) as opposed to
> AbstractEventTrainer.train(ObjectStream<Event>).   The train method
> calls isValid(). That sets the value of threads in QNTrainer.
> 
>    Thank you for making me do this.  I don’t think doTrain() should
> be exposed anymore.  A new method train(DataIndexer) that calls
> isValid and then doTrain(indexer) is probably a better idea.  Is it
> important to calculate the hash of all events?


This should really be done in the init method and not in isValid. What do
you think?

The init method was added for the pluggable ML support in 1.6.0, and I
believe I simply didn't see back then that this should be moved there.
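
A sketch of what I mean (assuming the Map-based init signature from the
1.6.0 pluggable ML support and the getIntParam helper on AbstractTrainer,
both from memory; THREADS_PARAM/THREADS_DEFAULT as defined in QNTrainer):

  // in QNTrainer:
  @Override
  public void init(Map<String, String> trainParams,
      Map<String, String> reportMap) {
    super.init(trainParams, reportMap);
    // set the field here rather than in isValid(), so it is initialized
    // no matter which entry point (train or doTrain) the caller uses
    this.threads = getIntParam(THREADS_PARAM, THREADS_DEFAULT);
  }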

Jörn


Re: new tool training

2016-10-27 Thread Joern Kottmann
On Thu, 2016-10-27 at 15:49 +, Russ, Daniel (NIH/CIT) [E] wrote:
> 
> Comment 2:
> Do you have a preference where the variable should go?  I think
> AbstractTrainer is the appropriate place for PSF variables dealing
> with ALL trainers, so Threads_(P/D) should be there.  I would remove
> them from TrainingParams and refactor accordingly.

TrainingParameters is the class which parses the passed-in params
file. There it has to know about "Algorithm"; all the others are
specific to the trainer implementation.

I think AbstractTrainer is probably a good place for PSF variables
which deal with many/most trainers.


> Comment 3:
> Right, I want to change the DataIndexer.
> 
> So I have multiple models that classify data (Job descriptions) into
> Occupational Codes.  I know what the codes are a priori, and even if
> they are not in the training data, I need to make sure that there is
> SOME probability for the codes.  More importantly for each job
> description, I need to compare the probabilities returned for each
> output.  By forcing the output indices to have the same values, I can
> quickly compare them without re-mapping the output.
> 
> I tried to extend OnePassDataIndexer, but the indexing occurs during
> object construction, so I cannot set the known outputs before
> indexing occurs.  
> 
> Of course I would not need the getDataIndexer() method, but it is
> defined in the Abstract class; why not in the Interface?


The thing is that with the current interface we can support
implementations which don't use the Data Indexer. This can be the case
when an implementation relies on external machine learning libraries.
Since 1.6.0 we have pluggable ML support.

I looked closer now: getDataIndexer is a factory method for the
Data Indexer. Maybe it would make sense to allow specifying a custom
class for data indexing as part of the training parameters? Then the
trainers which use the Data Indexer could just support that mechanism.

Jörn


Re: new tool training

2016-10-27 Thread Russ, Daniel (NIH/CIT) [E]
Hello,

   Okay, I found why my toy worked.  I call 
AbstractEventTrainer.doTrain(DataIndexer) as opposed to 
AbstractEventTrainer.train(ObjectStream<Event>).   The train method calls 
isValid(). That sets the value of threads in QNTrainer.

   Thank you for making me do this.  I don’t think doTrain() should be exposed 
anymore.  A new method train(DataIndexer) that calls isValid and then 
doTrain(indexer) is probably a better idea.  Is it important to calculate the 
hash of all events?

Daniel


On 10/27/16, 11:49 AM, "Russ, Daniel (NIH/CIT) [E]"  wrote:

Comment 1:
Here is my Stacktrace:

Exception in thread "main" java.lang.IllegalArgumentException: Number of 
threads must 1 or larger
at 
opennlp.tools.ml.maxent.quasinewton.ParallelNegLogLikelihood.<init>(ParallelNegLogLikelihood.java:50)
at 
opennlp.tools.ml.maxent.quasinewton.QNTrainer.trainModel(QNTrainer.java:165)
at 
opennlp.tools.ml.maxent.quasinewton.QNTrainer.doTrain(QNTrainer.java:152)
at 
opennlp.tools.ml.maxent.quasinewton.QNTrainer.doTrain(QNTrainer.java:1)
at gov.nih.cit.test.smlm.SOCcerME.train(SOCcerME.java:131)
at gov.nih.cit.test.smlm.SOCcerME.main(SOCcerME.java:310)

So I threw together a small toy to get the same error.  It worked…  Let me 
track this down some more.



Re: new tool training

2016-10-27 Thread Russ, Daniel (NIH/CIT) [E]
Comment 1:
Here is my Stacktrace:

Exception in thread "main" java.lang.IllegalArgumentException: Number of 
threads must 1 or larger
at 
opennlp.tools.ml.maxent.quasinewton.ParallelNegLogLikelihood.<init>(ParallelNegLogLikelihood.java:50)
at 
opennlp.tools.ml.maxent.quasinewton.QNTrainer.trainModel(QNTrainer.java:165)
at 
opennlp.tools.ml.maxent.quasinewton.QNTrainer.doTrain(QNTrainer.java:152)
at 
opennlp.tools.ml.maxent.quasinewton.QNTrainer.doTrain(QNTrainer.java:1)
at gov.nih.cit.test.smlm.SOCcerME.train(SOCcerME.java:131)
at gov.nih.cit.test.smlm.SOCcerME.main(SOCcerME.java:310)

So I threw together a small toy to get the same error.  It worked…  Let me 
track this down some more.


Comment 2:
Do you have a preference where the variable should go?  I think AbstractTrainer 
is the appropriate place for PSF variables dealing with ALL trainers, so 
Threads_(P/D) should be there.  I would remove them from TrainingParams and 
refactor accordingly.

Comment 3:
Right, I want to change the DataIndexer.

So I have multiple models that classify data (Job descriptions) into 
Occupational Codes.  I know what the codes are a priori, and even if they are not 
in the training data, I need to make sure that there is SOME probability for 
the codes.  More importantly for each job description, I need to compare the 
probabilities returned for each output.  By forcing the output indices to have 
the same values, I can quickly compare them without re-mapping the output.

I tried to extend OnePassDataIndexer, but the indexing occurs during object 
construction, so I cannot set the known outputs before indexing occurs.  

Of course I would not need the getDataIndexer() method, but it is defined in 
the Abstract class; why not in the Interface?
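
For illustration, this is the kind of code I would like to be able to
write (an entirely hypothetical API along the lines of the
constructor/index split discussed above; setPredefinedOutcomes and the
occupational codes are made up):

  // fix the outcome label set up front so the outcome indices line up
  // across all of the models being compared
  OnePassDataIndexer indexer = new OnePassDataIndexer(cutoff, /* sort */ true);
  indexer.setPredefinedOutcomes(
      Arrays.asList("11-1011", "13-2011", "29-1141")); // hypothetical
  indexer.index(events);
  MaxentModel model = trainer.train(indexer); // proposed train(DataIndexer)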


On 10/27/16, 11:15 AM, "Joern Kottmann"  wrote:

On Thu, Oct 27, 2016 at 4:41 PM, Russ, Daniel (NIH/CIT) [E] <
dr...@mail.nih.gov> wrote:

> Hello,
>
> Background:
>I am developing a tool that uses OpenNLP.  I have a model that extends
> BaseModel, and several AbstractModels.  I allow the user (myself) to
> specify the TrainerType (GIS/QN) for each model by using a list of
> TrainingParameters.
>
> Potential Bugs:
>
> 1)Whenever I use QNTrainer, I get an error (number of threads <1).  I
> think the problem is that the parameters are initialized in the isValid()
> method instead of the init() method.  This works for GIS because in the
> doTrain(DataIndexer) method, the number of threads is a local variable
> taken from the TrainingParameters, not a field in GIS.  This leads to
> another question. When is the isValid() method supposed to be called?  I am
> surprised that the TrainerFactory does not call it.
>
>
It should be called from the factory, I think. Currently it is only
called when the training starts in
AbstractEventTrainer.train(ObjectStream<Event>). It is always better to
make things fail as early as possible.

Can you share the exception stack trace? I don't really understand yet why
you get this error with the QNTrainer. I would like to investigate that.

   


>
> 2)The psf (public static final) String variables used by the
> TrainingParameters are all over the place.  The variables
> THREADS_(PARAM/DEFAULT) are defined in both QNTrainer and
> TrainingParameters.  It should be defined in only one of those places. I am not
> sure that AbstractTrainer isn’t the best place to put THREADS_(P/D).  It
> isn’t just the variables Threads_(P/D); all the Training psf String
> variables from TrainingParameters are duplicated in AbstractTrainer.
>
>
I agree, the commonly used variables should only be in one place. Some
trainers have specific variables which are not shared. It would be nice to
get this refactored.



>
> 3)Should the Interface EventTrainer have a doTrain(DataIndexer) and a
> getDataIndexer method?  This is important to me because I extended
> OnePassDataIndexer to pre-assign the outputs.  I know the outputs a priori,
> and I want to quickly combine the results of the multiple models.  Since the
> getEventTrainer returns an EventTrainer instead of an 
AbstractEventTrainer,
> I cannot call doTrain(DataIndexer).  I cannot use the
> doTrain(ObjectStream); it creates a new OnePassDataIndexer.
>


I think it would be fine to add a second train method to EventTrainer which
takes a DataIndexer. The current train method should probably be changed a
bit, and not do the init things.
I would like to understand your usage here a bit better.

So you want to have control over the DataIndexer which is used for
training, right?
Another option could be to have a second train(ObjectStream,
DataIndexer) method.

And why would you need a getDataIndexer method if you can pass in your own
instance?

Re: new tool training

2016-10-27 Thread Joern Kottmann
On Thu, Oct 27, 2016 at 4:41 PM, Russ, Daniel (NIH/CIT) [E] <
dr...@mail.nih.gov> wrote:

> Hello,
>
> Background:
>I am developing a tool that uses OpenNLP.  I have a model that extends
> BaseModel, and several AbstractModels.  I allow the user (myself) to
> specify the TrainerType (GIS/QN) for each model by using a list of
> TrainingParameters.
>
> Potential Bugs:
>
> 1)Whenever I use QNTrainer, I get an error (number of threads <1).  I
> think the problem is that the parameters are initialized in the isValid()
> method instead of the init() method.  This works for GIS because in the
> doTrain(DataIndexer) method, the number of threads is a local variable
> taken from the TrainingParameters, not a field in GIS.  This leads to
> another question. When is the isValid() method supposed to be called?  I am
> surprised that the TrainerFactory does not call it.
>
>
It should be called from the factory, I think. Currently it is only
called when the training starts in
AbstractEventTrainer.train(ObjectStream<Event>). It is always better to
make things fail as early as possible.

Can you share the exception stack trace? I don't really understand yet why
you get this error with the QNTrainer. I would like to investigate that.


>
> 2)The psf (public static final) String variables used by the
> TrainingParameters are all over the place.  The variables
> THREADS_(PARAM/DEFAULT) are defined in both QNTrainer and
> TrainingParameters.  It should be defined in only one of those places. I am not
> sure that AbstractTrainer isn’t the best place to put THREADS_(P/D).  It
> isn’t just the variables Threads_(P/D); all the Training psf String
> variables from TrainingParameters are duplicated in AbstractTrainer.
>
>
I agree, the commonly used variables should only be in one place. Some
trainers have specific variables which are not shared. It would be nice to
get this refactored.



>
> 3)Should the Interface EventTrainer have a doTrain(DataIndexer) and a
> getDataIndexer method?  This is important to me because I extended
> OnePassDataIndexer to pre-assign the outputs.  I know the outputs a priori,
> and I want to quickly combine the results of the multiple models.  Since the
> getEventTrainer returns an EventTrainer instead of an AbstractEventTrainer,
> I cannot call doTrain(DataIndexer).  I cannot use the
> doTrain(ObjectStream); it creates a new OnePassDataIndexer.
>


I think it would be fine to add a second train method to EventTrainer which
takes a DataIndexer. The current train method should probably be changed a
bit, and not do the init things.
I would like to understand your usage here a bit better.

So you want to have control over the DataIndexer which is used for
training, right?
Another option could be to have a second train(ObjectStream,
DataIndexer) method.

And why would you need a getDataIndexer method if you can pass in your own
instance?


Jörn