Hi Jörn,

1) I agree that the field values should be set in the init method for the QNTrainer. Other minor changes I would make include adding a getNumberOfThreads() method to AbstractTrainer, with a default of 1, and modifying GIS to use the value set in the TrainingParameters. I also think this needs to be documented more clearly. I can take a stab at some text and send it out to the community.
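To make the proposal concrete, here is a minimal sketch of the suggested getNumberOfThreads() helper on AbstractTrainer. The "Threads" key name and the Map-based parameter store are stand-ins for illustration; the real TrainingParameters API may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed AbstractTrainer addition.
abstract class AbstractTrainerSketch {

    // stand-in for the TrainingParameters settings (assumption)
    protected Map<String, String> trainParams = new HashMap<>();

    // Proposed accessor: read the thread count from the training
    // parameters, defaulting to 1 when the key is absent or malformed.
    public int getNumberOfThreads() {
        String value = trainParams.get("Threads");
        if (value == null) {
            return 1; // proposed default
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            return 1;
        }
    }
}
```

A trainer such as GIS could then call getNumberOfThreads() instead of taking the thread count through an overloaded train method.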
2) The One/TwoPassDataIndexers are very rigid classes because they do the indexing in the constructor. The trainers deal with that rigidity by overloading methods that take a cutoff/sort flag/number of threads. In my opinion, that was not the best way to do it; there should have been a constructor and a separate index method. At this point it may be a large refactoring effort for little gain.

As of 1.6.0, I see that there are 4 EventTrainers (GIS, NaiveBayesTrainer, PerceptronTrainer, QNTrainer). All of these are AbstractEventTrainers. So if we add a public MaxentModel train(DataIndexer di) to both the interface and the abstract class, nothing should break. If you are concerned: the train(ObjectStream<Event>) method calls doTrain(DataIndexer), so the behavior is exactly the same for the two methods; the new method would just call doTrain(DataIndexer) directly. This would not be available in the 1.5.x versions.

I don't like the idea of using a factory, because the DataIndexer requires you to pass parameters into the constructor. It may be possible to use reflection, but why make life so difficult? Let the client give the trainer the DataIndexer. In my not-so-expert opinion, adding a train(DataIndexer) method to EventTrainer is the best way forward.

Daniel

=============

On 10/27/16, 3:17 PM, "Joern Kottmann" <kottm...@gmail.com> wrote:

On Thu, 2016-10-27 at 16:04 +0000, Russ, Daniel (NIH/CIT) [E] wrote:
> Hello,
>
> Okay, I found why my toy worked. I call
> AbstractEventTrainer.doTrain(DataIndexer) as opposed to
> AbstractEventTrainer.train(ObjectStream<Event>). The train method
> calls isValid(), which sets the value of threads in QNTrainer.
>
> Thank you for making me do this. I don't think doTrain() should
> be exposed anymore. A new method train(DataIndexer) that calls
> isValid() and then doTrain(indexer) is probably a better idea. Is it
> important to calculate the hash of all events?

This should really be in the init method and not in isValid. What do you think?
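The proposed API change could look roughly like the sketch below. The names mirror the OpenNLP classes under discussion, but the interfaces here are simplified stand-ins, not the real signatures; the point is that the new overload reuses the existing validate-then-doTrain path, so behavior is identical to train(ObjectStream<Event>).

```java
// Simplified stand-ins for the real OpenNLP types (assumption).
interface DataIndexer { /* indexed events: contexts, outcomes, counts */ }
interface MaxentModel { /* trained classification model */ }

interface EventTrainer {
    // existing: MaxentModel train(ObjectStream<Event> events);
    MaxentModel train(DataIndexer indexer); // proposed addition
}

abstract class AbstractEventTrainer implements EventTrainer {

    // existing template method implemented by GIS, QNTrainer, etc.
    protected abstract MaxentModel doTrain(DataIndexer indexer);

    // validates the training parameters (e.g. sets the thread count)
    protected void isValid() { /* parameter checks omitted */ }

    @Override
    public MaxentModel train(DataIndexer indexer) {
        // same path as train(ObjectStream<Event>): validate first,
        // then hand the already-indexed data to doTrain
        isValid();
        return doTrain(indexer);
    }
}
```

Since all four 1.6.0 event trainers extend AbstractEventTrainer, implementing the overload once in the abstract class means no trainer subclass needs to change.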
The init method was added for the pluggable ML support in 1.6.0, and I believe I didn't see that this should be moved there.

Jörn

> Comment 3:
> Right, I want to change the DataIndexer.
>
> I have multiple models that classify data (job descriptions) into
> Occupational Codes. I know what the codes are a priori, and even if
> they are not in the training data, I need to make sure that there is
> SOME probability for the codes. More importantly, for each job
> description I need to compare the probabilities returned for each
> output. By forcing the output indices to have the same values, I can
> quickly compare them without re-mapping the output.
>
> I tried to extend OnePassDataIndexer, but the indexing occurs during
> object construction, so I cannot set the known outputs before
> indexing occurs.
>
> Of course I would not need the getDataIndexer() method, but if it is
> defined in the abstract class, why not in the interface?

The thing is that with the current interface we can support implementations which don't use the DataIndexer at all. This can be the case when a trainer relies on external machine learning libraries. Since 1.6.0 we have pluggable ML support.

I looked closer now: getDataIndexer is a factory method for the DataIndexer. Maybe it would make sense to allow specifying a custom class for data indexing as part of the training parameters? Then the trainers which use the DataIndexer can just support that mechanism.

Jörn
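Jörn's suggestion of picking the indexer class from the training parameters could be sketched as below. The "DataIndexer" parameter key, the no-arg constructor requirement, and the small interface are assumptions for illustration only; note that this approach also depends on moving the indexing work out of the constructor, which is exactly the rigidity Daniel points out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: choose the DataIndexer implementation via a
// training parameter, falling back to a built-in default.
interface DataIndexer {
    void index(); // separate indexing step, instead of indexing in the constructor
}

class DefaultDataIndexer implements DataIndexer {
    public void index() { /* one-pass indexing omitted */ }
}

class DataIndexerFactory {
    public static DataIndexer create(Map<String, String> trainParams) {
        String className = trainParams.get("DataIndexer"); // assumed key
        if (className == null) {
            return new DefaultDataIndexer(); // built-in default
        }
        try {
            // requires a public no-arg constructor, which is why the
            // indexing work would need to move into index()
            return (DataIndexer) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot instantiate DataIndexer: " + className, e);
        }
    }
}
```

This is the reflection route Daniel is skeptical of; his alternative, passing an already-constructed DataIndexer straight to train(DataIndexer), avoids the no-arg constructor constraint entirely.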