Hi Jörn,

1) I agree that the field values should be set in the init method for the QNTrainer. Other minor changes I would make include adding a getNumberOfThreads() method to AbstractTrainer, with a default of 1, and modifying GIS to use the value set in the TrainingParameters. I also think this needs to be documented more clearly. I can take a stab at some text and send it out to the community.
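To make the proposal concrete, here is a minimal sketch of the suggested getNumberOfThreads() helper on AbstractTrainer. The "Threads" key name and the Map-based parameter store are stand-ins for illustration; the real TrainingParameters API may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed AbstractTrainer addition.
abstract class AbstractTrainerSketch {

    // stand-in for the TrainingParameters settings (assumption)
    protected Map<String, String> trainParams = new HashMap<>();

    // Proposed accessor: read the thread count from the training
    // parameters, defaulting to 1 when the key is absent or malformed.
    public int getNumberOfThreads() {
        String value = trainParams.get("Threads");
        if (value == null) {
            return 1; // proposed default
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            return 1;
        }
    }
}
```

A trainer such as GIS could then call getNumberOfThreads() instead of taking the thread count through an overloaded train method.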
2) The One/TwoPassDataIndexers are very rigid classes because they do the indexing in the constructor. The trainers deal with that rigidity by overloading methods that take a cutoff/sort flag/number of threads. In my opinion, that was not the best way to do it; there should have been a constructor and a separate index method. At this point it may be a large refactoring effort for little gain.

As of 1.6.0, I see that there are 4 EventTrainers (GIS, NaiveBayesTrainer, PerceptronTrainer, QNTrainer). All of these are AbstractEventTrainers. So if we add a public MaxentModel train(DataIndexer di) to both the interface and the abstract class, nothing should break. If you are concerned: the train(ObjectStream<Event>) method calls doTrain(DataIndexer), so the behavior is exactly the same for the two methods; the new method would just call doTrain(DataIndexer) directly. This would not be available in the 1.5.x versions.

I don't like the idea of using a factory, because the DataIndexer requires you to pass parameters into the constructor. It may be possible to use reflection, but why make life so difficult? Let the client give the trainer the DataIndexer. In my not-so-expert opinion, adding a train(DataIndexer) method to EventTrainer is the best way forward.

Daniel

=============

On 10/27/16, 3:17 PM, "Joern Kottmann" <kottm...@gmail.com> wrote:

On Thu, 2016-10-27 at 16:04 +0000, Russ, Daniel (NIH/CIT) [E] wrote:
> Hello,
>
> Okay, I found why my toy worked. I call
> AbstractEventTrainer.doTrain(DataIndexer) as opposed to
> AbstractEventTrainer.train(ObjectStream<Event>). The train method
> calls isValid(), which sets the value of threads in QNTrainer.
>
> Thank you for making me do this. I don't think doTrain() should
> be exposed anymore. A new method train(DataIndexer) that calls
> isValid() and then doTrain(indexer) is probably a better idea. Is it
> important to calculate the hash of all events?

This should really be in the init method and not in isValid. What do you think?
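The proposed API change could look roughly like the sketch below. The names mirror the OpenNLP classes under discussion, but the interfaces here are simplified stand-ins, not the real signatures; the point is that the new overload reuses the existing validate-then-doTrain path, so behavior is identical to train(ObjectStream<Event>).

```java
// Simplified stand-ins for the real OpenNLP types (assumption).
interface DataIndexer { /* indexed events: contexts, outcomes, counts */ }
interface MaxentModel { /* trained classification model */ }

interface EventTrainer {
    // existing: MaxentModel train(ObjectStream<Event> events);
    MaxentModel train(DataIndexer indexer); // proposed addition
}

abstract class AbstractEventTrainer implements EventTrainer {

    // existing template method implemented by GIS, QNTrainer, etc.
    protected abstract MaxentModel doTrain(DataIndexer indexer);

    // validates the training parameters (e.g. sets the thread count)
    protected void isValid() { /* parameter checks omitted */ }

    @Override
    public MaxentModel train(DataIndexer indexer) {
        // same path as train(ObjectStream<Event>): validate first,
        // then hand the already-indexed data to doTrain
        isValid();
        return doTrain(indexer);
    }
}
```

Since all four 1.6.0 event trainers extend AbstractEventTrainer, implementing the overload once in the abstract class means no trainer subclass needs to change.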
The init method was added for the pluggable ML support in 1.6.0, and I believe I didn't see that this should be moved there.

Jörn

> Comment 3:
> Right, I want to change the DataIndexer.
>
> I have multiple models that classify data (job descriptions) into
> Occupational Codes. I know what the codes are a priori, and even if
> they are not in the training data, I need to make sure that there is
> SOME probability for the codes. More importantly, for each job
> description I need to compare the probabilities returned for each
> output. By forcing the output indices to have the same values, I can
> quickly compare them without re-mapping the output.
>
> I tried to extend OnePassDataIndexer, but the indexing occurs during
> object construction, so I cannot set the known outputs before
> indexing occurs.
>
> Of course I would not need the getDataIndexer() method, but if it is
> defined in the abstract class, why not in the interface?

The thing is that with the current interface we can support implementations which don't use the DataIndexer at all. This can be the case when a trainer relies on external machine learning libraries. Since 1.6.0 we have pluggable ML support.

I looked closer now: getDataIndexer is a factory method for the DataIndexer. Maybe it would make sense to allow specifying a custom class for data indexing as part of the training parameters? Then the trainers which use the DataIndexer can just support that mechanism.

Jörn
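Jörn's suggestion of picking the indexer class from the training parameters could be sketched as below. The "DataIndexer" parameter key, the no-arg constructor requirement, and the small interface are assumptions for illustration only; note that this approach also depends on moving the indexing work out of the constructor, which is exactly the rigidity Daniel points out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: choose the DataIndexer implementation via a
// training parameter, falling back to a built-in default.
interface DataIndexer {
    void index(); // separate indexing step, instead of indexing in the constructor
}

class DefaultDataIndexer implements DataIndexer {
    public void index() { /* one-pass indexing omitted */ }
}

class DataIndexerFactory {
    public static DataIndexer create(Map<String, String> trainParams) {
        String className = trainParams.get("DataIndexer"); // assumed key
        if (className == null) {
            return new DefaultDataIndexer(); // built-in default
        }
        try {
            // requires a public no-arg constructor, which is why the
            // indexing work would need to move into index()
            return (DataIndexer) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot instantiate DataIndexer: " + className, e);
        }
    }
}
```

This is the reflection route Daniel is skeptical of; his alternative, passing an already-constructed DataIndexer straight to train(DataIndexer), avoids the no-arg constructor constraint entirely.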