Github user helenahm commented on the issue:
https://github.com/apache/incubator-hivemall/pull/93
It will involve some work.
Let me explain.
You were right when you said that the OpenNLP implementation is poor
memory-wise. Indeed, they store the data in [][] arrays, and a few times
over. Using their code directly causes Java heap space errors, GC errors,
etc. (I tested that on my 97 million rows of data. The newer version of the
code has the same problems.) And you were right about the wonderful
CSRMatrix, and DoKMatrix too. They make it possible to store more data.
Thus, more or less, I have changed all the [][] arrays related to input data
to CSRMatrix, and the [][] arrays holding weights to DoKMatrix.
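To illustrate why a compressed sparse row (CSR) layout saves memory compared
with a dense [][] array, here is a minimal sketch. The class `CsrSketch` and
its methods are hypothetical names made up for this example; this is not
Hivemall's CSRMatrix API, only the general idea of storing non-zero cells
once, row by row.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, minimal CSR matrix for illustration; not Hivemall's CSRMatrix. */
final class CsrSketch {
    private final int[] rowPointers;   // entries of row i live in [rowPointers[i], rowPointers[i + 1])
    private final int[] columnIndices; // column of each stored non-zero value
    private final double[] values;     // the stored non-zero values

    private CsrSketch(int[] rowPointers, int[] columnIndices, double[] values) {
        this.rowPointers = rowPointers;
        this.columnIndices = columnIndices;
        this.values = values;
    }

    /** Builds a CSR matrix from a dense array, keeping only the non-zero cells. */
    static CsrSketch fromDense(double[][] dense) {
        int[] rowPointers = new int[dense.length + 1];
        List<Integer> cols = new ArrayList<>();
        List<Double> vals = new ArrayList<>();
        for (int i = 0; i < dense.length; i++) {
            for (int j = 0; j < dense[i].length; j++) {
                if (dense[i][j] != 0.0) {
                    cols.add(j);
                    vals.add(dense[i][j]);
                }
            }
            rowPointers[i + 1] = cols.size(); // end offset of row i
        }
        int[] columnIndices = new int[cols.size()];
        double[] values = new double[vals.size()];
        for (int k = 0; k < values.length; k++) {
            columnIndices[k] = cols.get(k);
            values[k] = vals.get(k);
        }
        return new CsrSketch(rowPointers, columnIndices, values);
    }

    int numRows() {
        return rowPointers.length - 1;
    }

    /** Number of values actually stored. */
    int nnz() {
        return values.length;
    }

    /** Returns the cell value, or 0.0 if the cell is not stored. */
    double get(int row, int col) {
        for (int k = rowPointers[row]; k < rowPointers[row + 1]; k++) {
            if (columnIndices[k] == col) {
                return values[k];
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        double[][] dense = {{0, 0, 3}, {0, 0, 0}, {1, 0, 2}};
        CsrSketch m = fromDense(dense);
        System.out.println(m.nnz() + " stored values for " + m.numRows() + " rows");
    }
}
```

With mostly-zero feature rows, the three flat arrays grow with the number of
non-zero cells rather than with rows times columns.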
To explain that further, it is best to look at the source code of the
GISTrainer, in fact at all three of them: old maxent, new maxent, and
Hivemall's BigGISTrainer. The links are below.
Newer GISTrainer:
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/ml/maxent/GISTrainer.java
Older (3.0.0) GISTrainer:
https://sourceforge.net/projects/maxent/files/ - whole archive
GISTrainer attached:
[GISTrainer.txt](https://github.com/apache/incubator-hivemall/files/1192806/GISTrainer.txt)
Hivemall GISTrainer:
https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/BigGISTrainer.java
Notice how trainModel of BigGISTrainer takes a MatrixForTraining
(https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java),
which contains references to the Matrix and to the outcomes. The Matrix is a
CSRMatrix, and row data is collected from the CSRMatrix in MatrixForTraining
instead of from the double[][], as in
ComparableEvent ev = x.createComparableEvent(ti, di.getPredicateIndex(),
di.getOMap());
(They use this convenience Event class to work with a row of data. Instead
of storing a List of Events in memory, the modified code builds an event
only when it is needed.)
The results are stored in
Matrix predCount = new DoKMatrix(numPreds, numOutcomes);
again instead of a [][] array.
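As a rough illustration of the two ideas above (visiting one row at a time
and accumulating counts in a dictionary-of-keys structure), here is a minimal
sketch. `DokCountsSketch`, `countPredicates`, and the `int[][]` input are
hypothetical stand-ins chosen for this example; they are not the actual
BigGISTrainer, MatrixForTraining, or DoKMatrix code, and the counting rule is
a simplified assumption (each active predicate in a row adds 1 for that row's
outcome).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical dictionary-of-keys count table; not Hivemall's DoKMatrix. */
final class DokCountsSketch {
    private final int numRows;
    private final int numCols;
    // Only cells that were ever touched are stored, keyed by a flattened index.
    private final Map<Long, Double> cells = new HashMap<>();

    DokCountsSketch(int numRows, int numCols) {
        this.numRows = numRows;
        this.numCols = numCols;
    }

    private long key(int row, int col) {
        if (row < 0 || row >= numRows || col < 0 || col >= numCols) {
            throw new IndexOutOfBoundsException("(" + row + ", " + col + ")");
        }
        return (long) row * numCols + col;
    }

    /** Returns the stored count, or 0.0 for a cell that was never touched. */
    double get(int row, int col) {
        return cells.getOrDefault(key(row, col), 0.0);
    }

    void add(int row, int col, double delta) {
        cells.merge(key(row, col), delta, Double::sum);
    }

    int storedCells() {
        return cells.size();
    }

    /**
     * Assumed, simplified counting rule: for each row, every active predicate
     * index adds 1 to the cell (predicate, outcome of that row). Rows are
     * visited one at a time, so no list of per-row event objects is kept.
     */
    static DokCountsSketch countPredicates(
            int[][] activePredicatesPerRow, int[] outcomes, int numPreds, int numOutcomes) {
        DokCountsSketch predCount = new DokCountsSketch(numPreds, numOutcomes);
        for (int row = 0; row < activePredicatesPerRow.length; row++) {
            for (int pred : activePredicatesPerRow[row]) {
                predCount.add(pred, outcomes[row], 1.0);
            }
        }
        return predCount;
    }

    public static void main(String[] args) {
        int[][] rows = {{0, 2}, {2}, {1, 2}};
        int[] outcomes = {1, 1, 0};
        DokCountsSketch predCount = countPredicates(rows, outcomes, 3, 2);
        System.out.println("stored cells: " + predCount.storedCells());
    }
}
```

A dense numPreds x numOutcomes array allocates every cell up front, whereas
this table only grows with the predicate/outcome pairs that actually occur.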
GISTrainer did not change very dramatically. If 3.0.0 training is reliable
enough, I would, of course, consider the existing version as 1.0 and put in
the effort to adapt GISTrainer later on. It makes sense to do that, I totally
agree. And perhaps it makes sense to continue after that by studying the
training process in greater detail, and perhaps to write a newer, comparable
trainer that will be independent of OpenNLP.