Github user helenahm commented on the issue:
https://github.com/apache/incubator-hivemall/pull/93
I will do more tests too, as I actually need the model for a project, so I
plan to test it under load as well. I will write about the results.
It may have issues similar to those Random Forest has. You are right: in a
nutshell, the implementation and the memory concerns are similar.
The implementation is as scalable as the Random Forest implementation:
one or more models are trained per mapper, and then a UDAF combines all the
learned models into one final model.
I still use Random Forest, even though _numTrees greater than 1 does not
work for me on EMR r4 machines with my dataset. I think MaxEnt will give me a
better model, though, since I will not have to worry about overfitting due to
the tree structure, etc.
Iterative Scaling could also be re-implemented from scratch, without using
any third-party software. That is another option.
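To give a sense of what a from-scratch rewrite would involve, here is a short sketch of Generalized Iterative Scaling (GIS) for a conditional maxent model, the training procedure used in Ratnaparkhi's work. The function names and toy data are my own assumptions, and to keep the sketch small it assumes every example activates the same number of features, so no correction (slack) feature is needed.

```python
# Illustrative GIS sketch, not Hivemall or OpenNLP code.
import math
from collections import defaultdict

def gis_train(data, labels, iterations=100):
    """data: list of lists of active feature names; labels: parallel list."""
    C = len(data[0])
    assert all(len(x) == C for x in data), "equal feature counts assumed"
    n = len(data)
    label_set = sorted(set(labels))
    # Empirical expectations E~[f] over (feature, label) pairs.
    emp = defaultdict(float)
    for x, y in zip(data, labels):
        for w in x:
            emp[(w, y)] += 1.0 / n
    lam = defaultdict(float)  # one weight per (feature, label) pair

    def posterior(x):
        scores = {y: math.exp(sum(lam[(w, y)] for w in x)) for y in label_set}
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}

    for _ in range(iterations):
        # Model expectations E_p[f] under the current weights.
        model = defaultdict(float)
        for x in data:
            for y, py in posterior(x).items():
                for w in x:
                    model[(w, y)] += py / n
        # GIS multiplicative update, written in log space.
        for f in emp:
            lam[f] += math.log(emp[f] / model[f]) / C
    return lam, posterior

data = [["sunny", "warm"], ["sunny", "cold"],
        ["rainy", "cold"], ["rainy", "warm"]]
labels = [1, 1, 0, 0]
weights, predict = gis_train(data, labels)
print(predict(["sunny", "warm"])[1])  # approaches 1.0 on this separable toy set
```

Each iteration only needs one pass over the data to compute model expectations, which is what makes the update pattern a natural fit for a map/aggregate rewrite.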
I am sure the NLP community will be more likely to accept this
implementation and will use it exactly as those authors wrote it. We very much
value Adwait Ratnaparkhi's work: many published articles use exactly that
MaxEnt implementation, which means people will be able to use HiveMall and
compare their newer results with the results of their previous work.