Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-65682412
@akopich Thanks for the responses! Follow-ups:
(1) Users implementing their own regularizers
You're right that this would be nice to have. If we include it, then
perhaps we can spend more time to have a clean API for regularizers which can
be re-used in other algorithms which try to optimize an objective. (E.g.,
ideally, Dirichlet regularization would be implemented such that any algorithm
which needed a Dirichlet regularizer (or prior) could reuse your code.) Here
are some thoughts on that:
* All of the regularizers here operate on each Matrix element individually.
If that will be the case for all useful regularizers, then the regularizer API
could operate per-element (to be simpler), and the code using the regularizers
could iterate over all elements as needed.
* The regularizer API should use general terminology, such as:
* penalty(param: Double): Double
* gradient(param: Double): Double
Alternatively, we could use this development path to avoid having to decide
on the API right now:
* In this PR, keep the regularizer types public, but make all of their
internals private[mllib]. This way, the API choices are kept private for now.
* In a later PR, the regularizer API can be refined and made public so that
users can implement their own pluggable types.
(2) Regular and Robust in the same class
You should be able to extract the functionality you need. E.g., if the
newIteration method knows that it has a DocumentParameters instance, it can
call getNewTheta(). If the actual type of the instance is
RobustDocumentParameters, then RobustDocumentParameters.getNewTheta() will be
called. But I would recommend having both classes implement getNewTheta() with
the same visibility (private/public), and also to use the âoverrideâ
keyword in RobustDocumentParameters.
You should be able to abstract any other needed functionality similarly.
(3) PLSA and RobustPLSA code duplication
Looking more closely, I think youâre right about it being hard to
abstract further. (Iâll let you know if I have ideas.)
(4) Float vs. Double
True, it may be worth the trouble to use Float to save on memory and
communication. I donât know enough about PLSA to know how important
numerical precision is in general. Your approach sounds reasonable then. One
alternative would be to user Breeze matrices (but not in public APIs!), but
Iâd only suggest that if it will simplify or shorten code.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]