[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

jkbradley Thu, 04 Dec 2014 10:54:25 -0800

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1269#issuecomment-65682412
  
    @akopich  Thanks for the responses!  Follow-ups:
    
    (1) Users implementing their own regularizers
    
    You're right that this would be nice to have.  If we include it, then 
perhaps we can spend more time to have a clean API for regularizers which can 
be re-used in other algorithms which try to optimize an objective.  (E.g., 
ideally, Dirichlet regularization would be implemented such that any algorithm 
which needed a Dirichlet regularizer (or prior) could reuse your code.)  Here 
are some thoughts on that:
    * All of the regularizers here operate on each Matrix element individually. 
 If that will be the case for all useful regularizers, then the regularizer API 
could operate per-element (to be simpler), and the code using the regularizers 
could iterate over all elements as needed.
    * The regularizer API should use general terminology, such as:
     * penalty(param: Double): Double
     * gradient(param: Double): Double
    
    Alternatively, we could use this development path to avoid having to decide 
on the API right now:
    * In this PR, keep the regularizer types public, but make all of their 
internals private[mllib].  This way, the API choices are kept private for now.
    * In a later PR, the regularizer API can be refined and made public so that 
users can implement their own pluggable types.
    
    (2) Regular and Robust in the same class
    
    You should be able to extract the functionality you need.  E.g., if the 
newIteration method knows that it has a DocumentParameters instance, it can 
call getNewTheta().  If the actual type of the instance is 
RobustDocumentParameters, then RobustDocumentParameters.getNewTheta() will be 
called.  But I would recommend having both classes implement getNewTheta() with 
the same visibility (private/public), and also to use the âoverrideâ 
keyword in RobustDocumentParameters.
    
    You should be able to abstract any other needed functionality similarly.
    
    (3) PLSA and RobustPLSA code duplication
    
    Looking more closely, I think youâre right about it being hard to 
abstract further.  (Iâll let you know if I have ideas.)
    
    (4) Float vs. Double
    
    True, it may be worth the trouble to use Float to save on memory and 
communication.  I donât know enough about PLSA to know how important 
numerical precision is in general.  Your approach sounds reasonable then.  One 
alternative would be to user Breeze matrices (but not in public APIs!), but 
Iâd only suggest that if it will simplify or shorten code.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

Reply via email to