[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

jkbradley Thu, 06 Nov 2014 15:26:30 -0800

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1269#issuecomment-62070537
  
    @akopich  This would be a great set of models to add to MLlib!  I wanted to 
follow up about some comments I saw before, and see if we can get this PR 
moving again.
    
    ### Public/private interfaces
    
    (@mengxr had commented about this.)
    
    The list of public classes which SparkQA lists is incomplete, so itâs 
important to check manually.  Since this is such a big PR, it will make 
reviewing much easier if you can summarize (a) public interfaces and (b) 
private implementation structure.
    
    Some examples of current issues are:
    * Enumerator should be private
    * DocumentOverTopicDistributionRegularizer and its extensions should 
probably be parameters (e.g., String parameters) passed to the learning 
algorithm.  We may want to update those classes later on, and exposing them to 
the user makes it much harder to change the API.
    
    Essentially, weâd like to keep the public interface as simple as 
possible.  That way, we can make internal changes without users having to 
change their code.
    
    ### Scala style
    
    sbt/sbt scalastyle checks some style points but cannot fix everything.  
It's really important for us to stick with a uniform style since it makes it 
much easier to read the many parts of the Spark code.  Iâd check against the 
Spark style guide which @mengxr mentioned.
    
    ### Code structure
    
    Improving the directory structure will be helpful for reviewers.  Moving 
private classes into an impl/ directory would help.
    
    Some code implements linear algebra operations which donât really belong 
in topicmodeling/.  Those should ideally be moved to mllib/linalg/ if they can 
fit nicely with the current API.  It will also be important to make sure the 
implementations are indeed faster than current linalg/ code.
    
    Similarly, the Dirichlet computation could fit better in mllib/stat/
    
    As @mengxr suggested, some classes appear similar and might work well as a 
combination.  I agree about PLSA and RobustPLSA; those seem similar enough that 
combining would be good.  I could imagine either:
    * Combine into 1 class, and have a robust: Boolean parameter specifying 
which method to use.
    * Have 1 subclass the other to avoid code duplication.
    (Maybe for other classes like GlobalCounters & RobustGlobalCounters too.)
    
    ### Other comments
    
    For getting this PR merged quickly, keeping it as simple as possible will 
be ideal.  If there are extra features, those should ideally be removed until 
later and sent as a separate PR.
    
    Proper testing (and figuring out issues like iterations taking longer and 
longer) will be important.  It would be ideal if it could be tested against 
another open-source implementation of these algorithms.
    
    Again, thanks for the PR, and I hope we can get it ready to merge without 
extra delay!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

Reply via email to