Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-62070537
@akopich This would be a great set of models to add to MLlib! I wanted to
follow up about some comments I saw before, and see if we can get this PR
moving again.
### Public/private interfaces
(@mengxr had commented about this.)
The list of public classes which SparkQA lists is incomplete, so itâs
important to check manually. Since this is such a big PR, it will make
reviewing much easier if you can summarize (a) public interfaces and (b)
private implementation structure.
Some examples of current issues are:
* Enumerator should be private
* DocumentOverTopicDistributionRegularizer and its extensions should
probably be parameters (e.g., String parameters) passed to the learning
algorithm. We may want to update those classes later on, and exposing them to
the user makes it much harder to change the API.
Essentially, weâd like to keep the public interface as simple as
possible. That way, we can make internal changes without users having to
change their code.
### Scala style
sbt/sbt scalastyle checks some style points but cannot fix everything.
It's really important for us to stick with a uniform style since it makes it
much easier to read the many parts of the Spark code. Iâd check against the
Spark style guide which @mengxr mentioned.
### Code structure
Improving the directory structure will be helpful for reviewers. Moving
private classes into an impl/ directory would help.
Some code implements linear algebra operations which donât really belong
in topicmodeling/. Those should ideally be moved to mllib/linalg/ if they can
fit nicely with the current API. It will also be important to make sure the
implementations are indeed faster than current linalg/ code.
Similarly, the Dirichlet computation could fit better in mllib/stat/
As @mengxr suggested, some classes appear similar and might work well as a
combination. I agree about PLSA and RobustPLSA; those seem similar enough that
combining would be good. I could imagine either:
* Combine into 1 class, and have a robust: Boolean parameter specifying
which method to use.
* Have 1 subclass the other to avoid code duplication.
(Maybe for other classes like GlobalCounters & RobustGlobalCounters too.)
### Other comments
For getting this PR merged quickly, keeping it as simple as possible will
be ideal. If there are extra features, those should ideally be removed until
later and sent as a separate PR.
Proper testing (and figuring out issues like iterations taking longer and
longer) will be important. It would be ideal if it could be tested against
another open-source implementation of these algorithms.
Again, thanks for the PR, and I hope we can get it ready to merge without
extra delay!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]