GitHub user bgreeven commented on the pull request:

    https://github.com/apache/spark/pull/1290#issuecomment-54927726
  
    Thanks for your feedback. Your points are very helpful indeed.
    
    Here is my response:
    
      1.  The user guide is for normal users and it should focus on how to use 
ANN. If we want to leave some notes for developers, we can append a section at 
the end.
    [bgreeven]: Sure. I think the user guide needs a lot of revision anyway, 
but as you said, it is better to wait until the code is more stable before 
updating it.
    
      2.  We don't ask users to treat unit tests as demos or examples. Instead, 
we put a short code snippet in the user guide and put a complete example under 
examples/.
    [bgreeven]: OK, I'll see how to convert the demo into a unit test.
    
      3.  GeneralizedModel and GeneralizedAlgorithm are definitely out of the 
scope of this PR and they should not live under mllib.ann. We can make a 
separate JIRA to discuss the APIs. Could you remove them in this PR?
      4.  predict outputs the prediction for the first node. Would the first 
node be the only special node? How about having predict(v) output the full 
prediction and predict(v, i) output the prediction to the i-th node?
    
    [bgreeven]: I certainly understand your concerns on points 3 and 4. My 
reason for adding GeneralizedModel and GeneralizedAlgorithm is that I see more 
uses for ANNs than classification alone. A LabeledPoint-based implementation 
would restrict the output to essentially a one-dimensional value. If you want 
to learn e.g. a multidimensional function (such as in the demo), then you need 
something more general than LabeledPoint.
    
    The architecture of taking only the first element of an output vector is 
for legacy reasons. GeneralizedLinearModel (on which GeneralizedModel was 
modelled) as well as ClassificationModel only output a one-dimensional value, 
hence I made the interface of predict(v) the same and created a separate 
function predictV(v) to output the multidimensional result.
    
    I think we can indeed open a separate JIRA to discuss this, since there are 
other uses for multidimensional output than just classification. A rough sketch 
of the suggested predict(v) / predict(v, i) interface is below.
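
    Just to make the discussion concrete, here is a minimal sketch of the shape 
such an interface could take, assuming an mllib-style Vector model; 
VectorPredictor and IdentityModel are made-up names for illustration and are not 
classes from this PR:

    ```scala
    // Sketch only: predict(v) returns the full output vector, predict(v, i) the
    // value of the i-th output node. VectorPredictor / IdentityModel are
    // placeholder names, not classes from this PR.
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    trait VectorPredictor {
      /** Full multidimensional prediction (all output nodes). */
      def predict(v: Vector): Vector

      /** Prediction for the i-th output node only. */
      def predict(v: Vector, i: Int): Double = predict(v).apply(i)
    }

    // Trivial stand-in model (scales the input) just to show the trait in use.
    class IdentityModel(scale: Double) extends VectorPredictor {
      override def predict(v: Vector): Vector = Vectors.dense(v.toArray.map(_ * scale))
    }
    ```

    With that shape, the legacy behaviour of returning only the first node would 
simply be predict(v, 0).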
    
      5.  Could you try to use LBFGS instead of GradientDescent?
    [bgreeven]: Tried it, and that works too. Actually, I would like to make the 
code more flexible, to allow for replacing the optimisation function. There is a 
lot of research on (the parallelisation of) training ANNs, so the future may 
bring better optimisation strategies, and it should be easy to plug those into 
the existing code. A sketch of how this could hook into the existing Optimizer 
trait is below.
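
    For reference, a rough sketch of plugging in LBFGS via the existing 
org.apache.spark.mllib.optimization.Optimizer abstraction; the 
LeastSquaresGradient / SimpleUpdater pair below is only a stand-in for the ANN's 
own gradient and updater:

    ```scala
    // Sketch under the assumption that the ANN trainer keeps an optimizer that
    // implements the Optimizer trait, so GradientDescent and LBFGS are interchangeable.
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SimpleUpdater}
    import org.apache.spark.rdd.RDD

    object LbfgsSketch {
      // LeastSquaresGradient / SimpleUpdater are placeholders for the ANN-specific
      // gradient and updater that the real implementation would supply.
      def optimizeWeights(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
        val lbfgs = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
          .setNumCorrections(10)
          .setConvergenceTol(1e-4)
          .setMaxNumIterations(100)
          .setRegParam(0.0)
        lbfgs.optimize(data, initialWeights)
      }
    }
    ```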
    
      6.  Please replace for loops with while loops. The latter are faster in 
Scala.
    [bgreeven]: Makes sense. Will do so; a quick illustration is below.
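
    For instance (assuming typical index loops over weight arrays), a rewrite 
along these lines:

    ```scala
    // The for version desugars to Range.foreach with a closure, which is slower
    // in hot numerical loops than a hand-written while loop.
    def sumFor(xs: Array[Double]): Double = {
      var sum = 0.0
      for (i <- 0 until xs.length) {
        sum += xs(i)
      }
      sum
    }

    def sumWhile(xs: Array[Double]): Double = {
      var sum = 0.0
      var i = 0
      while (i < xs.length) {
        sum += xs(i)
        i += 1
      }
      sum
    }
    ```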
    
      7.  Please follow the Spark Code Style Guide 
(https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and 
update the style, e.g.:
          a. remove space after ( and before )
          b. add ScalaDoc for all public classes and methods
          c. line width should be smaller than 100 chars (in both main and test)
          d. some verification code is left as comments; please find a way to 
move it to unit tests
          e. organize imports into groups and order them alphabetically within 
each group
          f. do not add return or ; unless they are necessary
    [bgreeven]: OK, I can do that. By the way, it seems that the Spark Code 
Style Guide is missing some rules. I would be happy to volunteer to expand it, 
also since "sbt/sbt scalastyle" enforces some rules (such as mandatory spaces 
before and after '+') that are not mentioned in the Style Guide. A small 
before/after example of these style points is below.
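
    To make sure I understand the rules correctly, a small before/after example 
(the imports are unused and only there to illustrate the grouping, which I 
assume to be java, scala, third-party, spark):

    ```scala
    // Before (style issues): space inside parens, unnecessary `return` and `;`, no ScalaDoc.
    //   def squaredError( a: Double, b: Double ): Double = { return (a - b) * (a - b); }

    // After: imports organised into groups, ordered alphabetically within each
    // group; no space inside parens; no `return` or `;`; ScalaDoc on the public method.
    import java.util.Random

    import scala.collection.mutable.ArrayBuffer

    import org.apache.spark.SparkContext

    object StyleSketch {
      /** Squared error between a single prediction and its target. */
      def squaredError(a: Double, b: Double): Double = (a - b) * (a - b)
    }
    ```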
    
      8.  Please use existing unit tests as templates. For example, please 
rename TestParallelANN to ParallelANNSuite and use LocalSparkContext and 
FunSuite for Spark setup and test assertions. Remove main() from unit tests.
    [bgreeven]: OK, I will look at this and see how to convert the demo to a 
unit test; a rough skeleton of such a suite is sketched below.
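
    Something along these lines, presumably; the XOR data and the final 
assertion are placeholders until the trainer call is wired in, and 
LocalSparkContext refers to the existing trait under mllib's test sources:

    ```scala
    // Rough skeleton only; the training call is left as a placeholder.
    import org.scalatest.FunSuite

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.LocalSparkContext

    class ParallelANNSuite extends FunSuite with LocalSparkContext {

      test("ANN learns the XOR function") {
        val data = sc.parallelize(Seq(
          (Vectors.dense(0.0, 0.0), Vectors.dense(0.0)),
          (Vectors.dense(0.0, 1.0), Vectors.dense(1.0)),
          (Vectors.dense(1.0, 0.0), Vectors.dense(1.0)),
          (Vectors.dense(1.0, 1.0), Vectors.dense(0.0))))

        // val model = <train on `data`>
        // data.collect().foreach { case (in, target) =>
        //   assert(math.abs(model.predict(in)(0) - target(0)) < 0.1)
        // }
        assert(data.count() === 4)  // placeholder until the trainer call replaces it
      }
    }
    ```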
    
      9.  Is Parallel necessary in the name ParallelANN?
    [bgreeven]: Not really. A better name would indeed be desirable.
    
      10.  What methods and classes are public? Could they be private or package 
private? Please generate the API doc and check the public ones.
    [bgreeven]: Yes, I found out about this too. Some classes and methods need 
to be made public, as they currently cannot be accessed from outside. Maybe 
adding a Scala object as the interface (as is done in Alexander's code) is 
indeed better; a sketch of that pattern is below.
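
    For example (names and parameters are made up; the pattern just mirrors how 
e.g. SVMWithSGD exposes a public train() while keeping the implementation class 
package private):

    ```scala
    // Illustrative shape only; ANNWithLBFGS / ANNModel / ANN are placeholder names.
    package org.apache.spark.mllib.ann

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    /** Trained network; construction is restricted to mllib code. */
    class ANNModel private[mllib] (val weights: Vector) {
      def predict(v: Vector): Vector = ???   // feed-forward pass, omitted in this sketch
    }

    /** Package-private implementation class: free to change without breaking the API. */
    private[mllib] class ANNWithLBFGS(hiddenLayers: Array[Int], numIterations: Int) {
      def run(data: RDD[(Vector, Vector)]): ANNModel = ???   // training loop, omitted
    }

    /** Small public entry point, analogous to SVMWithSGD.train. */
    object ANN {
      def train(
          data: RDD[(Vector, Vector)],
          hiddenLayers: Array[Int],
          numIterations: Int): ANNModel = {
        new ANNWithLBFGS(hiddenLayers, numIterations).run(data)
      }
    }
    ```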


