Github user etrain commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-61977531
  
    I've got several other comments on this PR - it's mostly good - and will 
leave some more detailed comments in a bit. TL;DR - what's wrong with Java 
serialization?
    
    But - PMML support seems a little like overkill here. I admit I'm fairly 
ignorant of the details of PMML, but my understanding is that it is designed to 
facilitate transferring models between languages - e.g. SAS to Java to R, etc. 
While I'm sure it's general-purpose enough to capture complex pipelines, I'd be 
surprised if it can do so efficiently.
    
    @jegonzal and @dcrankshaw are talking about a Java-based runtime 
environment for serving complex pipelines trained via Spark and MLlib. Spark is 
already pretty good at shipping JVM code around to remote sites and 
executing it - it does so by serializing and shipping standard Java objects - 
and I don't see why we should follow a very different pattern here. This design 
has been part of MLlib since day 1.
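    For concreteness, the "ship standard Java objects" pattern looks like 
this. `CenteringModel` is a made-up stand-in (not Spark's actual API) for any 
small, Serializable model object round-tripped through plain Java 
serialization:

    ```java
    import java.io.*;

    // Hypothetical small model object; any Serializable POJO works the same way.
    class CenteringModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double mean;
        CenteringModel(double mean) { this.mean = mean; }
        double transform(double x) { return x - mean; }
    }

    public class SerDemo {
        public static void main(String[] args) throws Exception {
            CenteringModel model = new CenteringModel(3.5);

            // Serialize to bytes, roughly as Spark does when shipping objects.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(model);
            }

            // Deserialize on the "remote" side and use it directly.
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                CenteringModel copy = (CenteringModel) ois.readObject();
                System.out.println(copy.transform(10.0)); // prints 6.5
            }
        }
    }
    ```

    No schema, no cross-language story - but for JVM-to-JVM transfer it's 
already there.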
    
    I don't want to speak for these two, but I don't think they have a problem 
with a runtime dependency on MLlib or some other JVM-based machine learning 
library; their main issue is that they don't want to have to call into a batch 
system (i.e. Spark) to execute their models.
    
    @jkbradley - I agree that some models or transformers are going to require 
a lot of state, and potentially distributed computation, but this should be the 
exception, not the rule. In general, an Estimator should compute a fairly small 
object (a Transformer) that is small enough to be passed around and doesn't 
need cluster resources to run. In the case of outlier removal, for example, I'd 
imagine that the Estimator would take several passes over the data and compute 
sufficient statistics used to filter new points.
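    A minimal sketch of that shape, with hypothetical names (this is not 
Spark's Estimator/Transformer API): the "estimator" scans the data, and the 
only thing it emits is a tiny serializable object holding two sufficient 
statistics (mean and standard deviation), which can score new points anywhere:

    ```java
    import java.io.Serializable;
    import java.util.List;

    // The "transformer": small, serializable, needs no cluster to run.
    class OutlierFilter implements Serializable {
        final double mean, std, zCut;
        OutlierFilter(double mean, double std, double zCut) {
            this.mean = mean; this.std = std; this.zCut = zCut;
        }
        // Keep a point if it lies within zCut standard deviations of the mean.
        boolean keep(double x) { return Math.abs(x - mean) <= zCut * std; }
    }

    // The "estimator": in Spark these passes would be distributed
    // aggregations, but the fitted result is still just two doubles.
    class OutlierEstimator {
        static OutlierFilter fit(List<Double> data, double zCut) {
            double mean = data.stream().mapToDouble(d -> d).average().orElse(0.0);
            double var = data.stream()
                             .mapToDouble(d -> (d - mean) * (d - mean))
                             .average().orElse(0.0);
            return new OutlierFilter(mean, Math.sqrt(var), zCut);
        }
    }
    ```

    Fitting may need the whole cluster; applying the result doesn't, which is 
exactly the small-Transformer case argued for above.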
    
    For cases like ALS, where the model is really big, this is exactly where 
@jegonzal and @dcrankshaw's research comes in. To make an analogy - just as I'd 
happily use Spark to compute a distributed inverted index, I certainly wouldn't 
use it to do point queries on that index. So some interface is required for 
transmitting this large distributed state to a system better prepared to 
answer point queries based on that state.
    
    At any rate - this is great stuff; I just want to make sure we don't get 
lost in the weeds of supporting the general case at the expense of streamlined 
support for the obvious case.


