[GitHub] spark pull request: [SPARK-7127] [MLLIB] [WIP] Adding broadcast of...

BryanCutler Fri, 10 Jul 2015 16:19:38 -0700

Github user BryanCutler commented on the pull request:

    https://github.com/apache/spark/pull/6300#issuecomment-120549044
  
    Hi @jkbradley, thanks for checking this out!  I'm not sure I understand a 
couple of things from your suggestion.
    
    If a subclass implements `transformImpl(dataset: DataFrame)`, broadcasts, 
and then proceeds with a `dataset.map(...)`, the result is an RDD, which would 
have to be converted back into a DataFrame to return.  That seems like an 
inefficient step, which is why I tried to stick with DataFrames.
    
    Also, the model parameters need to be accessed from inside 
`predict(features: Double)` of a subclass, like 
RandomForestClassificationModel, so the only ways to do this are to change the 
signature of `predict` to take the broadcast var as a parameter, or to make the 
broadcast var a member of RandomForestClassifier.  Both of those seemed like 
bad ideas, which is why I added `predictImpl`, which lets broadcasted and 
non-broadcasted models share the same code.
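
    To make that second point concrete, here is a minimal plain-Scala sketch of 
the `predictImpl` idea with no Spark dependency.  Everything in it is an 
assumption for illustration: `Handle` is a hypothetical stand-in for a 
broadcast variable, `Params` and `SketchModel` are made-up names, and the 
signature of `predictImpl` here is not necessarily the one in the PR.

```scala
// Hypothetical stand-in for a broadcast handle: it only wraps a value.
final case class Handle[T](value: T)

// Hypothetical model parameters (a weight vector, not a real tree ensemble).
// Vector here is Scala's immutable collection, not Spark's mllib Vector.
final case class Params(weights: Vector[Double])

class SketchModel(val params: Params) {
  // Non-broadcast path: the public signature of predict stays unchanged.
  def predict(features: Vector[Double]): Double =
    predictImpl(params, features)

  // Shared logic: parameters are passed in explicitly, so this code does not
  // care whether they came from the model itself or from a broadcast handle.
  protected def predictImpl(p: Params, features: Vector[Double]): Double =
    p.weights.zip(features).map { case (w, x) => w * x }.sum

  // Broadcast path: read the parameters through the handle, reuse predictImpl.
  def transform(rows: Seq[Vector[Double]], handle: Handle[Params]): Seq[Double] =
    rows.map(row => predictImpl(handle.value, row))
}
```

    The design choice being illustrated is that neither the `predict` 
signature nor the model's fields need to know about the broadcast variable; 
only the transform path touches the handle.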
    
    Sorry if I am missing something.  Could you elaborate on how you were 
thinking of using the broadcast variable in a map?



