Github user BryanCutler commented on the pull request:
https://github.com/apache/spark/pull/6300#issuecomment-120549044
Hi @jkbradley, thanks for checking this out! I'm not sure I understand a
couple of things from your suggestion.
If a subclass implements `transformImpl(dataset: DataFrame)`, broadcasts
the model parameters, and then proceeds with a `dataset.map(...)`, the
result is an RDD, which would have to be converted back into a DataFrame
before it can be returned. That seems like an inefficient step, which is
why I tried to stick with DataFrames.
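For illustration, here is a rough sketch of what staying within DataFrames
can look like: the broadcast variable is read inside a UDF and the result is
appended with `withColumn`, so there is no RDD to convert back. This is not
the code from the PR. It assumes the later `SparkSession` API, and
`BroadcastUdfSketch`, `addPredictions`, the column names and the linear
"model" are all made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object BroadcastUdfSketch {
  // Appends a "prediction" column without leaving the DataFrame API.
  // Hypothetical helper: assumes the input has Double columns "x0" and "x1".
  def addPredictions(spark: SparkSession,
                     dataset: DataFrame,
                     weights: Array[Double]): DataFrame = {
    // Ship the (hypothetical) model weights to the executors once.
    val bcWeights = spark.sparkContext.broadcast(weights)

    // The UDF closes over the broadcast handle and reads its value per row.
    val predictUdf = udf { (x0: Double, x1: Double) =>
      val w = bcWeights.value
      w(0) * x0 + w(1) * x1
    }

    dataset.withColumn("prediction", predictUdf(col("x0"), col("x1")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("broadcast-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val dataset = Seq((1.0, 2.0), (3.0, 4.0)).toDF("x0", "x1")
    addPredictions(spark, dataset, Array(0.5, -1.0)).show()

    spark.stop()
  }
}
```

The `dataset.map(...)` route would compute the same values, but the rows
would then need a schema attached again to become a DataFrame.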
Also, the model parameters need to be accessed from inside
`predict(features: Double)` of a subclass, like
RandomForestClassificationModel. The only ways I see to do this are to
change the signature of `predict` to take the broadcast var as a parameter,
or to make the broadcast var a member of RandomForestClassifier. Both of
those seemed like bad ideas, which is why I added `predictImpl`, which lets
broadcast and non-broadcast models share the same code.
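To make the `predictImpl` idea concrete, here is a minimal sketch in plain
Scala, not the actual code in the PR. `SketchModel`, `Params`,
`transformRows` and `FakeBroadcast` (a stand-in for Spark's `Broadcast`, so
the example runs without Spark) are hypothetical names, and the dot-product
"model" is only a placeholder for a real forest.

```scala
// Hypothetical stand-in for org.apache.spark.broadcast.Broadcast,
// so the sketch runs without Spark on the classpath.
final case class FakeBroadcast[T](value: T)

// Hypothetical model parameters; a real forest would hold trees instead.
final case class Params(weights: Vector[Double])

class SketchModel(val params: Params) {
  // Public signature stays unchanged: callers never see a broadcast variable.
  def predict(features: Vector[Double]): Double =
    predictImpl(features, params)

  // Shared logic: the parameters arrive as an argument, not from `this`.
  protected def predictImpl(features: Vector[Double], p: Params): Double =
    features.zip(p.weights).map { case (x, w) => x * w }.sum

  // Batch path: each row reads the parameters from the broadcast wrapper.
  def transformRows(rows: Seq[Vector[Double]],
                    bc: FakeBroadcast[Params]): Seq[Double] =
    rows.map(row => predictImpl(row, bc.value))
}

object PredictImplSketch {
  def main(args: Array[String]): Unit = {
    val model = new SketchModel(Params(Vector(0.5, -1.0)))
    val rows = Seq(Vector(1.0, 2.0), Vector(3.0, 4.0))

    val viaBroadcast = model.transformRows(rows, FakeBroadcast(model.params))
    val direct = rows.map(model.predict)

    // Both paths go through predictImpl, so they must agree.
    assert(viaBroadcast == direct)
    println(viaBroadcast)
  }
}
```

With this shape, `predict` keeps its signature and the broadcast variable
never has to become a member of the model.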
Sorry, maybe I am missing something. Could you elaborate on how you were
thinking of using the broadcast variable in a map?