You can use your groupId as a grid parameter and filter your dataset on
this id in a pipeline stage before feeding it to the model.
The following may help:
http://spark.apache.org/docs/latest/ml-tuning.html
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder
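A minimal, Spark-free sketch of the idea (toy data, a plain dict standing in for the ParamGridBuilder grid, and a trivial closed-form regression standing in for a real estimator; all names here are hypothetical): treat groupId as just another grid parameter, filter the dataset to that group first, then fit one model per grid point.

```python
from itertools import product

# Hypothetical toy dataset: (groupId, feature, label) rows.
data = [
    ("a", 1.0, 2.1), ("a", 2.0, 4.2), ("a", 3.0, 5.9),
    ("b", 1.0, 3.0), ("b", 2.0, 6.1), ("b", 3.0, 9.2),
]

# The "grid": groupId sits alongside a regular hyperparameter
# (here a ridge penalty), exactly as it would in a ParamGridBuilder.
param_grid = {"groupId": ["a", "b"], "reg": [0.0, 0.1]}

def fit_one(group_id, reg):
    # Filtering stage: keep only this group's rows before fitting.
    rows = [(x, y) for g, x, y in data if g == group_id]
    # Trivial 1-D ridge regression through the origin (closed form),
    # standing in for whatever model you would actually train.
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return sxy / (sxx + reg)

# One fitted model per (groupId, hyperparameter) combination.
models = {
    (g, r): fit_one(g, r)
    for g, r in product(param_grid["groupId"], param_grid["reg"])
}
```

In real Spark ML the filtering step would be a custom Transformer whose groupId is a Param, so ParamGridBuilder can vary it like any other hyperparameter.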
The above should work, but I haven't tried it myself. What I have tried is
the following embarrassingly parallel architecture (TensorFlow was a
requirement in the use case):
See a PySpark/TensorFlow example here:
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
A relevant excerpt from the notebook mentioned above:
http://go.databricks.com/hubfs/notebooks/TensorFlow/Test_distributed_processing_of_images_using_TensorFlow.html
num_nodes = 4
# Chunk the experiment list into groups of at least 2, roughly one group per node.
n = max(2, int(len(all_experiments) // num_nodes))
grouped_experiments = [all_experiments[i:i+n] for i in range(0, len(all_experiments), n)]
# One partition per group, so each group of experiments runs on one executor.
all_exps_rdd = sc.parallelize(grouped_experiments, numSlices=len(grouped_experiments))
results = all_exps_rdd.flatMap(lambda z: [run(*y) for y in z]).collect()
Again, as above, you use your groupId as a parameter in the grid search;
this works if your full dataset fits in the memory of a single machine. You
can broadcast the dataset in a compressed format, and do the preprocessing
and feature engineering after filtering on groupId, to maximize the size of
the dataset that can use this modeling approach.
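A rough stdlib-only sketch of that broadcast-compressed-dataset pattern (pickle + zlib here, with a hypothetical run function and toy data; in Spark the blob would be an sc.broadcast variable and run would execute inside the mapped function on each executor):

```python
import pickle
import zlib

# Hypothetical full dataset: (groupId, raw_feature, label) rows.
full_dataset = [("a", 1, 2.0), ("a", 2, 4.1), ("b", 1, 3.0), ("b", 2, 6.2)]

# "Broadcast" the dataset in compressed form (stands in for sc.broadcast).
blob = zlib.compress(pickle.dumps(full_dataset))

def run(group_id):
    # Each worker decompresses the shared blob...
    rows = pickle.loads(zlib.decompress(blob))
    # ...filters on groupId FIRST, then does preprocessing / feature
    # engineering only on the surviving rows, keeping peak memory low.
    filtered = [(float(x), y) for g, x, y in rows if g == group_id]
    # Trivial per-group "model": least-squares slope through the origin.
    sxx = sum(x * x for x, _ in filtered)
    sxy = sum(x * y for x, y in filtered)
    return group_id, sxy / sxx

# Locally this is a plain loop; on Spark it would be the flatMap above.
results = [run(g) for g in ("a", "b")]
```

Filtering before featurizing is the key point: the broadcast stays small because it carries raw compressed rows, and each worker only ever materializes one group's engineered features.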
Masood
--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh
From: Xiaomeng Wan
To: User
Date: 2016-11-29 11:54
Subject: build models in parallel
I want to divide big data into groups (e.g. group by some id) and build one
model for each group. I am wondering whether I can parallelize the model
building process by implementing a UDAF (e.g. running LinearRegression in
its evaluate method). Is it good practice? Does anybody have experience? Thanks!
Regards,
Shawn