Thanks, Guru. After reading the implementations of StreamingKMeans, StreamingLinearRegressionWithSGD and StreamingLogisticRegressionWithSGD, I reached the same conclusion. Unfortunately, this distinction between true online learning and offline learning is only implied in the documentation, so I was not sure whether my understanding was correct. Thanks for confirming this!
However, I have a different opinion on your last paragraph, i.e. that we cannot hold out test data during model training for online learning. Taking StreamingLinearRegressionWithSGD as an example, you can certainly split each micro-batch 70/30 and do evaluation based on the RMSE. At the very beginning the RMSE will be large, but as more and more micro-batches arrive, you should see the RMSE become smaller as the weights approach the optimum. IMHO, I don't see much difference between online and offline learning with regard to holding out test data.

Lan

> On Mar 6, 2016, at 2:43 AM, Chris Miller <cmiller11...@gmail.com> wrote:
>
> Guru: This is a really great response. Thanks for taking the time to explain all of this. Helpful for me too.
>
> --
> Chris Miller
>
> On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani <gdm...@gmail.com> wrote:
> Hi Lan,
>
> Streaming KMeans, Linear Regression and Logistic Regression support online machine learning, as you mentioned. Online machine learning is where the model is trained and updated on every batch of streaming data. These models have trainOn() and predictOn() methods where you can simply pass in the DStreams you want to train the model on and the DStreams you want the model to predict on. So when the next batch of data arrives, the model is trained and updated again. In this case the model weights are continually updated, and hopefully the model performs better in terms of convergence and accuracy over time. What we are really doing in the online learning case is showing the model only a few examples of the data at a time (a stream of data) and updating the parameters, in the case of Linear and Logistic Regression, or updating the centers, in the case of K-Means. For Linear and Logistic Regression this is possible because of the optimizer chosen to minimize the cost function: Stochastic Gradient Descent.
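A plain-Python sketch (illustrative only, not Spark code; all names here are made up for the example) of the per-batch train/evaluate loop described above: each incoming mini-batch from y = 3x + noise is split 70/30, SGD updates a single weight on the 70% split, and the RMSE measured on the 30% hold-out shrinks as batches accumulate.

```python
import random

random.seed(0)

# Stream mini-batches, update a single weight w by SGD on 70% of each
# batch, and evaluate RMSE on the remaining 30% hold-out.
w = 0.0
lr = 0.01  # learning rate

def make_batch(n=100):
    # Synthetic data: y = 3x + Gaussian noise
    return [(x, 3.0 * x + random.gauss(0, 0.1))
            for x in (random.random() for _ in range(n))]

rmse_history = []
for _ in range(50):
    batch = make_batch()
    split = int(len(batch) * 0.7)
    train, test = batch[:split], batch[split:]
    for x, y in train:                      # one SGD pass over the train split
        w -= lr * 2 * (w * x - y) * x       # gradient of (w*x - y)^2
    mse = sum((w * x - y) ** 2 for x, y in test) / len(test)
    rmse_history.append(mse ** 0.5)

print(round(rmse_history[0], 3), round(rmse_history[-1], 3))
```

Early batches show a large hold-out RMSE while the weight is still far from 3; later batches settle near the noise level, which is the behavior described above.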
> This optimizer helps us move closer and closer to the optimal weights after every batch, and over time we will have a model that has learned how to represent our data and predicts well.
>
> In the scenario of using any MLlib algorithm and doing training with the DStream.transform() and DStream.foreachRDD() operations: when the first batch of data arrives we build a model, let's call it model1. Once you have model1 you can make predictions on the same DStream or a different DStream source. But if for the next batch you follow the same procedure and create a model, call it model2, this model2 can be significantly different from model1, depending on how different the data in the second DStream is from the first, because the model is not continually updated. It's as if the weight vectors jump from one place to another on every batch, and we never know whether the algorithm is converging to the optimal weights. So I believe it is not possible to do true online learning with the other MLlib models in Spark Streaming. I am not sure if this is because the models don't generally support streaming scenarios, or if the streaming versions simply haven't been implemented yet.
>
> Though technically you can use any of the MLlib algorithms in Spark Streaming with the procedure you mentioned and make predictions, it is important to figure out whether the model you choose can converge when shown only a subset (batches, i.e. DStreams) of the data over time. Depending on the algorithm, certain optimizers won't necessarily be able to converge by seeing only individual data points and need to see the majority of the data to learn optimal weights. In those cases, you can still do offline learning/training with Spark batch processing using any of the MLlib algorithms and save those models to HDFS. You can then start a streaming job, load the saved models into your streaming application, and make predictions.
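A minimal plain-Python sketch (again illustrative names, not Spark APIs) of the contrast described above: an online model that keeps updating one weight across batches settles near the true value, while refitting from scratch on each small, noisy batch yields weights that jump around with each batch's noise.

```python
import random

random.seed(1)

# Data: y = 3x + heavy noise. "online_w" is updated continually across
# batches; the per-batch fit is refit from scratch on each small batch,
# so its weight jumps with that batch's noise (model1 vs model2).
def make_batch(n=20):
    return [(x, 3.0 * x + random.gauss(0, 1.0))
            for x in (random.uniform(0, 1) for _ in range(n))]

online_w, lr = 0.0, 0.02
online_trace, perbatch_trace = [], []
for _ in range(100):
    batch = make_batch()
    for x, y in batch:                      # continual SGD updates
        online_w -= lr * 2 * (online_w * x - y) * x
    sxx = sum(x * x for x, _ in batch)      # least squares through the origin,
    sxy = sum(x * y for x, y in batch)      # fit on this batch alone
    online_trace.append(online_w)
    perbatch_trace.append(sxy / sxx)

def spread(ws):                             # how much the weight jumps, once warm
    tail = ws[50:]
    return max(tail) - min(tail)

print(round(spread(online_trace), 2), round(spread(perbatch_trace), 2))
```

The per-batch weights spread much more widely than the continually updated one, which is the "weight vectors jumping from one place to the other" effect described above.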
> This is traditional offline learning.
>
> In general, online learning is hard to evaluate, since we are not holding out any test data during model training. We are simply training the model and predicting. So in the initial batches, results can vary quite a bit and the predictions can have significant errors. Choosing online vs. offline learning therefore depends on how much tolerance the application has for wild predictions in the beginning. Offline training is simple and cheap, whereas online training can be hard and needs to be constantly monitored to see how it is performing.
>
> Hope this helps in understanding offline vs. online learning and which algorithms you can choose for online learning in MLlib.
>
> Guru Medasani
> gdm...@gmail.com
>
> > On Mar 5, 2016, at 7:37 PM, Lan Jiang <ljia...@gmail.com> wrote:
> >
> > Hi, there
> >
> > I hope someone can clarify this for me. It seems that some of the MLlib algorithms such as KMeans, Linear Regression and Logistic Regression have a streaming version, which can do online machine learning. But does that mean the other MLlib algorithms cannot be used in Spark Streaming applications, such as random forest, SVM, collaborative filtering, etc.?
> >
> > DStreams are essentially a sequence of RDDs. We can use the DStream.transform() and DStream.foreachRDD() operations, which allow you to access the RDDs in a DStream and apply MLlib functions to them. So it looks like all MLlib algorithms should be able to run in a streaming application. Am I wrong?
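The offline pattern described above (train once in batch mode, then only predict on the stream) can be sketched in plain Python, again with illustrative names rather than Spark APIs: the weight is fit on a large historical data set and never updated by incoming batches, so predictions are accurate from the very first batch.

```python
import random

random.seed(2)

# Offline training: fit a weight once on historical data, then only
# predict on each incoming batch; the model is never updated by the stream.
def draw(n):
    return [(x, 3.0 * x + random.gauss(0, 0.1))
            for x in (random.uniform(0, 1) for _ in range(n))]

history = draw(5000)                        # large offline training set
sxx = sum(x * x for x, _ in history)
sxy = sum(x * y for x, y in history)
w = sxy / sxx                               # least squares through the origin

def predict_on_batch(batch):
    return [w * x for x, _ in batch]        # predict only; no weight updates

stream_batch = draw(100)                    # a batch arriving on the "stream"
preds = predict_on_batch(stream_batch)
rmse = (sum((p - y) ** 2 for p, (_, y) in zip(preds, stream_batch))
        / len(preds)) ** 0.5
print(round(rmse, 3))
```

Unlike the online case, there is no warm-up period with wild predictions, at the cost of the model never adapting to drift in the stream.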
> > Lan
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org