Thanks, Guru. After reading the implementations of StreamingKMeans, 
StreamingLinearRegressionWithSGD and StreamingLogisticRegressionWithSGD, I 
reached the same conclusion. Unfortunately, the distinction between true 
online learning and offline learning is only implied in the documentation, so I 
was not sure whether my understanding was correct. Thanks for confirming this!

However, I have a different opinion on your last paragraph, that we cannot 
hold out test data during model training for online learning. Taking 
StreamingLinearRegressionWithSGD as an example, you can certainly split each 
micro-batch 70%/30% and evaluate on the held-out part using the RMSE. At the 
very beginning, the RMSE will be large. But as more and more micro-batches 
arrive, you should see the RMSE become smaller as the weights approach the 
optimum. IMHO, I don’t see much difference between online and offline learning 
when it comes to holding out test data.
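
To make the hold-out idea concrete, here is a minimal sketch in plain Python rather than Spark. Everything in it is an assumption for illustration: the synthetic data (y = 3x + 1 plus noise), the helper names make_batch, rmse, sgd_step and run, the learning rate and the batch size. It is not how StreamingLinearRegressionWithSGD is implemented; it only mimics the pattern of training on 70% of each micro-batch and scoring the remaining 30%.

```python
import math
import random


def make_batch(rng, n=100):
    """Simulate one micro-batch of (x, y) pairs from y = 3x + 1 + noise (assumed data)."""
    batch = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        batch.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return batch


def rmse(w, b, rows):
    """Root-mean-square error of the linear model w*x + b on the given rows."""
    return math.sqrt(sum((w * x + b - y) ** 2 for x, y in rows) / len(rows))


def sgd_step(w, b, rows, lr=0.1):
    """One gradient step on the mean squared error of this batch."""
    gw = sum((w * x + b - y) * x for x, y in rows) / len(rows)
    gb = sum((w * x + b - y) for x, y in rows) / len(rows)
    return w - lr * gw, b - lr * gb


def run(num_batches=200, seed=0):
    """Train on 70% of every micro-batch and record the RMSE on the other 30%."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    history = []
    for _ in range(num_batches):
        batch = make_batch(rng)
        rng.shuffle(batch)
        cut = int(0.7 * len(batch))
        train, test = batch[:cut], batch[cut:]
        w, b = sgd_step(w, b, train)       # update the existing weights
        history.append(rmse(w, b, test))   # score on data the step did not see
    return w, b, history


if __name__ == "__main__":
    w, b, history = run()
    print("first RMSE:", history[0], "last RMSE:", history[-1])
```

With these settings the first recorded RMSE should be far larger than the last one, which is the trend I would expect to watch for in a real streaming job.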

Lan

> On Mar 6, 2016, at 2:43 AM, Chris Miller <cmiller11...@gmail.com> wrote:
> 
> Guru: This is a really great response. Thanks for taking the time to explain 
> all of this. Helpful for me too.
> 
> 
> --
> Chris Miller
> 
> On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani <gdm...@gmail.com> wrote:
> Hi Lan,
> 
> Streaming K-Means, Linear Regression and Logistic Regression support online 
> machine learning, as you mentioned. Online machine learning is where the model 
> is trained and updated on every batch of streaming data. These models have 
> trainOn() and predictOn() methods, to which you simply pass the DStreams you 
> want to train the model on and the DStreams you want the model to predict on. 
> So when the next batch of data arrives, the model is trained and updated again. 
> In this case the model weights are continually updated, and hopefully the model 
> performs better in terms of convergence and accuracy over time. What we are 
> really doing in the online learning case is showing the model only a few 
> examples at a time (a stream of data) and updating the parameters in the case 
> of Linear and Logistic Regression, or the centers in the case of K-Means. For 
> Linear and Logistic Regression this is possible because the optimizer chosen 
> to minimize the cost function is Stochastic Gradient Descent. This optimizer 
> moves us closer and closer to the optimal weights after every batch, and over 
> time we will have a model that has learned how to represent our data and 
> predicts well.
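
For the K-Means case, the idea of updating the centers on every batch can be sketched as below. This is plain Python on 1-D points, not the StreamingKMeans code; the function update_centers and its running-count scheme are my own simplification, and no decay or forgetting of old data is modelled.

```python
def update_centers(centers, counts, batch):
    """Fold one batch of 1-D points into the existing centers.

    centers: current center positions
    counts:  number of points each center has absorbed so far
    batch:   new points from the current micro-batch
    """
    sums = [0.0] * len(centers)
    hits = [0] * len(centers)
    for x in batch:
        # Assign the point to its nearest current center.
        k = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        sums[k] += x
        hits[k] += 1

    new_centers, new_counts = [], []
    for c, n, s, m in zip(centers, counts, sums, hits):
        total = n + m
        # Weighted average of the old center and the newly assigned points;
        # a center that has never received a point stays where it is.
        new_centers.append((c * n + s) / total if total else c)
        new_counts.append(total)
    return new_centers, new_counts


if __name__ == "__main__":
    centers, counts = update_centers([0.0, 10.0], [0, 0], [1.0, 3.0, 9.0])
    print(centers, counts)
```

Starting from centers [0.0, 10.0] with zero counts, the batch [1.0, 3.0, 9.0] moves them to [2.0, 9.0]. Later batches then shift each center by less and less as its count grows.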
> 
> In the scenario of training any MLlib algorithm inside DStream.transform() 
> and DStream.foreachRDD() operations, when the first batch of data arrives we 
> build a model; let’s call it model1. Once you have model1, you can make 
> predictions on the same DStream or on a different DStream source. For the 
> next batch, if you follow the same procedure, you create another model; let’s 
> call it model2. Because the model is not being continually updated, model2 can 
> differ significantly from model1, depending on how different the data in the 
> second batch is from the data in the first. It’s as if the weight vectors 
> jump from one place to another with every batch, and we never know whether 
> the algorithm is converging to the optimal weights. So I believe it is not 
> possible to do true online learning with the other MLlib models in Spark 
> Streaming. I am not sure whether this is because those models don’t generally 
> support this streaming scenario or because the streaming versions simply 
> haven’t been implemented yet.
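
The "jumping weights" effect can be illustrated with a small simulation. This is a sketch under assumptions of my own: synthetic data from y = 2x plus noise, tiny batches of 10 points, a one-parameter model, and hypothetical helpers fit_from_scratch, sgd_update and compare. It does not use Spark; it only contrasts building a fresh model per batch with nudging one model per batch.

```python
import random
import statistics


def make_batch(rng, n=10):
    """One small micro-batch of (x, y) pairs from y = 2x + noise (assumed data)."""
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        rows.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
    return rows


def fit_from_scratch(rows):
    """Least-squares slope through the origin, using only this batch."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)


def sgd_update(w, rows, lr=0.1):
    """Nudge the existing slope with one gradient step on this batch."""
    grad = sum((w * x - y) * x for x, y in rows) / len(rows)
    return w - lr * grad


def compare(num_batches=300, seed=1):
    """Return (spread of per-batch refits, spread of the updated slope, final slope)."""
    rng = random.Random(seed)
    w = 0.0
    refit, incremental = [], []
    for _ in range(num_batches):
        batch = make_batch(rng)
        refit.append(fit_from_scratch(batch))  # model1, model2, ... each from one batch
        w = sgd_update(w, batch)               # one model, updated on every batch
        incremental.append(w)
    tail = num_batches // 3                    # look only at the last third of the run
    return (statistics.pstdev(refit[-tail:]),
            statistics.pstdev(incremental[-tail:]),
            w)


if __name__ == "__main__":
    refit_spread, incremental_spread, w = compare()
    print(refit_spread, incremental_spread, w)
```

In this toy setup the per-batch slopes should scatter much more widely than the continually updated slope, which should settle close to 2.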
> 
> Though technically you can use any of the MLlib algorithms in Spark Streaming 
> with the procedure you mentioned and make predictions, it is important to 
> figure out whether the model you choose can converge when shown only a 
> subset (batches of the DStream) of the data at a time. Depending on the 
> algorithm, certain optimizers won’t necessarily converge when shown only 
> individual data points and need to see the majority of the data to learn the 
> optimal weights. In these cases, you can still do offline learning/training 
> with Spark batch processing using any of the MLlib algorithms and save those 
> models on HDFS. You can then start a streaming job, load the saved models into 
> your streaming application and make predictions. This is traditional offline 
> learning.
> 
> In general, online learning is hard because it is hard to evaluate: we are 
> not holding out any test data during model training. We are simply training 
> the model and predicting. So in the initial batches, results can vary quite a 
> bit and predictions can have significant errors. Choosing between online 
> and offline learning therefore depends on how much tolerance the application 
> has for wild predictions in the beginning. Offline training is simple and 
> cheap, whereas online training can be hard and needs to be constantly 
> monitored to see how it is performing.
> 
> Hope this helps in understanding offline learning vs. online learning and 
> which algorithms you can choose for online learning in MLlib.
> 
> Guru Medasani
> gdm...@gmail.com
> 
> 
> 
> > On Mar 5, 2016, at 7:37 PM, Lan Jiang <ljia...@gmail.com> wrote:
> >
> > Hi, there
> >
> > I hope someone can clarify this for me. It seems that some of the MLlib 
> > algorithms, such as KMeans, Linear Regression and Logistic Regression, have 
> > a streaming version that can do online machine learning. But does that mean 
> > other MLlib algorithms, such as random forest, SVM, collaborative filtering, 
> > etc., cannot be used in Spark Streaming applications?
> >
> > DStreams are essentially a sequence of RDDs. We can use the 
> > DStream.transform() and DStream.foreachRDD() operations, which give you 
> > access to the RDDs in a DStream so that you can apply MLlib functions to 
> > them. So it looks like all MLlib algorithms should be able to run in a 
> > streaming application. Am I wrong?
> >
> > Lan
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> 
> 
