Thanks for this conclusive answer, Andy! Once more I am impressed by how well 
thought out scikit-learn really is!

Best,
Sebastian


> On Apr 3, 2015, at 2:14 PM, Andy <t3k...@gmail.com> wrote:
> 
> On 04/03/2015 01:39 PM, Sebastian Raschka wrote:
>> Was there a reason to implement the SGDClassifier with shuffling after 
>> each epoch rather than selecting a random training sample in each iteration? 
>> In terms of computational efficiency it should make a difference to generate 
>> a random number at each iteration or shuffle the training set once per 
>> epoch, right?
> Actually, we added the default shuffling recently.
> Arguably, the "standard" implementation of SGD is to shuffle the data 
> once before training, and never again.
> Proofs are usually done using iid sampling of points, though the plots 
> in the papers are usually done using either shuffling once at the 
> beginning or after each epoch,
> as that usually works better.
> If you look at talks by Francis Bach or Shalev-Shwartz they often 
> mention it and I also heard them say that they don't know why it works 
> better if the theory says to use iid sampling.
> I've actually seen a talk by Shalev-Shwartz where he was surprised that 
> he had to shuffle between iterations to get the convergence he proved.
> 
> The reason why you want to shuffle at least once in the beginning is 
> that SGD breaks if the data is sorted, say, by labels.
> We benchmarked shuffling once vs. shuffling every epoch, and since the 
> runtime is basically the same, we opted for shuffling every epoch.
> 
> Maybe mblondel or larsmans can give better answers, though ;)
> 
> Hth,
> Andy
> 
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming The Go Parallel Website, sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for all
> things parallel software development, from weekly thought leadership blogs to
> news, videos, case studies, tutorials and more. Take a look and join the 
> conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
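
For reference, the three sampling schemes discussed above can be sketched in
plain Python as index orderings. This is only an illustration, not
scikit-learn's actual implementation; the function names and the seed argument
are made up for the example. In scikit-learn itself, the per-epoch shuffling
is controlled by the shuffle parameter of SGDClassifier.

```python
import random


def iid_order(n_samples, n_epochs, seed=0):
    """Draw one random index per iteration, with replacement (iid sampling)."""
    rng = random.Random(seed)
    return [[rng.randrange(n_samples) for _ in range(n_samples)]
            for _ in range(n_epochs)]


def shuffle_once_order(n_samples, n_epochs, seed=0):
    """Shuffle the indices once before training and reuse that order."""
    rng = random.Random(seed)
    perm = list(range(n_samples))
    rng.shuffle(perm)
    return [list(perm) for _ in range(n_epochs)]


def shuffle_each_epoch_order(n_samples, n_epochs, seed=0):
    """Draw a fresh permutation of the indices at the start of every epoch."""
    rng = random.Random(seed)
    epochs = []
    for _ in range(n_epochs):
        perm = list(range(n_samples))
        rng.shuffle(perm)
        epochs.append(perm)
    return epochs
```

With either shuffling scheme every sample is visited exactly once per epoch,
whereas iid sampling may repeat some samples and skip others within an epoch.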

