Hi all,

I have a question about the implementation of stochastic gradient descent in SGDClassifier. Based on the documentation (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html), it sounds like the n_iter parameter refers to complete passes over the training dataset (epochs), and that the dataset is shuffled after each epoch via `shuffle=True`.
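To make sure I'm reading this correctly, here is a rough plain-NumPy sketch of what I understand n_iter and `shuffle=True` to mean (a toy linear model with squared loss and a made-up function name; this is just my mental model, not scikit-learn's actual code):

import numpy as np

def sgd_epoch_shuffling(X, y, n_iter=5, eta=0.01, seed=0):
    """SGD that shuffles once per epoch and then visits every sample once.

    Toy linear model with squared loss, purely for illustration.
    """
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for epoch in range(n_iter):            # n_iter = number of complete passes (epochs)
        for i in rng.permutation(len(y)):  # shuffle once, then sweep through all samples
            error = (w @ X[i] + b) - y[i]
            w -= eta * error * X[i]        # gradient of 0.5 * error**2 w.r.t. w
            b -= eta * error               # ... and w.r.t. b
    return w, b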
This is basically the stochastic gradient descent algorithm below (from Wikipedia):

• Choose an initial weight vector and learning rate.
• Randomly shuffle the examples in the training set.
• Repeat until an approximate minimum is obtained:
  • for i = 1, 2, ..., n do:
    • compute the gradient and update the weights

However, in the "standard" implementation of SGD, isn't each training sample picked at random for each iteration (where "iteration" means a single gradient step, not an epoch), i.e., random sampling with replacement (see the rough sketch in the PS below)?

• Choose an initial weight vector and learning rate.
• Repeat until an approximate minimum is obtained or the maximum number of iterations is reached:
  • Randomly select one sample from the training set.
  • compute the gradient and update the weights

This random selection of a sample at each iteration is described in, e.g.:

Bottou, Léon. "Large-scale machine learning with stochastic gradient descent." Proceedings of COMPSTAT'2010. Physica-Verlag HD, 2010. 177-186.

Zhang, Tong. "Solving large scale linear prediction problems using stochastic gradient descent algorithms." Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004.

Was there a reason to implement SGDClassifier with shuffling after each epoch rather than selecting a random training sample at each iteration? In terms of computational efficiency, it shouldn't really make a difference whether a random number is generated at each iteration or the training set is shuffled once per epoch, right?

Best,
Sebastian
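PS: For comparison, the sampling-with-replacement variant I have in mind would look roughly like this (same toy model and made-up names as in the sketch above; here n_iter counts individual gradient steps rather than epochs):

import numpy as np

def sgd_with_replacement(X, y, n_iter=1000, eta=0.01, seed=0):
    """SGD that draws one random sample (with replacement) per gradient step."""
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):        # n_iter = total number of gradient steps
        i = rng.randint(len(y))    # pick one training sample uniformly at random
        error = (w @ X[i] + b) - y[i]
        w -= eta * error * X[i]
        b -= eta * error
    return w, b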