2012/7/25 Fred Mailhot <[email protected]>:
> Hi all,
>
> I've got a text classification problem on which LogisticRegression
> consistently outperforms SGDClassifier(loss="log") by a few percentage
> points on the smallish [O(10^5) points] datasets I've been using for
> initial development/testing. The data set I'll ultimately be using for
> training is big enough [O(10^9) to begin, but incrementally increasing
> from there] that I'll want to do online learning with
> SGDClassifier.partial_fit()...
>
> What I want to know is whether I can train an initial LogisticRegression
> classifier, then use its coef_ to initialize an SGDClassifier(loss="log")
> that would subsequently be updated via partial_fit() as new/more data
> come in? Or is there stuff going on under the hood that would preclude
> this?
Alternatively, you could run k independent SGDClassifiers on random
permutations / subsamples of the dataset and average their coef_ vectors.
Averaging reduces the impact of SGDClassifier's stochasticity at the end of
convergence, decreasing the variance of the estimate and improving the test
set error. This is not real Averaged SGD, but it might still be better than
a single run.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
