2012/7/25 Fred Mailhot <[email protected]>:
> Hi all,
>
> I've got a text classification problem on which LogisticRegression
> consistently outperforms SGDClassifier(loss="log") by a few percentage
> points on the smallish [O(10^5) points] datasets I've been using for initial
> development/testing. The data set I'll ultimately be using for training is
> big enough [O(10^9) to begin but incrementally increasing from there] that
> I'll want to do online learning with SGDClassifier.partial_fit()...
>
> What I want to know is whether I can train an initial LogisticRegression
> classifier, then use its coef_ to initialize a SGDClassifier(loss="log")
> that would subsequently be updated via partial_fit() as new/more data come
> in? Or is there stuff going on under the hood that would preclude this?

Alternatively you could run k independent SGDClassifier instances on
random permutations / subsamples of the dataset and average their coef_
vectors. This reduces the impact of the stochasticity of SGDClassifier
at the end of convergence, decreases the variance of the estimate, and
should improve the test set error.

This is not true Averaged SGD, but it might still be better than a single run.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
