Comments inline. -Mani Kumar
On Tue, Dec 29, 2009 at 3:14 AM, Robin Anil <[email protected]> wrote:

> With a 50K set, you may or may not lose some features. It depends entirely
> on the data. If you don't mind answering: what is the number of categories
> that you have?

~50 categories

> I agree that retraining on 1 million docs is cumbersome. But if I remember
> correctly, I trained (CBayes) on a 3GB subset of Wikipedia on 6 Pentium-4
> HT systems in 20 mins.

That's fast.

> I don't know how big your data or your cluster is. But a daily 1-hour
> map/reduce job is not that expensive (maybe I am blind and have no sense of
> what is big after working at Google). I say, try and estimate it yourself.

A daily 1-hour job is not an issue, but a daily 6-8 hour job will be.

> On the other hand, you could also try a dual-fold approach: a sturdy
> classifier trained on 1 million docs and a classifier trained on the most
> recent 50K docs, with some form of voting between them.
>
> I am sure you will not be able to load the 1M model into memory; you might
> need to use HBase there. Instead, you can use the 50K model in memory for
> fast classification, then run a batch classification job daily to
> re-classify your dataset based on the 1M model.

Yes, I'll have to use HBase. Thanks!

> Robin
>
> On Tue, Dec 29, 2009 at 3:03 AM, Mani Kumar <[email protected]> wrote:
>
> > Thanks for the quick response.
> >
> > @Robin: I absolutely agree with your suggestion about using the 600 docs
> > for monitoring performance.
> >
> > Let's talk about bigger numbers, e.g. I have more than 1 million docs and
> > I get 10K new docs every day, out of which 6K are already classified.
> >
> > Monitoring performance is good, but it can be done weekly instead of
> > daily, just to reduce cost.
> >
> > I actually wanted to avoid retraining as much as possible because it
> > comes with a huge cost for a large dataset.
> > A better solution could be to use the 50K most recent docs from every
> > category (ordered by created_at desc), to reduce the amount of data and
> > stay tuned to the latest trends.
> >
> > Thanks a lot, guys.
> >
> > -Mani Kumar
> >
> > On Tue, Dec 29, 2009 at 1:22 AM, Ted Dunning <[email protected]> wrote:
> >
> > > On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <[email protected]> wrote:
> > >
> > > > Long answer: you can use your 600 docs to test the classifier and see
> > > > your accuracy. Then retrain with the entire document set and test a
> > > > test data set. So daily you can choose to include or exclude the 600
> > > > documents that come in, and ensure that you keep your classifier at
> > > > top performance. After some number of documents, you don't get much
> > > > benefit from retraining. Further training would only add overfitting
> > > > errors.
> > >
> > > The suggestion that the 600 new documents be used to monitor
> > > performance is an excellent one.
> > >
> > > It should be pretty easy to add a "train on incremental data" option to
> > > k-means.
> > >
> > > Also, the k-means algorithm will definitely reach a point of
> > > diminishing returns, but it should be very resistant to overtraining.
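For what it's worth, the dual-fold voting idea above could be sketched roughly like this. This is just an illustration of weighted voting between two models, not a Mahout API; the per-category score dicts and the `recent_weight` parameter are assumptions for the example:

```python
# Minimal sketch of the dual-fold idea: a sturdy model trained on 1M docs and
# a small model trained on the recent 50K docs each score a document, and a
# weighted vote picks the category. The score dicts and weight are
# illustrative assumptions, not Mahout APIs.

def vote(scores_large, scores_recent, recent_weight=0.3):
    """Combine per-category scores from the two models.

    scores_large, scores_recent: dicts mapping category -> score in [0, 1].
    recent_weight: how much the recent-50K model contributes to the vote.
    """
    combined = {}
    for category in set(scores_large) | set(scores_recent):
        combined[category] = (
            (1 - recent_weight) * scores_large.get(category, 0.0)
            + recent_weight * scores_recent.get(category, 0.0)
        )
    # Pick the category with the highest combined score.
    return max(combined, key=combined.get)

# Example: the large model prefers "sports", the recent model prefers "news".
large = {"sports": 0.7, "news": 0.2}
recent = {"sports": 0.3, "news": 0.9}
print(vote(large, recent))  # -> "sports" (the large model dominates at 0.3)
```

Raising `recent_weight` shifts the decision toward the recent-trends model, which is the knob you'd tune once both classifiers are in place.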

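And the "50K most recent docs per category" selection could look something like this in-memory sketch. The `(doc_id, category, created_at)` tuples are an assumed input shape; in practice this would likely be a SQL query or a map/reduce job over the corpus:

```python
# Sketch of selecting the N most recent docs per category (created_at desc)
# to build a smaller, trend-aware training set. Input tuples are an assumed
# shape for illustration.

from collections import defaultdict

def recent_per_category(docs, n=50000):
    """docs: iterable of (doc_id, category, created_at) tuples.

    Returns {category: [doc_id, ...]} with the n newest doc_ids per category.
    """
    by_category = defaultdict(list)
    for doc_id, category, created_at in docs:
        by_category[category].append((created_at, doc_id))
    selected = {}
    for category, items in by_category.items():
        # Sort newest first (created_at descending) and keep the top n.
        items.sort(reverse=True)
        selected[category] = [doc_id for _, doc_id in items[:n]]
    return selected

docs = [(1, "sports", "2009-12-01"), (2, "sports", "2009-12-28"),
        (3, "news", "2009-12-15")]
print(recent_per_category(docs, n=1))  # -> {'sports': [2], 'news': [3]}
```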