Another (hackish) idea to try would be to keep the labels of the extra data but give it a sample_weight low enough not to override your good training data.
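A minimal sketch of that idea, assuming a Tf-Idf + SVC setup like the one described in the quoted message. The toy texts, the labels and the `extra_weight` value are made up for illustration; the right weight would have to be tuned on held-out data from the good set.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical stand-ins for the two corpora.
clean_texts = ["cheap flights to rome", "football match tonight",
               "book a hotel in paris", "tennis final results"]
clean_labels = ["travel", "sport", "travel", "sport"]
extra_texts = ["rome paris trip", "match results", "hotel football"]
extra_labels = ["travel", "sport", "sport"]  # noisier labels

# Fit one Tf-Idf vocabulary over both sets and stack them.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(clean_texts + extra_texts)
y = np.array(clean_labels + extra_labels)

# Full weight for the good data, a small weight for the extra data.
extra_weight = 0.1  # assumption: pick this by validating on good data only
sample_weight = np.concatenate([
    np.ones(len(clean_texts)),
    np.full(len(extra_texts), extra_weight),
])

# A linear kernel is assumed here; fit() accepts per-sample weights.
clf = SVC(kernel="linear")
clf.fit(X, y, sample_weight=sample_weight)

predictions = clf.predict(vectorizer.transform(["hotel in rome"]))
```

With 2.000.000 extra examples against 90.000 good ones, even a small per-sample weight adds up, so it is probably worth trying several values of `extra_weight` rather than fixing one up front.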
On 09.07.2012, at 12:43, Philipp Singer <[email protected]> wrote:

> Hey!
>
> I am currently doing text classification. I have the following setup:
>
> 78 classes
> max 1500 train examples per class
> overall around 90.000 train examples
> same amount of test examples
>
> I am pretty happy with the classification results (~52% f1 score) which
> is fine for my task.
>
> But now I have another scenario. I have around 2.000.000 extra training
> examples available which are produced by a certain amount of users not
> _directly_ corresponding for the classes but I still know the labels of
> this data. If I train the classifier simply on this extra data (without
> the correct one) I can achieve a F1-score of ~25%. So this somehow tells
> me that there is information available that I now somehow want to
> incorporate to my existing data. For some few classes this data even
> works slightly better or at least similar.
>
> I have simply tried to combine both datasets (90.000 + 2.000.000) but
> this makes the results worse (test data amount always stays the same).
> This is not surprising because a lot of noise is added to the data and I
> think that the huge amount of extra data somehow overrules the existing one.
>
> My question now is, how I can incorporate this data the best in order to
> achieve better classification results than with my first dataset. Maybe
> someone has an idea or there are some techniques for that.
>
> Just for the record: I use Tf-Idf with a SVC which works best. I have
> also tried a different approach using topic models.
>
> Thanks and many regards,
> Philipp

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
