2012/7/25 Abhi <[email protected]>:
>
> Hello,
> Sorry for getting back late. I originally experimented with different
> classifiers including SGDClassifier; it seemed faster but much less
> accurate, about 93% for 30000 emails (and decreasing as the number of
> emails increases), but I have not tried the incremental approach yet.
> Will try it next.

The incremental approach should not change the outcome; it just makes it
possible not to load all of the data ahead of time. Have a look at the
sklearn/linear_model/sgd.py source code to see exactly what the difference
is.

For the accuracy issue, I suspect that you used the default n_iter value?
Maybe you can try to increase or decrease that value. You can also try
alternative parameters for the learning rate (see the doc for more
details):

  http://scikit-learn.org/stable/modules/sgd.html
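Untested sketch of what the two variants could look like, assuming the
scikit-learn version from this thread (where the parameter is still called
n_iter); X_train / y_train and iter_minibatches are placeholders for your
own vectorized data and batching code:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Batch fit: the default n_iter is often too low for sparse text data,
    # so it is worth trying larger values (and other learning rate schedules).
    clf = SGDClassifier(loss="hinge", n_iter=50)
    clf.fit(X_train, y_train)

    # Incremental fit: the same model, but fed mini-batches so the whole
    # dataset never has to sit in memory at once. partial_fit needs the
    # complete set of class labels on the first call.
    clf = SGDClassifier(loss="hinge")
    all_classes = np.unique(y_train)
    for X_batch, y_batch in iter_minibatches():  # placeholder generator
        clf.partial_fit(X_batch, y_batch, classes=all_classes)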
> During this time I have been facing segfaults in LinearSVC. If I use
> bi-grams in the vectorizer, the memory usage increases to more than
> double, and I get a segfault in classifier.fit. I tried reducing the
> number of features using SelectKBest so as to reduce the size (as shown
> in
> http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html)
>
> From the test run:
> I use TfidfVectorizer to extract features from the training and test
> datasets.
>
> [Train data] n_samples: 47237, n_features: 3118889
> [Test data]  n_samples: 23974, n_features: 3118889
>
> After that I select the k best features, but get a segfault at:
>
>     ch2 = SelectKBest(chi2, k=500)
> --> data_train = ch2.fit_transform(data_train, self.train_target)
>
> I used this method since I read somewhere (I came across the reference
> but forgot to bookmark the link) that the chi-squared test works well
> for selecting among sparse features. Does that, or my approach in
> general, seem correct?
>
> Thank you for the responses and your valuable input.

Could you please try to come up with one or two minimalistic reproduction
scripts for the ch2.fit_transform and LinearSVC.fit segfaults? Is it just
that they exhaust the memory on your system? Are you running a 32 bit or a
64 bit OS? How much physical memory do you have on your machine?

You can push the scripts to https://gist.github.com (note that gists are
git repositories, so you can push data files there with git too).

As a temporary workaround you can pass a max_features parameter to the
vectorizer so as to limit the feature mapping to the most frequent features
over the training corpus. For instance you can try values in the range of
max_features=100000 or max_features=1000000.
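For what it's worth, here is a rough, untested skeleton that could serve
both as a starting point for the reproduction script and as an
illustration of the max_features workaround. load_emails() is a
placeholder for your own data loading, and the parameter values are only
examples:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.svm import LinearSVC

    # Placeholder: return the raw training texts and their labels.
    texts_train, y_train = load_emails()

    # Cap the vocabulary to the most frequent terms to bound memory usage.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
    X_train = vectorizer.fit_transform(texts_train)

    # Chi-squared univariate feature selection on the sparse tf-idf matrix.
    ch2 = SelectKBest(chi2, k=500)
    X_train = ch2.fit_transform(X_train, y_train)

    clf = LinearSVC()
    clf.fit(X_train, y_train)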
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel