Ranjana,
have a look at this example:
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
Since you have a lot of RAM, you may not need to make the whole
classification pipeline out-of-core. A start with your current code
could be to write a generator that loads and pre-processes the text in
chunks, then feeds it one document at a time to CountVectorizer.fit (it
accepts an iterable). To reduce memory usage, filtering out tokens that
are too frequent (instead of the infrequent ones) could help too. Make
sure you L2-normalize your data before the classifier. You could use
SGDClassifier(loss='log') or LogisticRegression with a sag or saga
solver. The multi_class="multinomial" parameter might also be worth
trying, particularly since you have so many classes.
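
A minimal sketch of the idea above, under stated assumptions: the toy
chunks stand in for pieces read from disk (for example with
pd.read_csv(..., chunksize=...)), and the iter_documents helper, the
max_df value, and the labels are all illustrative, not taken from your
code. It uses LogisticRegression with the saga solver rather than
SGDClassifier, since the name of the logistic loss for SGDClassifier has
changed between scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Normalizer


def iter_documents(chunks):
    """Yield one pre-processed document at a time from an iterable of chunks."""
    for chunk in chunks:
        for text in chunk:
            # Hypothetical pre-processing step: lowercase only.
            yield text.lower()


# Toy stand-in for chunks loaded from disk (assumption for illustration).
chunks = [
    ["Product red cotton shirt", "Product blue cotton shirt"],
    ["Product steel kitchen knife", "Product steel kitchen fork"],
]
labels = ["clothing", "clothing", "kitchen", "kitchen"]

# max_df drops tokens whose document frequency is above the threshold;
# here "product" occurs in every document and is filtered out.
vectorizer = CountVectorizer(max_df=0.9)
vectorizer.fit(iter_documents(chunks))

# The generator is rebuilt for the second pass; the result is a sparse matrix.
X = vectorizer.transform(iter_documents(chunks))

# L2-normalize each row before handing it to the linear classifier.
X = Normalizer(norm="l2").fit_transform(X)

clf = LogisticRegression(solver="saga", max_iter=1000, random_state=0)
clf.fit(X, labels)
```

With real data only one chunk of raw text needs to be in memory at a
time during vectorization; the sparse document-term matrix itself still
has to fit in RAM.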
--
Roman
On 19/12/17 15:38, Ranjana Girish wrote:
Hi all,
I am doing text classification. I have around 10 million records to
classify into around 7k categories.
Below is the code I am using:
# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(200000000)

random.seed(20000)

trainset1=pd.read_csv("trainsetgrt500sample10.csv",encoding = "ISO-8859-1")
trainset2=pd.read_csv("trainsetlessequal500.csv",encoding = "ISO-8859-1")

dataset=pd.concat([trainset1,trainset2])

dataset=dataset.dropna()

dataset['ProductDescription']=dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
dataset['ProductDescription']=dataset['ProductDescription'].str.replace('[\d]', ' ')
dataset['ProductDescription']=dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

dataset['ProductDescription']=dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription']=dataset['ProductDescription'].str.replace('\s\s+',' ')
dataset['ProductDescription']=dataset['ProductDescription'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: list(set([lemmatizer.lemmatize(item,tag) for item in x])))
dataset['ProductDescription']=dataset['ProductDescription'].apply(lambda x : " ".join(x))

countvec = CountVectorizer(min_df=0.00008)
documenttermmatrix=countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column=countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train=dataset['classpath']
y_train=dataset['classpath'].tolist()
labels_train= preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train=labels_train.transform(y_train)

del dataset
del countvec
del column

clf = MultinomialNB()
model=clf.fit(documenttermmatrix,y_train)

filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))

I am using a system with 128 GB RAM.
As I was unable to train on all 10 million records, I did stratified
sampling, which reduced the training set to 2.3 million.
Still, I was unable to train on the 2.3 million records.
I got a memory error when I used random forest (n_estimators=30), Naive
Bayes, and SVM.

I am stuck.

Can anyone please tell me whether there is any memory leak in my code and
how to use a system with 128 GB RAM effectively?
Thanks
Ranjana
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn