Ranjana,

have a look at this example http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
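
For reference, the core pattern in that example is a stateless HashingVectorizer plus a classifier trained with partial_fit on mini-batches. Below is a minimal, untested sketch of that pattern adapted to your column names ('ProductDescription', 'classpath'); the file names 'trainset.csv' and 'classes.txt', the chunk size and n_features are only placeholders:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless (no vocabulary to fit), so it never
# needs to see the whole corpus at once.
vectorizer = HashingVectorizer(n_features=2**18, stop_words='english',
                               norm='l2')

# loss='log' gives a logistic-regression model that supports partial_fit,
# so it can be updated one mini-batch at a time.
clf = SGDClassifier(loss='log')

# partial_fit needs the full list of classes on its first call;
# 'classes.txt' is a hypothetical file listing your ~7k categories.
all_classes = np.loadtxt('classes.txt', dtype=str)

# 'trainset.csv' and the chunk size are placeholders; the column names
# are the ones from your script.
for chunk in pd.read_csv('trainset.csv', encoding='ISO-8859-1',
                         chunksize=10000):
    chunk = chunk.dropna(subset=['ProductDescription', 'classpath'])
    X = vectorizer.transform(chunk['ProductDescription'])
    clf.partial_fit(X, chunk['classpath'], classes=all_classes)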

Since you have a lot of RAM, you may not need to make the whole classification pipeline out-of-core. A first step with your current code could be to write a generator that loads and pre-processes the text in chunks and feeds it one document at a time to CountVectorizer.fit (it accepts any iterable). To reduce memory usage, filtering out the too frequent tokens (instead of the infrequent ones) could help too.

Make sure you L2-normalize your data before the classifier. You could use SGDClassifier(loss='log') or LogisticRegression with the 'sag' or 'saga' solver; the multi_class='multinomial' parameter might also be worth trying, particularly since you have so many classes.
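
For example, here is a rough, untested sketch of that route (the file name 'trainset.csv', the chunk size and the max_df value are placeholders; the column names are taken from your script):

import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import SGDClassifier

def iter_documents(csv_path, chunksize=10000):
    # Load and pre-process the text in chunks, yielding one cleaned
    # document at a time so the raw corpus never sits fully in memory.
    for chunk in pd.read_csv(csv_path, encoding='ISO-8859-1',
                             chunksize=chunksize):
        for doc in chunk['ProductDescription'].fillna(''):
            # minimal cleaning; plug your lemmatization in here instead
            yield re.sub('[^a-zA-Z]+', ' ', doc).lower()

# max_df drops the most frequent tokens, min_df the rarest ones;
# 0.5 is only an illustrative value.
countvec = CountVectorizer(min_df=0.00008, max_df=0.5)
X = countvec.fit_transform(iter_documents('trainset.csv'))

# L2-normalize the rows before the linear classifier.
X = Normalizer(norm='l2').fit_transform(X)

# Labels read separately; rows stay aligned because the generator above
# does not drop any row.
y = pd.read_csv('trainset.csv', encoding='ISO-8859-1',
                usecols=['classpath'])['classpath']

clf = SGDClassifier(loss='log')
# or: LogisticRegression(solver='saga', multi_class='multinomial')
clf.fit(X, y)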

--
Roman

On 19/12/17 15:38, Ranjana Girish wrote:
Hi all,

I am doing text classification. I have around 10 million documents to be
classified into around 7k categories.

Below is the code I am using

# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(200000000)

random.seed(20000)

trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")

dataset = pd.concat([trainset1, trainset2])

dataset = dataset.dropna()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[\d]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('\s\s+', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

countvec = CountVectorizer(min_df=0.00008)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train = dataset['classpath']
y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)

del dataset
del countvec
del column

clf = MultinomialNB()
model = clf.fit(documenttermmatrix, y_train)

filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))
I am using a system with 128 GB RAM.

As I was unable to train on all 10 million documents, I did stratified
sampling and the training set was reduced to 2.3 million.

Still, I was unable to train on the 2.3 million documents.

I got a memory error when I used random forest (n_estimators=30), Naive
Bayes and SVM.


I am stuck.

Can anyone please tell me whether there is a memory leak in my code, and
how to use a system with 128 GB RAM effectively?


Thanks
Ranjana



_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

