Hi all, I am doing text classification. I have around 10 million records to be classified into around 7k categories.
Below is the code I am using:

# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(200000000)
random.seed(20000)

# Load the two training files and concatenate them
trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")
dataset = pd.concat([trainset1, trainset2])
dataset = dataset.dropna()

# Keep only letters, drop digits, and lower-case the text
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ', regex=True)
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'[\d]', ' ', regex=True)
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()
del trainset1
del trainset2

# Remove English stopwords and collapse repeated whitespace
stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ', regex=True)
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\s\s+', ' ', regex=True)

# Tokenize and lemmatize once per part-of-speech tag
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

# Build the sparse document-term matrix and save the vocabulary
countvec = CountVectorizer(min_df=0.00008)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

# Encode the class labels
y_train = dataset['classpath']
y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)
del dataset
del countvec
del column

# Train Multinomial Naive Bayes and save the model
clf = MultinomialNB()
model = clf.fit(documenttermmatrix, y_train)
filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))

I am using a system with 128 GB RAM. As I was unable to train on all 10 million records, I did stratified sampling and the training set was reduced to 2.3 million records. Even then I was unable to train on the 2.3 million records: I got a memory error with random forest (n_estimators=30), Naive Bayes, and SVM.

I am stuck. Can anyone please tell me whether there is any memory leak in my code, and how to use a system with 128 GB RAM effectively?

Thanks,
Ranjana
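P.S. For reference, the stratified sampling step (not shown in the code above) was done roughly like the sketch below. The input/output file names and the 0.23 fraction are placeholders rather than the exact values used (the fraction is just 2.3M/10M); only the 'classpath' column name comes from the code above.

# Minimal sketch of per-class (stratified) sampling: draw the same fraction
# of rows from every category so the class distribution is preserved.
import pandas as pd

full = pd.read_csv("trainset_full.csv", encoding="ISO-8859-1")   # placeholder file name
sampled = full.groupby('classpath').sample(frac=0.23, random_state=20000)  # ~23% of each class
sampled.to_csv("trainset_sampled.csv", index=False)               # placeholder output name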