Ranjana,

have a look at this example http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
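
For reference, the core pattern in that example is a stateless HashingVectorizer plus a classifier trained with partial_fit on mini-batches. Below is a minimal, untested sketch of that pattern adapted to your column names ('ProductDescription', 'classpath'); the file names 'trainset.csv' and 'classes.txt', the chunk size and n_features are only placeholders:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless (no vocabulary to fit), so it never
# needs to see the whole corpus at once.
vectorizer = HashingVectorizer(n_features=2**18, stop_words='english',
                               norm='l2')

# loss='log' gives a logistic-regression model that supports partial_fit,
# so it can be updated one mini-batch at a time.
clf = SGDClassifier(loss='log')

# partial_fit needs the full list of classes on its first call;
# 'classes.txt' is a hypothetical file listing your ~7k categories.
all_classes = np.loadtxt('classes.txt', dtype=str)

# 'trainset.csv' and the chunk size are placeholders; the column names
# are the ones from your script.
for chunk in pd.read_csv('trainset.csv', encoding='ISO-8859-1',
                         chunksize=10000):
    chunk = chunk.dropna(subset=['ProductDescription', 'classpath'])
    X = vectorizer.transform(chunk['ProductDescription'])
    clf.partial_fit(X, chunk['classpath'], classes=all_classes)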

Since you have a lot of RAM, you may not need to make the whole classification pipeline out-of-core. A first step with your current code could be to write a generator that loads and pre-processes the text in chunks and feeds it one document at a time to CountVectorizer.fit (it accepts any iterable). To reduce memory usage, filtering out the too frequent tokens (instead of the infrequent ones) could help too.

Make sure you L2-normalize your data before the classifier. You could use SGDClassifier(loss='log') or LogisticRegression with the 'sag' or 'saga' solver; the multi_class='multinomial' parameter might also be worth trying, particularly since you have so many classes.
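
For example, here is a rough, untested sketch of that route (the file name 'trainset.csv', the chunk size and the max_df value are placeholders; the column names are taken from your script):

import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import SGDClassifier

def iter_documents(csv_path, chunksize=10000):
    # Load and pre-process the text in chunks, yielding one cleaned
    # document at a time so the raw corpus never sits fully in memory.
    for chunk in pd.read_csv(csv_path, encoding='ISO-8859-1',
                             chunksize=chunksize):
        for doc in chunk['ProductDescription'].fillna(''):
            # minimal cleaning; plug your lemmatization in here instead
            yield re.sub('[^a-zA-Z]+', ' ', doc).lower()

# max_df drops the most frequent tokens, min_df the rarest ones;
# 0.5 is only an illustrative value.
countvec = CountVectorizer(min_df=0.00008, max_df=0.5)
X = countvec.fit_transform(iter_documents('trainset.csv'))

# L2-normalize the rows before the linear classifier.
X = Normalizer(norm='l2').fit_transform(X)

# Labels read separately; rows stay aligned because the generator above
# does not drop any row.
y = pd.read_csv('trainset.csv', encoding='ISO-8859-1',
                usecols=['classpath'])['classpath']

clf = SGDClassifier(loss='log')
# or: LogisticRegression(solver='saga', multi_class='multinomial')
clf.fit(X, y)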

--
Roman

On 19/12/17 15:38, Ranjana Girish wrote:
Hi all,

I am doing text classification. I have around 10 million documents to be
classified into around 7k categories.

Below is the code I am using

# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(200000000)

random.seed(20000)

trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")

dataset = pd.concat([trainset1, trainset2])

dataset = dataset.dropna()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[\d]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('\s\s+', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

countvec = CountVectorizer(min_df=0.00008)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train = dataset['classpath']
y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)

del dataset
del countvec
del column

clf = MultinomialNB()
model = clf.fit(documenttermmatrix, y_train)

filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))
I am using a system with 128 GB RAM.

As I was unable to train on all 10 million documents, I did stratified
sampling and the training set was reduced to 2.3 million.

Still, I was unable to train on the 2.3 million documents.

I got a memory error when I used random forest (n_estimators=30), Naive
Bayes and SVM.


I am stuck.

Can anyone please tell me whether there is a memory leak in my code, and
how to use a system with 128 GB RAM effectively?


Thanks
Ranjana



_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

