Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Guillaume Lemaître
I think that you could use imbalanced-learn for the issue that you have
with y.
You should be able to wrap your clustering inside the FunctionSampler
(https://github.com/scikit-learn-contrib/imbalanced-learn/pull/342; we are
on the way to merging it).
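
A minimal sketch of what that wrapping could look like (assuming the
FunctionSampler API from that PR; the clustering-based filter below is
made up for illustration):

import numpy as np
from sklearn.cluster import KMeans
from imblearn import FunctionSampler

def cluster_filter(X, y):
    # cluster X, then drop the samples farthest from their centroid
    km = KMeans(n_clusters=3, random_state=0).fit(X)
    dist = np.min(km.transform(X), axis=1)
    keep = dist < np.percentile(dist, 90)
    return X[keep], y[keep]

sampler = FunctionSampler(func=cluster_filter)
X_res, y_res = sampler.fit_sample(X, y)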

On 19 December 2017 at 13:44, Manuel Castejón Limas <
manuel.caste...@gmail.com> wrote:

> Dear all,
>
> Kudos to scikit-learn! Having said that, Pipeline is killing me: it cannot
> transform anything other than X.
>
> My current use case would need:
> - Transformers able to handle both X and y, e.g. clustering X and y
> concatenated
> - Pipeline being able to change other parameters, e.g. sample_weight
>
> Currently, I'm augmenting X at every step with the extra information, which
> seems to work for my_pipe.fit_transform(X_train, y_train) but breaks on
> my_pipe.transform(X_test) for lack of the y parameter. I could inherit from
> the Pipeline class and modify a descendant to allow the y parameter, which
> is not ideal, but I guess it is an option. The gritty part comes when having
> to adapt every regressor at the end of the ladder to split the extra
> information from the raw data in X, and not being able to generate more
> than one by-product from each preprocessing step.
>
> My current research involves clustering the data and using that clustering
> along with X to predict outliers, which generates sample_weight info, and I
> would love to use that in the final regressor. Currently there seems to be
> no option other than pasting that info onto X.
>
> All in all, I'm stuck with this API limitation and I would love to learn
> some tricks from you if you could enlighten me.
>
> Thanks in advance!
>
> Manuel Castejón-Limas
>


-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/


Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Christos Aridas
Hey Manuel,

In imbalanced-learn we have an extra type of estimator, called a Sampler,
which is able to modify X and y at the same time through two new API
methods, sample and fit_sample.
We have also adopted a modified version of scikit-learn's Pipeline class
that allows chaining samplers together with regular transformers.
Even though the package focuses on imbalanced datasets, these objects may
help with your pipeline.
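
For instance, something along these lines (a sketch; RandomUnderSampler
stands in for any sampler):

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('sample', RandomUnderSampler()),  # modifies X and y during fit only
    ('clf', LogisticRegression()),
])
pipe.fit(X_train, y_train)  # the sampler is applied here
pipe.predict(X_test)        # and skipped at predict time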

Cheerz,
Chris



Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Eager to learn! Diving into the code right now!

Thanks for the tip!
Manuel



Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Wow, that seems promising. I'll read the imbalanced-learn code with interest.
Thanks for the info!
Manuel




[scikit-learn] Text classification of large dataset

2017-12-19 Thread Ranjana Girish
Hi all,

I am doing text classification. I have around 10 million documents to be
classified into around 7k categories.

Below is the code I am using:

# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(2)  # note: the argument looks truncated in the original mail

random.seed(2)

trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")

dataset = pd.concat([trainset1, trainset2])
dataset = dataset.dropna()

# Keep letters only and lower-case the text
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[\d]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

# Strip stop words, collapse whitespace, then tokenize
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(
    r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('\s\s+', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)

# Lemmatize every token under each part-of-speech tag
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

# Caution: a float min_df keeps only terms present in at least 80% of all
# documents; max_df may have been intended here.
countvec = CountVectorizer(min_df=0.8)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)

del dataset
del countvec
del column

clf = MultinomialNB()
# Note: this fits on the raw labels, so the encoded y1_train goes unused.
model = clf.fit(documenttermmatrix, y_train)

filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))


I am using a system with 128 GB of RAM.

As I was unable to train on all 10 million documents, I did stratified
sampling, which reduced the training set to 2.3 million documents.

Still, I was unable to train on the 2.3 million documents: I got a memory
error with random forest (n_estimators=30), Naive Bayes, and SVM.

I am stuck. Can anyone please tell me whether there is a memory leak in my
code, and how to use a system with 128 GB of RAM effectively?


Thanks
Ranjana


Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Jeffrey Levesque via scikit-learn
Hi guys,
I'm currently developing a web interface and a programmatic REST API for
sklearn. I currently have SVM and SVR available, with some parameters such
as C and gamma exposed:

- https://github.com/jeff1evesque/machine-learning

I'm working on improving the web interface at the moment. Since you're
working with SVMs, maybe you'd have time to try my project and give me some
feedback? I hope to expand the toolset to things like ensemble learning
and, as a long shot, neural networks. But that may take some time.

Thank you,

Jeff Levesque
https://github.com/jeff1evesque



Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Gael Varoquaux
With so few data points, there is huge uncertainty in the estimate of
prediction accuracy obtained by cross-validation. This isn't a problem with
the method; it is a basic limitation of the small amount of data. I've
written a paper on this problem in the specific context of neuroimaging:
https://www.sciencedirect.com/science/article/pii/S1053811917305311
(preprint: https://hal.inria.fr/hal-01545002/).

I expect that what you are seeing is sampling noise: the result has
confidence intervals larger than 10%.
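
A quick back-of-the-envelope check (a sketch, assuming a simple binomial
model for the roughly 48 test decisions per scheme):

from scipy.stats import binom

n = 48  # test decisions: 24 exemplars per class x 2 classes
lo, hi = binom.ppf([0.025, 0.975], n, 0.5) / n
print(lo, hi)  # about 0.35 to 0.65 for a chance-level classifier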

Gaël




-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


[scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Taylor, Johnmark
Hello,

I am a researcher in fMRI and am using SVMs to analyze brain data. I am
decoding between two classes, each of which has 24 exemplars. I am
comparing two different cross-validation methods for my data: in one, I
train on 23 exemplars from each class and test on the remaining example
from each class; in the other, I train on 22 exemplars from each class and
test on the remaining two from each class. (In case it matters, the data
is structured into different neuroimaging "runs", with each "run"
containing several "blocks"; the first cross-validation method leaves out
one block at a time, the second leaves out one run at a time.)
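
In scikit-learn terms, the two schemes could be expressed roughly as
follows (a sketch; X, y, blocks and runs stand for the data and
hypothetical per-sample group labels):

from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

logo = LeaveOneGroupOut()
# leave out one block at a time
acc_block = cross_val_score(SVC(C=1), X, y, groups=blocks, cv=logo).mean()
# leave out one run at a time
acc_run = cross_val_score(SVC(C=1), X, y, groups=runs, cv=logo).mean()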

Now, I would've thought that these two CV methods would be very similar,
since the vast majority of the training data is the same; the only
difference is in adding two additional points. However, they are yielding
very different results: training on 23 per class is yielding 60% decoding
accuracy (averaged across several subjects, and statistically significantly
greater than chance), training on 22 per class is yielding chance (50%)
decoding. Leaving aside the particulars of fMRI in this case: is it unusual
for single points (amounting to less than 5% of the data) to have such a
big influence on SVM decoding? I am using a cost parameter of C=1. I must
say it is counterintuitive to me that just a couple points out of two dozen
could make such a big difference.

Thank you very much, and cheers,

JohnMark


Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Joel Nothman
At a glance, and perhaps not knowing imbalanced-learn well enough, I have
some doubts that it will provide an immediate solution for all your needs.

At the end of the day, the Pipeline keeps its scope relatively tight, but
it should not be so hard to implement something for your own needs if your
case does not fit what Pipeline supports.
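
One possible direction (a rough sketch, not scikit-learn API; the names
XYPipeline and transform_xy are made up) is a Pipeline subclass whose
transform path carries y along with X:

from sklearn.pipeline import Pipeline

class XYPipeline(Pipeline):
    """Hypothetical pipeline whose steps may reshape X and y together."""
    def transform_xy(self, X, y):
        for name, step in self.steps[:-1]:
            # each step is assumed to implement transform_xy(X, y) -> (X, y)
            X, y = step.transform_xy(X, y)
        return X, y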



Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Jacob Vanderplas
Hi JohnMark,
SVMs, by design, are quite sensitive to the addition of single data points
– but only if those data points happen to lie near the margin. I wrote
about some of those types of details here:
https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html
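
As a small illustration (a sketch on synthetic data), dropping every point
that is not a support vector and refitting leaves the decision function
unchanged, so only the points near the margin matter:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.2, random_state=0)
clf = SVC(kernel='linear', C=1).fit(X, y)
sv = clf.support_  # indices of the support vectors
clf_sv = SVC(kernel='linear', C=1).fit(X[sv], y[sv])
# expected to print True: the two boundaries coincide
print(np.allclose(clf.decision_function(X), clf_sv.decision_function(X)))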


Hope that helps,
   Jake

 Jake VanderPlas
 Senior Data Science Fellow
 Director of Open Software
 University of Washington eScience Institute

>


[scikit-learn] Feature selection with words.

2017-12-19 Thread Luigi Lomasto
Hi all,

I'm working on text classification, classifying Wikipedia documents. I am
using a word-count approach to extract features from my text, so after
lemmatization and stop-word removal I obtain a big vocabulary that contains
every word in the training documents. Now I have 7 features. I think that
for word-based problems like this it is not good to do feature selection
(with SVD or PCA). The current accuracy is 77%.

Do you think I need to do feature selection to improve the accuracy?

Thank you for your answers. Regards.

Luigi





Re: [scikit-learn] Feature selection with words.

2017-12-19 Thread Joel Nothman
It depends on what the set of classes is. The best way to find out is to try it...
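
For instance, something along these lines (an untested sketch; X_counts
and y stand for the word-count matrix and the labels):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

pipe = Pipeline([('select', SelectKBest(chi2, k=5000)),
                 ('clf', MultinomialNB())])
print(cross_val_score(pipe, X_counts, y, cv=5).mean())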
