I apologize, but I don't follow how that will help much. The extra features
come from the original dataset that is fed to scikit-learn's vectorizers.
For the vectorizers to work properly, those features have to be stripped out
first; I simply want to add them back in before any CV splitting. However,
since, as you say, GridSearchCV applies the cross-validation split before the
vectorizers run, this seems impossible without some complex method of tracking
the indices and reapplying them. It seems like this should be straightforward
to implement, as it is strange to assume that a dataset will only contain text
and no external information.
Are there any existing solutions for this?
________________________________
From: Anders Aagaard [aagaa...@gmail.com]
Sent: Friday, August 22, 2014 9:32 AM
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] Pipeline - Convert to dense
If X2 doesn't have the same ordering, you wouldn't be able to pass it directly
either. The data is split before being run through the pipeline, so just using
hstack is fine.
I've got the code I use to make this easier here, by the way:
https://github.com/andaag/scikit_helpers
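A minimal sketch of the point that the data is split before it reaches the pipeline: if the text and the extra features travel together in one X, every fold slices them by the same row indices, so no index bookkeeping is needed. This assumes a recent scikit-learn (`sklearn.model_selection`, `FunctionTransformer`); the toy data and the helper names `get_text`, `get_extra` and `to_dense` are made up for illustration and are not taken from the linked repository.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data (invented for this sketch): one text field and one numeric field.
texts = ["good movie", "bad movie", "great film", "awful film",
         "good film", "bad plot", "great plot", "awful movie"]
extra = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Keep both kinds of feature in one object array: column 0 is text,
# column 1 is numeric. CV splits index rows, so the columns stay aligned.
X = np.empty((len(texts), 2), dtype=object)
X[:, 0] = texts
X[:, 1] = extra


def get_text(X):
    """Hypothetical helper: return the text column as a 1-D array."""
    return X[:, 0]


def get_extra(X):
    """Hypothetical helper: return the numeric column as an (n, 1) float array."""
    return X[:, [1]].astype(float)


def to_dense(X):
    """Hypothetical helper: convert a sparse matrix to a dense array."""
    return X.toarray() if hasattr(X, "toarray") else np.asarray(X)


features = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(get_text)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("extra", FunctionTransformer(get_extra)),
])

pipeline = Pipeline([
    ("features", features),                       # sparse tf-idf + extra column
    ("to_dense", FunctionTransformer(to_dense)),  # sparse -> dense
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

search = GridSearchCV(pipeline, {"clf__max_depth": [2, None]}, cv=2)
search.fit(X, y)
```

The dense step is included only because the thread asks for one; whether it is needed depends on the classifier and the scikit-learn version in use.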
On Fri, Aug 22, 2014 at 7:44 AM, Sebastian Okser <seo...@utu.fi> wrote:
Hey,
Your tip put me very far in the right direction. I have one further question.
It seems that just appending the features in the FeatureUnion as a pipeline
step may create havoc: when I use GridSearchCV, the order of the data will be
shuffled to some degree compared to the original dataset. Therefore, doing
something simple like:
    import numpy as np

    class FeatureStacker():
        """Class adds features to the data matrix"""

        def __init__(self, X2):
            """Initialization function; X2 is the array of features to be added"""
            self.X2 = X2

        # Other required functions in a transformer

        def transform(self, X):
            """Horizontally stacks elements X and X2"""
            return np.hstack((X, self.X2))
where X2 holds the new features, wouldn't necessarily have the same data
ordering and would hurt the training. Is there any way to preserve the splits
and then apply them to the new data? For example, assume that X =
array([[1,2,3],[4,5,6],[7,8,9]]) and X2 = array([[11],[12],[13]]). Normally I
would want to hstack them so that X_new =
array([[1,2,3,11],[4,5,6,12],[7,8,9,13]]), but if the order has been shuffled
because the instances fall into different folds, I don't think it can be
guaranteed that the X_new matrix will be generated in the same way as listed
above.
In short, how can I account for GridSearchCV's CV when doing the hstack in
this situation? Thanks.
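The concern can be reproduced with the arrays from the example. The sketch below is only an illustration: the fold indices in `train_idx` are made up and no scikit-learn splitter is involved. It shows that stacking a fold against the full X2 fails on the row count, while stacking before the split keeps each row paired with its own extra feature.

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
X2 = np.array([[11], [12], [13]])

# Hypothetical fold: a CV splitter passes only these rows, in this order.
train_idx = np.array([2, 0])

# Stacking the fold against the full X2 fails, because the row counts differ.
try:
    np.hstack((X[train_idx], X2))
    mismatch = False
except ValueError:
    mismatch = True

# Stacking first and splitting afterwards keeps the rows aligned.
X_new = np.hstack((X, X2))
fold = X_new[train_idx]
print(fold)
```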
Sebastian
________________________________________
From: Sebastian Raschka [se.rasc...@gmail.com]
Sent: Thursday, August 21, 2014 10:41 PM
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] Pipeline - Convert to dense
Hi Zoraida,
Thanks for the follow-up! I went with a short, custom ColumnSelector class, but
the itemgetter is even nicer.
Best,
Sebastian
On Aug 21, 2014, at 2:57 PM, ZORAIDA HIDALGO SANCHEZ
<zoraida.hidalgosanc...@telefonica.com> wrote:
> Sebastian,
>
> a few days ago, I asked a very similar question and I got this link as a
> response:
>
> https://github.com/scikit-learn/scikit-learn/issues/2034
>
>
> I think that you could try something similar.
>
>
> Best,
>
> Zoraida.-
>
> On 21/08/14 18:48, "Sebastian Okser" <seo...@utu.fi> wrote:
>
>> I am trying to use a Pipeline combining a CountVectorizer, a
>> TfidfTransformer and a random forest. However, the output of the second
>> step is a sparse matrix and the random forest requires a dense one. How can
>> I add a step that converts the matrix from sparse to dense, using
>> something along the lines of data.toarray()? Additionally, I would like
>> to add some extra features to the dataset after the text has been
>> processed. How can I create a step for this (normally I could use
>> something like hstack)? My code is as follows:
>>
>> pipeline = Pipeline([
>>     ('vect', CountVectorizer()),
>>     ('tfidf', TfidfTransformer()),
>>     ('clf', OneVsRestClassifier(SVC(probability=True))),
>> ])
>>
>> I would like to adjust this somehow to the following:
>>
>> pipeline = Pipeline([
>>     ('vect', CountVectorizer()),
>>     ('tfidf', TfidfTransformer()),
>>     ('change_to_dense', SOME HOW CHANGE TO DENSE),
>>     ('add_more_data', SOME HOW ADD FEATURES),
>>     ('clf', OneVsRestClassifier(SVC(probability=True))),
>> ])
>>
>> My first dataset, let's call it data1, is just an array of sentences. Below
>> is an example:
>>
>> data1 = ['This is the first sentence',
>>          'This is the second sentence',
>>          'This is the third sentence']
>>
>> The second dataset is numerical data of the following form:
>>
>> data2 = array([[0],
>>                [1],
>>                [0]])
>>
>>
>> Thanks!
>> --------------------------------------------------------------------------
>> ----
>> Slashdot TV.
>> Video for Nerds. Stuff that matters.
>> http://tv.slashdot.org/
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net<mailto:Scikit-learn-general@lists.sourceforge.net>
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
Best regards,
Anders Aagaard