Re: [Scikit-learn-general] Why sorted feature_names_ in dict_vectorizer.fit?

Ian Ozsvald Thu, 01 May 2014 08:13:21 -0700

@Joel your solution looks like what I need. I'd got as far as
figuring out data/row/col and concatenating a new 1D column vector onto
my large matrix, but hadn't realised I could just reuse
data/indices/indptr to build a new matrix (I've not used sparse matrices
in anger before). The first time I set a value in the new column I get:

    SparseEfficiencyWarning: Changing the sparsity structure of a
    csr_matrix is expensive. lil_matrix is more efficient.

I don't get this warning with the concatenate solution (though that's
slow to construct), and subsequent timings don't seem any different.
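
For the archive, here is a minimal sketch of the two approaches being
compared, assuming scipy.sparse and a tiny made-up matrix. The names X,
n_additional_features and new_col are illustrative only, not taken from
my real code:

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix, hstack

# Small stand-in for the large CSR matrix (illustrative data only).
X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 3.0, 0.0]]))
n_additional_features = 2

# Joel's suggestion: reuse the existing data/indices/indptr arrays and
# only widen the shape. The appended columns are all zeros.
X_new = csr_matrix((X.data, X.indices, X.indptr),
                   shape=(X.shape[0], X.shape[1] + n_additional_features))

# Alternative when the new column already holds values: build it as a
# sparse column and stack it on, instead of assigning into CSR one
# element at a time (which is what triggers SparseEfficiencyWarning).
new_col = csc_matrix(np.array([[0.0], [5.0]]))
X_stacked = hstack([X.tocsc(), new_col], format="csc")
```

The first form does not touch the stored values at all, so it should be
cheap regardless of matrix size; the second pays for a CSC conversion.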


Another possibility is to overallocate using a wrapper, then fill the
spare allocation, then resize if required. I'll have to test the
timings and tradeoffs.
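
A rough sketch of what I mean by the wrapper, untested on real data. The
class name OverallocatedCSR and its spare parameter are hypothetical,
and it only covers columns: in CSR an all-zero column costs no storage,
so the spare columns are just a wider shape.

```python
import numpy as np
from scipy.sparse import csr_matrix


class OverallocatedCSR:
    """Hypothetical wrapper that keeps spare all-zero columns in reserve."""

    def __init__(self, X, spare=16):
        self.spare = max(1, spare)   # always grow by at least one column
        self.n_used = X.shape[1]     # columns that carry real features
        self.X = self._widen(X, self.n_used + self.spare)

    @staticmethod
    def _widen(X, n_cols):
        # Widening CSR only changes the shape; the arrays are reused.
        return csr_matrix((X.data, X.indices, X.indptr),
                          shape=(X.shape[0], n_cols))

    def add_column(self):
        """Claim the next spare column and return its index."""
        if self.n_used == self.X.shape[1]:
            # Spare allocation exhausted: widen by another block.
            self.X = self._widen(self.X, self.n_used + self.spare)
        self.n_used += 1
        return self.n_used - 1
```

Writing values into a claimed column still changes the sparsity
structure, so this only avoids rebuilding the matrix on every new
feature, not the cost of the inserts themselves.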

@Lars/Joel - thanks for confirming that the sort is cosmetic and that
I wasn't missing anything too obvious. @Lars thanks for the note about
inheritance.
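
Since the sort isn't needed, appending a feature can leave every
existing column index alone. A minimal sketch in plain Python, with
made-up feature names and a hypothetical add_feature helper (this is not
the DictVectorizer implementation, just the idea of keeping the
name-to-index mapping in encounter order):

```python
# Feature-name -> column-index mapping kept in encounter order.
feature_names = ["height", "age"]   # illustrative names, deliberately unsorted
vocabulary = {name: i for i, name in enumerate(feature_names)}


def add_feature(name):
    """Append a new feature at the end; existing indices are unchanged."""
    if name not in vocabulary:
        vocabulary[name] = len(feature_names)
        feature_names.append(name)
    return vocabulary[name]
```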

Cheers, i.

On 1 May 2014 15:24, Joel Nothman <joel.noth...@gmail.com> wrote:
> Hi Ian,
>
> There is no functional reason for sorting the features. It arguably improves
> usability. Certainly, you can append features without having to re-sort.
>
> To find efficient ways of resizing a sparse matrix, you might need to be
> more specific about the way in which you want to expand it. For example if I
> have a CSR X, and I want to append columns for new features but leave them
> as zeros for the rows already in X, this is a trivial operation:
>     X_new = csr_matrix((X.data, X.indices, X.indptr),
>                        shape=(X.shape[0],
>                               X.shape[1] + n_additional_features))
>
> Inserting values into those new features would be much easier to hack in
> CSC, and has a fast path implementation in scipy.sparse.hstack.
>
> I've also got code to handle the case where you have X1 and X2 constructed
> with different feature names and you want to concatenate them with aligned
> features.
>
> Cheers,
>
> - Joel
>
>
>
> On 1 May 2014 23:59, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>
>> Hello. I'm looking at feature_extraction.dict_vectorizer and I'm
>> wondering why fit() and restrict() use a sorted list of feature names
>> rather than their naturally-encountered order?
>>
>> Is there an algorithmic requirement somewhere for sorted feature names?
>>
>> Context - I'm working on a similarity-measurement system (45k cols *
>> 1mil rows, csr sparse matrix, 600MB), one requirement will be to
>> occasionally add a column or row. Avoiding a full rebuild of the
>> vectorizer and dynamically updating the mapping seems like a sensible
>> idea, but I'm not understanding why the feature name list is sorted().
>> I'm slowly working through the client's requirements to see if
>> avoiding a full rebuild is feasible. This is for an online production
>> system.
>>
>> Reusing (and probably inheriting) the sklearn vectorizer would be
>> nice, rather than rolling a custom solution in numpy. If anyone's
>> curious, my best approach to resizing the csr array is via
>>
>> http://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices/6853880#6853880
>> which costs 10 seconds and a temporary +2GB overall.
>> (and if you have a better suggestion for growing a csr matrix, I'd
>> love to hear it)
>>
>> Now if anyone's done this sort of thing before and wants to chat about
>> it, I'd love to say Hi.
>>
>> Ian.
>>
>> --
>> Ian Ozsvald (A.I. researcher)
>> i...@ianozsvald.com
>>
>> http://IanOzsvald.com
>> http://ModelInsight.io
>> http://MorConsulting.com
>> http://Annotate.IO
>> http://SocialTiesApp.com
>> http://TheScreencastingHandbook.com
>> http://FivePoundApp.com
>> http://twitter.com/IanOzsvald
>> http://ShowMeDo.com
>>
>>
>> ------------------------------------------------------------------------------
>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>> Instantly run your Selenium tests across 300+ browser/OS combos.  Get
>> unparalleled scalability from the best Selenium testing platform
>> available.
>> Simple to use. Nothing to install. Get started now for free."
>> http://p.sf.net/sfu/SauceLabs
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



-- 
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://ModelInsight.io
http://MorConsulting.com
http://Annotate.IO
http://SocialTiesApp.com
http://TheScreencastingHandbook.com
http://FivePoundApp.com
http://twitter.com/IanOzsvald
http://ShowMeDo.com

