Re: [Scikit-learn-general] Why sorted feature_names_ in dict_vectorizer.fit?

Joel Nothman Thu, 01 May 2014 07:33:44 -0700

Hi Ian,

There is no functional reason for sorting the features; it arguably
improves usability. You can certainly append features without having to
re-sort.


To find efficient ways of resizing a sparse matrix, you might need to be
more specific about the way in which you want to expand it. For example, if
I have a CSR matrix X and I want to append columns for new features but
leave them as zeros for the rows already in X, this is a trivial operation:

    X_new = csr_matrix((X.data, X.indices, X.indptr),
                       shape=(X.shape[0], X.shape[1] + n_additional_features))
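
As a minimal, self-contained sketch of the operation above: the helper name
append_empty_columns and the toy matrix are only illustrative and not part of
scikit-learn or scipy. The constructor reuses the existing data, indices and
indptr arrays, so only the declared shape changes.

```python
import numpy as np
from scipy.sparse import csr_matrix


def append_empty_columns(X, n_additional_features):
    """Return a CSR matrix with extra all-zero columns on the right.

    Hypothetical helper: it rebuilds the matrix from X's own CSR arrays
    and only declares a wider shape, so no stored values are touched.
    """
    return csr_matrix(
        (X.data, X.indices, X.indptr),
        shape=(X.shape[0], X.shape[1] + n_additional_features),
    )


# Toy data, chosen only for illustration.
X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 3.0, 0.0]]))
X_new = append_empty_columns(X, 2)

print(X_new.shape)      # two more columns than X
print(X_new.toarray())  # original values on the left, zeros on the right
```

This works because a CSR row only lists the column indices it actually
stores, so widening the matrix does not require touching any row.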

Inserting values into those new features would be much easier to hack in
CSC, and scipy.sparse.hstack has a fast-path implementation for that case.
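
A small sketch of the CSC route, assuming the new feature values are already
available as their own sparse block with the same number of rows. The
variable names and the toy values are illustrative only.

```python
import numpy as np
from scipy.sparse import csc_matrix, hstack

# Existing matrix, stored column-wise (toy data).
X = csc_matrix(np.array([[1.0, 0.0],
                         [0.0, 3.0],
                         [4.0, 0.0]]))

# One new feature column with a non-zero value for the second row only.
new_cols = csc_matrix(np.array([[0.0],
                                [5.0],
                                [0.0]]))

# Horizontal concatenation; asking for CSC output keeps the column layout.
X_new = hstack([X, new_cols], format="csc")

print(X_new.shape)
print(X_new.toarray())
```

Since CSC stores each column contiguously, appending columns amounts to
concatenating the stored arrays of the two blocks.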

I've also got code to handle the case where you have X1 and X2 constructed
with different feature names and you want to concatenate them with aligned
features.
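
The code referred to above is not shown in this thread. The following is only
a rough sketch of one way such an alignment could be done, under the
assumption that each matrix comes with a list of its feature names in column
order. The function align_and_stack and its arguments are hypothetical names.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def align_and_stack(X1, names1, X2, names2):
    """Stack X1 on top of X2 after mapping both onto a shared feature list.

    Hypothetical helper. The shared list keeps names1 in order and appends
    unseen names from names2 at the end, so nothing is re-sorted.
    """
    names = list(names1)
    index = {name: j for j, name in enumerate(names)}
    for name in names2:
        if name not in index:
            index[name] = len(names)
            names.append(name)
    n_features = len(names)

    def remap(X, old_names):
        X = csr_matrix(X)
        # col_map[j] is the new column position of old column j.
        col_map = np.array([index[name] for name in old_names],
                           dtype=X.indices.dtype)
        # copy=True so that sorting below cannot modify the input matrix.
        out = csr_matrix((X.data, col_map[X.indices], X.indptr),
                         shape=(X.shape[0], n_features), copy=True)
        out.sort_indices()
        return out

    stacked = vstack([remap(X1, names1), remap(X2, names2)], format="csr")
    return stacked, names


# Toy data: the two matrices share feature "a" but list it in different columns.
X1 = csr_matrix(np.array([[1.0, 2.0]]))
X2 = csr_matrix(np.array([[3.0, 4.0]]))
X_all, all_names = align_and_stack(X1, ["a", "b"], X2, ["c", "a"])

print(all_names)
print(X_all.toarray())
```

Only the column indices are rewritten; the stored values and row pointers
are carried over unchanged.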

Cheers,

- Joel



On 1 May 2014 23:59, Ian Ozsvald <i...@ianozsvald.com> wrote:

> Hello. I'm looking at feature_extraction.dict_vectorizer and I'm
> wondering why fit() and restrict() use a sorted list of feature names
> rather than their naturally-encountered order?
>
> Is there an algorithmic requirement somewhere for sorted feature names?
>
> Context - I'm working on a similarity-measurement system (45k cols *
> 1mil rows, csr sparse matrix, 600MB), one requirement will be to
> occasionally add a column or row. Avoiding a full rebuild of the
> vectorizer and dynamically updating the mapping seems like a sensible
> idea, but I'm not understanding why the feature name list is sorted().
> I'm slowly working through the client's requirements to see if
> avoiding a full rebuild is feasible. This is for an online production
> system.
>
> Reusing (and probably inheriting) the sklearn vectorizer would be
> nice, rather than rolling a custom solution in numpy. If anyone's
> curious, my best approach to resizing the csr array is via
>
> http://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices/6853880#6853880
> which costs 10 seconds and a temporary +2GB overall.
> (and if you have a better suggestion for growing a csr matrix, I'd
> love to hear it)
>
> Now if anyone's done this sort of thing before and wants to chat about
> it, I'd love to say Hi.
>
> Ian.
>
> --
> Ian Ozsvald (A.I. researcher)
> i...@ianozsvald.com
>
> http://IanOzsvald.com
> http://ModelInsight.io
> http://MorConsulting.com
> http://Annotate.IO
> http://SocialTiesApp.com
> http://TheScreencastingHandbook.com
> http://FivePoundApp.com
> http://twitter.com/IanOzsvald
> http://ShowMeDo.com
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>

