@Joel your solution looks like what I need. I'd got as far as working out data/row/col and concatenating a new 1D column vector onto my large data, but hadn't realised I could just reuse data/indices/indptr for a new matrix (I've not used sparse matrices in anger before). The first time I set a value in the new column I get:

  SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.

I don't get this warning with the concatenate solution (though that one is slow to construct), and subsequent timings don't seem any different between the two.
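For anyone following along, here is a minimal sketch of the two routes described above, on a toy matrix. The 3x3 matrix, the value 7.0 and the names X_wide, X_set and X_stacked are made up for illustration; only csr_matrix, hstack and SparseEfficiencyWarning come from scipy.sparse. It is not a benchmark and says nothing about timings on the real 45k x 1mil data.

```python
import warnings

import numpy as np
from scipy import sparse

# Toy stand-in for the large CSR matrix (values are illustrative only).
X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 0.0, 3.0],
                                [4.0, 5.0, 0.0]]))

# Route 1: widen by one all-zero column by reusing data/indices/indptr
# and only declaring a larger shape. No stored values change.
X_wide = sparse.csr_matrix((X.data, X.indices, X.indptr),
                           shape=(X.shape[0], X.shape[1] + 1))

# Writing a nonzero into the new column inserts a stored entry, which
# changes the sparsity structure and is what raises the warning.
# Work on a copy so the arrays taken from X are left alone.
X_set = X_wide.copy()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X_set[1, 3] = 7.0
warned = any(issubclass(w.category, sparse.SparseEfficiencyWarning)
             for w in caught)

# Route 2: build the new column as its own sparse matrix and stack it
# on. No in-place insertion, so no warning, at the cost of a new matrix.
new_col = sparse.csr_matrix(np.array([[0.0], [7.0], [0.0]]))
X_stacked = sparse.hstack([X, new_col], format="csr")

print(X_wide.shape, X_wide.nnz, warned)
print(np.array_equal(X_set.toarray(), X_stacked.toarray()))
```

Both routes end up with the same matrix; the difference is whether the new values are inserted into an existing CSR structure or supplied up front as a separate column.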
Another possibility is to overallocate using a wrapper, then fill the spare allocation, then resize if required. I'll have to test the timings and tradeoffs.

@Lars/Joel - thanks for confirming that the sort is cosmetic and that I wasn't missing anything too obvious.
@Lars thanks for the note about inheritance.

Cheers,
i.

On 1 May 2014 15:24, Joel Nothman <joel.noth...@gmail.com> wrote:
> Hi Ian,
>
> There is no functional reason for sorting the features. It arguably improves
> usability. Certainly, you can append features without having to re-sort.
>
> To find efficient ways of resizing a sparse matrix, you might need to be
> more specific about the way in which you want to expand it. For example if I
> have a CSR X, and I want to append columns for new features but leave them
> as zeros for the rows already in X, this is a trivial operation:
> X_new = csr_matrix((X.data, X.indices, X.indptr), shape=(X.shape[0],
> X.shape[1] + n_additional_features))
>
> Inserting values into those new features would be much easier to hack in
> CSC, and has a fast path implementation in scipy.sparse.hstack.
>
> I've also got code to handle the case where you have X1 and X2 constructed
> with different feature names and you want to concatenate them with aligned
> features.
>
> Cheers,
>
> - Joel
>
> On 1 May 2014 23:59, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>
>> Hello. I'm looking at feature_extraction.dict_vectorizer and I'm
>> wondering why fit() and restrict() use a sorted list of feature names
>> rather than their naturally-encountered order?
>>
>> Is there an algorithmic requirement somewhere for sorted feature names?
>>
>> Context - I'm working on a similarity-measurement system (45k cols *
>> 1mil rows, csr sparse matrix, 600MB), one requirement will be to
>> occasionally add a column or row. Avoiding a full rebuild of the
>> vectorizer and dynamically updating the mapping seems like a sensible
>> idea, but I'm not understanding why the feature name list is sorted().
>> I'm slowly working through the client's requirements to see if
>> avoiding a full rebuild is feasible. This is for an online production
>> system.
>>
>> Reusing (and probably inheriting) the sklearn vectorizer would be
>> nice, rather than rolling a custom solution in numpy. If anyone's
>> curious, my best approach to resizing the csr array is via
>>
>> http://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices/6853880#6853880
>> which costs 10 seconds and a temporary +2GB overall.
>> (and if you have a better suggestion for growing a csr matrix, I'd
>> love to hear it)
>>
>> Now if anyone's done this sort of thing before and wants to chat about
>> it, I'd love to say Hi.
>>
>> Ian.
>>
>> --
>> Ian Ozsvald (A.I. researcher)
>> i...@ianozsvald.com
>>
>> http://IanOzsvald.com
>> http://ModelInsight.io
>> http://MorConsulting.com
>> http://Annotate.IO
>> http://SocialTiesApp.com
>> http://TheScreencastingHandbook.com
>> http://FivePoundApp.com
>> http://twitter.com/IanOzsvald
>> http://ShowMeDo.com
>>
>> ------------------------------------------------------------------------------
>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>> Instantly run your Selenium tests across 300+ browser/OS combos. Get
>> unparalleled scalability from the best Selenium testing platform
>> available.
>> Simple to use. Nothing to install. Get started now for free."
>> http://p.sf.net/sfu/SauceLabs
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general