Much obliged again. I'll aim to test this and (it'll be a few weeks - I'm away) report back with an answer. Cheers :-) i.
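A minimal sketch of how the CSR column append quoted below might be tested on a toy matrix. The helper name append_column_csr is hypothetical (it is not part of scipy or scikit-learn), the rows passed in are assumed to be unique, and the insert positions are taken from the original X.indptr because np.insert interprets positions relative to the array before insertion.

```python
import numpy as np
from scipy.sparse import csr_matrix


def append_column_csr(X, rows, values):
    """Return a new CSR matrix equal to X plus one trailing column.

    rows   -- unique row indices that receive a value (assumption)
    values -- the entries to store, one per row in `rows`
    Hypothetical helper, for illustration only.
    """
    rows = np.asarray(rows)
    values = np.asarray(values, dtype=X.dtype)
    # Each affected row gains exactly one stored entry.
    row_nnz = np.diff(X.indptr)
    row_nnz[rows] += 1
    indptr_new = np.concatenate(([0], np.cumsum(row_nnz)))
    # The end of row r in the *old* layout is X.indptr[r + 1]; np.insert
    # expects positions relative to the original arrays.
    ends = X.indptr[rows + 1]
    indices_new = np.insert(X.indices, ends, X.shape[1])
    data_new = np.insert(X.data, ends, values)
    return csr_matrix((data_new, indices_new, indptr_new),
                      shape=(X.shape[0], X.shape[1] + 1))


X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 0.0, 0.0],
                         [0.0, 3.0, 0.0]]))
X_new = append_column_csr(X, rows=[2, 0], values=[7.0, 5.0])
print(X_new.toarray())
```

Because the new column index is the largest in every row and is inserted at the end of each row, the column indices stay sorted within rows, provided they were sorted in X to begin with.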
On 2 May 2014 00:33, Joel Nothman <joel.noth...@gmail.com> wrote:
>> The first time I set a value in the new column I get:
>> SparseEfficiencyWarning: Changing the sparsity structure of a
>> csr_matrix is expensive.
>
> This is not as true as it was three months ago, and I think in three months'
> time it'll be even less true. But this is only true because it has to handle
> the general case: that there's already a value for that feature (if not
> multiple to be interpreted as their sum); that you might have duplicate
> indices that you're setting; to produce sorted indices; etc. So if you're
> only appending new features, there are faster ways to hack indices, data and
> indptr yourself.
>
>> Another possibility is to overallocate using a wrapper, then fill the
>> spare allocation, then resize if required.
>
> You mean allocating a larger array for indices, data than required, and
> shifting data across as needed? Is avoiding a copy that essential? It's
> probably easier just to rebuild the arrays as required.
>
> In CSR, if I wanted to insert a new feature into column X.shape[1], such
> that rows i are set to corresponding values x, I could do something like
> (untested):
>
> import numpy as np
> from scipy.sparse import csr_matrix
>
> # update indptr (i must be an array of unique row indices)
> row_nnz = np.diff(X.indptr)
> row_nnz[i] += 1
> indptr_new = np.hstack([0, np.cumsum(row_nnz)])
> # insert data at last position for each affected row; np.insert takes
> # positions in the original arrays, e.g. if i == 0, insert before
> # index X.indptr[1]
> indices_new = np.insert(X.indices, X.indptr[i + 1], X.shape[1])
> data_new = np.insert(X.data, X.indptr[i + 1], x)
> X_new = csr_matrix((data_new, indices_new, indptr_new),
>                    shape=(X.shape[0], X.shape[1] + 1))
>
> (And you can cut out a couple of copies of indptr, but I've avoided this for
> clarity.)
>
> Hope that helps!
>
> - Joel
>
>
> On 2 May 2014 01:11, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>
>> @Joel your solution looks like what I need.
>> I'd gone as far as
>> figuring out data/row/col and concatenating a new 1D column vector to
>> my large data, but hadn't realised I could just use
>> data/indices/indptr for a new matrix (I've not used sparse matrices in
>> anger before). The first time I set a value in the new column I get:
>> SparseEfficiencyWarning: Changing the sparsity structure of a
>> csr_matrix is expensive. lil_matrix is more efficient.
>> and I don't get this by using the concatenate solution (though that's
>> slow to construct), though subsequent timings don't seem any
>> different.
>>
>> Another possibility is to overallocate using a wrapper, then fill the
>> spare allocation, then resize if required. I'll have to test the
>> timings and tradeoffs.
>>
>> @Lars/Joel - thanks for confirming that the sort is cosmetic and that
>> I wasn't missing anything too obvious. @Lars thanks for the note about
>> inheritance.
>>
>> Cheers, i.
>>
>> On 1 May 2014 15:24, Joel Nothman <joel.noth...@gmail.com> wrote:
>> > Hi Ian,
>> >
>> > There is no functional reason for sorting the features. It arguably
>> > improves usability. Certainly, you can append features without having
>> > to re-sort.
>> >
>> > To find efficient ways of resizing a sparse matrix, you might need to be
>> > more specific about the way in which you want to expand it. For example
>> > if I have a CSR X, and I want to append columns for new features but
>> > leave them as zeros for the rows already in X, this is a trivial
>> > operation:
>> > X_new = csr_matrix((X.data, X.indices, X.indptr), shape=(X.shape[0],
>> > X.shape[1] + n_additional_features))
>> >
>> > Inserting values into those new features would be much easier to hack in
>> > CSC, and has a fast path implementation in scipy.sparse.hstack.
>> >
>> > I've also got code to handle the case where you have X1 and X2
>> > constructed with different feature names and you want to concatenate
>> > them with aligned features.
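
The two cases described in the message above (widening X with empty columns, and stacking a populated column in CSC) might look roughly like this on a toy matrix. This is only an illustrative sketch: the variable names X_wide, new_col and X_stacked are made up, and it shows the calls rather than measuring whether scipy.sparse.hstack is actually faster for a given matrix.

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[1.0, 0.0],
                                [0.0, 2.0],
                                [3.0, 0.0]]))

# Case 1: append all-zero columns by reusing the existing CSR arrays
# and simply declaring a wider shape.
n_additional_features = 2
X_wide = sparse.csr_matrix((X.data, X.indices, X.indptr),
                           shape=(X.shape[0],
                                  X.shape[1] + n_additional_features))

# Case 2: append a column that already holds values by stacking in CSC,
# then converting back to CSR for row-oriented work.
new_col = sparse.csc_matrix(np.array([[0.0], [4.0], [5.0]]))
X_stacked = sparse.hstack([X.tocsc(), new_col], format="csc").tocsr()

print(X_wide.shape)
print(X_stacked.toarray())
```

Note that the conversions in the second case (tocsc and tocsr) each build a new matrix, so for a large X the memory cost is worth measuring before relying on this.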
>> >
>> > Cheers,
>> >
>> > - Joel
>> >
>> >
>> > On 1 May 2014 23:59, Ian Ozsvald <i...@ianozsvald.com> wrote:
>> >>
>> >> Hello. I'm looking at feature_extraction.dict_vectorizer and I'm
>> >> wondering why fit() and restrict() use a sorted list of feature names
>> >> rather than their naturally-encountered order?
>> >>
>> >> Is there an algorithmic requirement somewhere for sorted feature names?
>> >>
>> >> Context - I'm working on a similarity-measurement system (45k cols *
>> >> 1mil rows, csr sparse matrix, 600MB), one requirement will be to
>> >> occasionally add a column or row. Avoiding a full rebuild of the
>> >> vectorizer and dynamically updating the mapping seems like a sensible
>> >> idea, but I'm not understanding why the feature name list is sorted().
>> >> I'm slowly working through the client's requirements to see if
>> >> avoiding a full rebuild is feasible. This is for an online production
>> >> system.
>> >>
>> >> Reusing (and probably inheriting) the sklearn vectorizer would be
>> >> nice, rather than rolling a custom solution in numpy. If anyone's
>> >> curious, my best approach to resizing the csr array is via
>> >>
>> >> http://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices/6853880#6853880
>> >> which costs 10 seconds and a temporary +2GB overall.
>> >> (and if you have a better suggestion for growing a csr matrix, I'd
>> >> love to hear it)
>> >>
>> >> Now if anyone's done this sort of thing before and wants to chat about
>> >> it, I'd love to say Hi.
>> >>
>> >> Ian.
>> >>
>> >> --
>> >> Ian Ozsvald (A.I. researcher)
>> >> i...@ianozsvald.com
>> >>
>> >> http://IanOzsvald.com
>> >> http://ModelInsight.io
>> >> http://MorConsulting.com
>> >> http://Annotate.IO
>> >> http://SocialTiesApp.com
>> >> http://TheScreencastingHandbook.com
>> >> http://FivePoundApp.com
>> >> http://twitter.com/IanOzsvald
>> >> http://ShowMeDo.com
>> >>
>> >> ------------------------------------------------------------------------------
>> >> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>> >> Instantly run your Selenium tests across 300+ browser/OS combos. Get
>> >> unparalleled scalability from the best Selenium testing platform
>> >> available.
>> >> Simple to use. Nothing to install. Get started now for free."
>> >> http://p.sf.net/sfu/SauceLabs
>> >> _______________________________________________
>> >> Scikit-learn-general mailing list
>> >> Scikit-learn-general@lists.sourceforge.net
>> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general