>
> The first time I set a value in the new column I get:
> SparseEfficiencyWarning: Changing the sparsity structure of a
> csr_matrix is expensive.
This is not as true as it was three months ago, and I think in three months'
time it'll be even less true. But it is expensive only because the setter has
to handle the general case: there may already be a value for that feature (or
several, to be interpreted as their sum); you might be setting duplicate
indices; the result should have sorted indices; etc. So if you're
only appending new features, there are faster ways to hack indices, data
and indptr yourself.
> Another possibility is to overallocate using a wrapper, then fill the
> spare allocation, then resize if required.
You mean allocating a larger array for indices, data than required, and
shifting data across as needed? Is avoiding a copy that essential? It's
probably easier just to rebuild the arrays as required.
In CSR, if I wanted to insert a new feature into column X.shape[1], such
that rows i are set to corresponding values x, I could do something like
(untested):
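import numpy as np
from scipy.sparse import csr_matrix

# assumes i is a sorted array of unique row indices and x the matching values
# update indptr: each row in i gains one stored value
row_nnz = np.diff(X.indptr)
row_nnz[i] += 1
indptr_new = np.hstack([0, np.cumsum(row_nnz)])
# insert data at last position for each affected row; np.insert takes
# positions in the original arrays, so index with X.indptr, not indptr_new
# e.g. if i == 0, insert before index X.indptr[1]
indices_new = np.insert(X.indices, X.indptr[i + 1], X.shape[1])
data_new = np.insert(X.data, X.indptr[i + 1], x)
X_new = csr_matrix((data_new, indices_new, indptr_new),
                   shape=(X.shape[0], X.shape[1] + 1))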
(And you can cut out a couple of copies of indptr, but I've avoided this
for clarity.)
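
For anyone who wants to try the idea on a toy matrix, here is a minimal sketch of the same approach wrapped in a function and cross-checked against scipy.sparse.hstack. The helper name append_column_csr and the example data are made up for illustration; it assumes the row indices are sorted and unique, and that the new column is the last one.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack


def append_column_csr(X, rows, values):
    """Return a new CSR matrix equal to X plus one extra, final column.

    Hypothetical helper (not part of scipy or scikit-learn).
    `rows` must be sorted, unique row indices; `values` the matching entries.
    """
    rows = np.asarray(rows)
    values = np.asarray(values, dtype=X.data.dtype)

    # Each affected row stores one more value.
    row_nnz = np.diff(X.indptr)
    row_nnz[rows] += 1
    indptr_new = np.concatenate(([0], np.cumsum(row_nnz)))
    indptr_new = indptr_new.astype(X.indptr.dtype)

    # np.insert positions refer to the original arrays, so the insertion
    # point is the end of each affected row in the old indptr.
    at = X.indptr[rows + 1]
    indices_new = np.insert(X.indices, at, X.shape[1])
    data_new = np.insert(X.data, at, values)

    return csr_matrix((data_new, indices_new, indptr_new),
                      shape=(X.shape[0], X.shape[1] + 1))


# Toy data: a 3x3 matrix with an empty middle row.
X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 0.0, 0.0],
                         [0.0, 3.0, 0.0]]))
rows = np.array([0, 1])
values = np.array([7.0, 8.0])
X_new = append_column_csr(X, rows, values)

# Cross-check: build the column explicitly and let scipy stack it.
col = csr_matrix((values, (rows, np.zeros(len(rows), dtype=int))),
                 shape=(X.shape[0], 1))
X_ref = hstack([X, col], format="csr")
```

Because the new column index is larger than every existing one, appending at the end of each row keeps the indices sorted, so no re-sort should be needed afterwards.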
Hope that helps!
- Joel
On 2 May 2014 01:11, Ian Ozsvald <i...@ianozsvald.com> wrote:
> @Joel your solution looks like what I need. I'd gone as far as
> figuring out data/row/col and concatenating a new 1D column vector to
> my large data, but hadn't realised I could just use
> data/indices/indptr for a new matrix (I've not used sparse matrices in
> anger before). The first time I set a value in the new column I get:
> SparseEfficiencyWarning: Changing the sparsity structure of a
> csr_matrix is expensive. lil_matrix is more efficient.
> and I don't get this by using the concatenate solution (though that's
> slow to construct), though subsequent timings don't seem any
> different.
>
> Another possibility is to overallocate using a wrapper, then fill the
> spare allocation, then resize if required. I'll have to test the
> timings and tradeoffs.
>
> @Lars/Joel - thanks for confirming that the sort is cosmetic and that
> I wasn't missing anything too obvious. @Lars thanks for the note about
> inheritance.
>
> Cheers, i.
>
> On 1 May 2014 15:24, Joel Nothman <joel.noth...@gmail.com> wrote:
> > Hi Ian,
> >
> > There is no functional reason for sorting the features. It arguably improves
> > usability. Certainly, you can append features without having to re-sort.
> >
> > To find efficient ways of resizing a sparse matrix, you might need to be
> > more specific about the way in which you want to expand it. For example if I
> > have a CSR X, and I want to append columns for new features but leave them
> > as zeros for the rows already in X, this is a trivial operation:
> > X_new = csr_matrix((X.data, X.indices, X.indptr), shape=(X.shape[0],
> > X.shape[1] + n_additional_features))
> >
> > Inserting values into those new features would be much easier to hack in
> > CSC, and has a fast path implementation in scipy.sparse.hstack.
> >
> > I've also got code to handle the case where you have X1 and X2 constructed
> > with different feature names and you want to concatenate them with aligned
> > features.
> >
> > Cheers,
> >
> > - Joel
> >
> >
> >
> > On 1 May 2014 23:59, Ian Ozsvald <i...@ianozsvald.com> wrote:
> >>
> >> Hello. I'm looking at feature_extraction.dict_vectorizer and I'm
> >> wondering why fit() and restrict() use a sorted list of feature names
> >> rather than their naturally-encountered order?
> >>
> >> Is there an algorithmic requirement somewhere for sorted feature names?
> >>
> >> Context - I'm working on a similarity-measurement system (45k cols *
> >> 1mil rows, csr sparse matrix, 600MB), one requirement will be to
> >> occasionally add a column or row. Avoiding a full rebuild of the
> >> vectorizer and dynamically updating the mapping seems like a sensible
> >> idea, but I'm not understanding why the feature name list is sorted().
> >> I'm slowly working through the client's requirements to see if
> >> avoiding a full rebuild is feasible. This is for an online production
> >> system.
> >>
> >> Reusing (and probably inheriting) the sklearn vectorizer would be
> >> nice, rather than rolling a custom solution in numpy. If anyone's
> >> curious, my best approach to resizing the csr array is via
> >> http://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices/6853880#6853880
> >> which costs 10 seconds and a temporary +2GB overall.
> >> (and if you have a better suggestion for growing a csr matrix, I'd
> >> love to hear it)
> >>
> >> Now if anyone's done this sort of thing before and wants to chat about
> >> it, I'd love to say Hi.
> >>
> >> Ian.
> >>
> >> --
> >> Ian Ozsvald (A.I. researcher)
> >> i...@ianozsvald.com
> >>
> >> http://IanOzsvald.com
> >> http://ModelInsight.io
> >> http://MorConsulting.com
> >> http://Annotate.IO
> >> http://SocialTiesApp.com
> >> http://TheScreencastingHandbook.com
> >> http://FivePoundApp.com
> >> http://twitter.com/IanOzsvald
> >> http://ShowMeDo.com
> >>
> >>
> >>
> ------------------------------------------------------------------------------
> >> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
> >> Instantly run your Selenium tests across 300+ browser/OS combos. Get
> >> unparalleled scalability from the best Selenium testing platform
> >> available.
> >> Simple to use. Nothing to install. Get started now for free."
> >> http://p.sf.net/sfu/SauceLabs
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> Scikit-learn-general@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general