Much obliged again. I'll aim to test this and report back with an answer
(it'll be a few weeks, as I'm away). Cheers :-) i.

On 2 May 2014 00:33, Joel Nothman <joel.noth...@gmail.com> wrote:
>> The first time I set a value in the new column I get:
>> SparseEfficiencyWarning: Changing the sparsity structure of a
>> csr_matrix is expensive.
>
>
> This is less true than it was three months ago, and I think in three months'
> time it'll be less true still. It remains expensive only because assignment
> has to handle the general case: there may already be a value for that
> feature (or several, to be interpreted as their sum); you might be setting
> duplicate indices; the result should have sorted indices; etc. So if you're
> only appending new features, there are faster ways to hack indices, data and
> indptr yourself.
>
>> Another possibility is to overallocate using a wrapper, then fill the
>> spare allocation, then resize if required.
>
>
> You mean allocating a larger array for indices, data than required, and
> shifting data across as needed? Is avoiding a copy that essential? It's
> probably easier just to rebuild the arrays as required.
>
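A rough sketch of the over-allocation idea, for comparison with simply rebuilding the arrays. It assumes a CSC-style layout (so that an appended column lands at the end of the buffers) and a doubling growth policy. The class `GrowableCSC` and its methods are hypothetical names for illustration; nothing like it is claimed to exist in scipy.

```python
import numpy as np
from scipy import sparse


class GrowableCSC:
    """Hypothetical wrapper holding CSC buffers with spare capacity."""

    def __init__(self, n_rows, capacity=4, dtype=np.float64):
        self.n_rows = n_rows
        self.data = np.empty(capacity, dtype=dtype)
        self.indices = np.empty(capacity, dtype=np.int32)
        self.indptr = [0]  # one entry per column boundary
        self.nnz = 0

    def _reserve(self, needed):
        # Grow geometrically so copies are amortised over many appends.
        if needed <= len(self.data):
            return
        capacity = max(needed, 2 * len(self.data))
        for name in ("data", "indices"):
            old = getattr(self, name)
            new = np.empty(capacity, dtype=old.dtype)
            new[:self.nnz] = old[:self.nnz]
            setattr(self, name, new)

    def append_column(self, rows, values):
        rows = np.asarray(rows)
        order = np.argsort(rows)  # keep row indices sorted within the column
        end = self.nnz + len(rows)
        self._reserve(end)
        self.indices[self.nnz:end] = rows[order]
        self.data[self.nnz:end] = np.asarray(values)[order]
        self.nnz = end
        self.indptr.append(end)

    def to_csc(self):
        # Only the used prefix of each buffer is handed to scipy.
        return sparse.csc_matrix(
            (self.data[:self.nnz], self.indices[:self.nnz],
             np.asarray(self.indptr, dtype=np.int32)),
            shape=(self.n_rows, len(self.indptr) - 1),
        )


G = GrowableCSC(n_rows=3)
G.append_column([0, 2], [1.0, 3.0])
G.append_column([1], [2.0])
G.append_column([2, 0, 1], [6.0, 4.0, 5.0])  # exceeds capacity 4, so buffers grow
M = G.to_csc()
```

Whether the saved copies justify the extra bookkeeping is exactly the tradeoff that would need timing on real data.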
> In CSR, if I wanted to insert a new feature into column X.shape[1], such
> that rows i are set to corresponding values x, I could do something like
> (untested):
>
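> import numpy as np
> from scipy.sparse import csr_matrix
>
> # i: sorted array of unique row indices; x: the value for each of those rows
> # update indptr
> row_nnz = np.diff(X.indptr)
> row_nnz[i] += 1
> indptr_new = np.hstack([0, np.cumsum(row_nnz)])
> # insert data at last position for each affected row; np.insert takes
> # positions in the original arrays, so use X.indptr rather than indptr_new
> # e.g. if i == 0, insert before index X.indptr[1]
> indices_new = np.insert(X.indices, X.indptr[i + 1], X.shape[1])
> data_new = np.insert(X.data, X.indptr[i + 1], x)
> X_new = csr_matrix((data_new, indices_new, indptr_new),
>                    shape=(X.shape[0], X.shape[1] + 1))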
>
> (And you can cut out a couple of copies of indptr, but I've avoided this for
> clarity.)
>
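A minimal, self-contained sketch of the CSR approach described above, wrapped in a function and checked on a small matrix. It assumes the row indices are unique and that the new column goes at position X.shape[1]. The name `append_csr_column` is hypothetical, not part of scipy. Two details matter: np.insert interprets positions relative to the original array, so the sketch indexes X.indptr rather than the updated indptr, and the rows are sorted first so that values stay with their rows when empty rows are involved.

```python
import numpy as np
from scipy import sparse


def append_csr_column(X, rows, values):
    """Return a copy of CSR matrix X with one extra column on the right.

    rows: unique row indices that receive a value in the new column.
    values: one value per entry of rows.
    """
    rows = np.asarray(rows)
    values = np.asarray(values, dtype=X.dtype)
    order = np.argsort(rows)
    rows, values = rows[order], values[order]

    # Each affected row gains one stored element.
    row_nnz = np.diff(X.indptr)
    row_nnz[rows] += 1
    indptr_new = np.concatenate([[0], np.cumsum(row_nnz)])

    # Positions refer to the original arrays: row r ends at X.indptr[r + 1].
    at = X.indptr[rows + 1]
    indices_new = np.insert(X.indices, at, X.shape[1])
    data_new = np.insert(X.data, at, values)

    return sparse.csr_matrix(
        (data_new, indices_new, indptr_new),
        shape=(X.shape[0], X.shape[1] + 1),
    )


X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 0.0, 0.0],
                                [0.0, 3.0, 0.0]]))
X_new = append_csr_column(X, rows=[2, 0], values=[9.0, 7.0])
expected = np.array([[1.0, 0.0, 2.0, 7.0],
                     [0.0, 0.0, 0.0, 0.0],
                     [0.0, 3.0, 0.0, 9.0]])
assert np.array_equal(X_new.toarray(), expected)
```

Because the new column index is the largest in every row, appending it at the end of each row keeps the indices sorted.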
> Hope that helps!
>
> - Joel
>
>
> On 2 May 2014 01:11, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>
>> @Joel your solution looks like what I need. I'd gone as far as
>> figuring out data/row/col and concatenating a new 1D column vector to
>> my large data, but hadn't realised I could just use
>> data/indices/indptr for a new matrix (I've not used sparse matrices in
>> anger before). The first time I set a value in the new column I get:
>> SparseEfficiencyWarning: Changing the sparsity structure of a
>> csr_matrix is expensive. lil_matrix is more efficient.
>> I don't get this warning with the concatenate solution (though that's
>> slow to construct), and subsequent timings don't seem any different.
>>
>> Another possibility is to overallocate using a wrapper, then fill the
>> spare allocation, then resize if required. I'll have to test the
>> timings and tradeoffs.
>>
>> @Lars/Joel - thanks for confirming that the sort is cosmetic and that
>> I wasn't missing anything too obvious. @Lars thanks for the note about
>> inheritance.
>>
>> Cheers, i.
>>
>> On 1 May 2014 15:24, Joel Nothman <joel.noth...@gmail.com> wrote:
>> > Hi Ian,
>> >
>> > There is no functional reason for sorting the features. It arguably improves
>> > usability. Certainly, you can append features without having to re-sort.
>> >
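To illustrate why no re-sort is needed, here is a small sketch of a feature mapping kept in encounter order. The helper `column_for` and the two containers are hypothetical; they only loosely mirror the role of DictVectorizer's `vocabulary_` and `feature_names_` attributes and are not the library's implementation.

```python
def column_for(vocabulary, feature_names, name):
    """Return the column index of name, appending it if unseen."""
    if name not in vocabulary:
        vocabulary[name] = len(feature_names)
        feature_names.append(name)
    return vocabulary[name]


vocabulary = {}      # feature name -> column index
feature_names = []   # column index -> feature name

for record in [{"height": 1.8, "age": 30}, {"age": 41, "city=Leeds": 1}]:
    for name in record:
        column_for(vocabulary, feature_names, name)

# A feature seen later simply takes the next free column.
weight_col = column_for(vocabulary, feature_names, "weight")
```

Existing column indices never change, so a matrix built earlier stays valid and only needs widening.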
>> > To find efficient ways of resizing a sparse matrix, you might need to be
>> > more specific about the way in which you want to expand it. For example,
>> > if I have a CSR X, and I want to append columns for new features but leave
>> > them as zeros for the rows already in X, this is a trivial operation:
>> >
>> > X_new = csr_matrix((X.data, X.indices, X.indptr),
>> >                    shape=(X.shape[0], X.shape[1] + n_additional_features))
>> >
>> > Inserting values into those new features would be much easier to hack in
>> > CSC, and has a fast path implementation in scipy.sparse.hstack.
>> >
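A sketch of the two CSC routes mentioned above: a hand-rolled append to data, indices and indptr, and the public route through scipy.sparse.hstack. The helper `append_csc_column` is a hypothetical name, and it assumes the row indices are unique.

```python
import numpy as np
from scipy import sparse


def append_csc_column(X, rows, values):
    """Return CSC matrix X with one extra column appended on the right."""
    rows = np.asarray(rows)
    values = np.asarray(values, dtype=X.dtype)
    order = np.argsort(rows)  # keep row indices sorted within the column

    # In CSC each column is a contiguous slice, so a new last column is
    # a plain append to data/indices plus one extra indptr entry.
    data = np.concatenate([X.data, values[order]])
    indices = np.concatenate([X.indices, rows[order]])
    indptr = np.append(X.indptr, X.indptr[-1] + len(rows))

    return sparse.csc_matrix(
        (data, indices, indptr), shape=(X.shape[0], X.shape[1] + 1)
    )


X = sparse.csc_matrix(np.array([[1.0, 0.0],
                                [0.0, 2.0],
                                [3.0, 0.0]]))
rows, values = [2, 0], [5.0, 4.0]

by_hand = append_csc_column(X, rows, values)

# The same result through the public API: build the column, then hstack.
new_col = sparse.csc_matrix(
    (values, (rows, [0, 0])), shape=(X.shape[0], 1)
)
stacked = sparse.hstack([X, new_col], format="csc")

assert np.array_equal(by_hand.toarray(), stacked.toarray())
```

Unlike the CSR case, nothing has to be inserted mid-array here, which is why the CSC version is so much shorter.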
>> > I've also got code to handle the case where you have X1 and X2 constructed
>> > with different feature names and you want to concatenate them with aligned
>> > features.
>> >
>> > Cheers,
>> >
>> > - Joel
>> >
>> >
>> >
>> > On 1 May 2014 23:59, Ian Ozsvald <i...@ianozsvald.com> wrote:
>> >>
>> >> Hello. I'm looking at feature_extraction.dict_vectorizer and I'm
>> >> wondering why fit() and restrict() use a sorted list of feature names
>> >> rather than their naturally-encountered order?
>> >>
>> >> Is there an algorithmic requirement somewhere for sorted feature names?
>> >>
>> >> Context - I'm working on a similarity-measurement system (45k cols *
>> >> 1mil rows, csr sparse matrix, 600MB), one requirement will be to
>> >> occasionally add a column or row. Avoiding a full rebuild of the
>> >> vectorizer and dynamically updating the mapping seems like a sensible
>> >> idea, but I'm not understanding why the feature name list is sorted().
>> >> I'm slowly working through the client's requirements to see if
>> >> avoiding a full rebuild is feasible. This is for an online production
>> >> system.
>> >>
>> >> Reusing (and probably inheriting) the sklearn vectorizer would be
>> >> nice, rather than rolling a custom solution in numpy. If anyone's
>> >> curious, my best approach to resizing the csr array is via
>> >>
>> >>
>> >> http://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices/6853880#6853880
>> >> which costs 10 seconds and a temporary +2GB overall.
>> >> (and if you have a better suggestion for growing a csr matrix, I'd
>> >> love to hear it)
>> >>
>> >> Now if anyone's done this sort of thing before and wants to chat about
>> >> it, I'd love to say Hi.
>> >>
>> >> Ian.
>> >>
>> >> --
>> >> Ian Ozsvald (A.I. researcher)
>> >> i...@ianozsvald.com
>> >>
>> >> http://IanOzsvald.com
>> >> http://ModelInsight.io
>> >> http://MorConsulting.com
>> >> http://Annotate.IO
>> >> http://SocialTiesApp.com
>> >> http://TheScreencastingHandbook.com
>> >> http://FivePoundApp.com
>> >> http://twitter.com/IanOzsvald
>> >> http://ShowMeDo.com
>> >>
>> >>
>> >>
>> >> ------------------------------------------------------------------------------
>> >> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>> >> Instantly run your Selenium tests across 300+ browser/OS combos.  Get
>> >> unparalleled scalability from the best Selenium testing platform
>> >> available.
>> >> Simple to use. Nothing to install. Get started now for free."
>> >> http://p.sf.net/sfu/SauceLabs
>> >> _______________________________________________
>> >> Scikit-learn-general mailing list
>> >> Scikit-learn-general@lists.sourceforge.net
>> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>



