I am not convinced we need this, even if only in the docs. We should not
encourage users to store sparse data in CSV format. Storing a large
high-dimensional dataset in CSV format could easily consume an entire disk
(if not compressed). Reading from the network row by row is even worse as
it would transfer megabytes just for the explicitly stored zeros.
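
To put rough numbers on this, here is a minimal sketch that compares a CSV with explicitly stored zeros against the same data held as CSR. The shape (200 x 1000) and the ~1% density are made-up values chosen only for illustration, not measurements from any real dataset.

```python
import csv
import io

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy dataset: 200 rows, 1000 features, roughly 1% nonzero.
rng = np.random.RandomState(0)
dense = np.zeros((200, 1000), dtype=np.float32)
mask = rng.rand(200, 1000) < 0.01
dense[mask] = 1.0

# Dense CSV: every zero is written out explicitly as text.
buf = io.StringIO()
csv.writer(buf).writerows(dense.tolist())
csv_bytes = len(buf.getvalue().encode("ascii"))

# CSR keeps only the nonzero values plus two index arrays.
X = csr_matrix(dense)
csr_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print("CSV bytes:", csv_bytes)
print("CSR bytes:", csr_bytes)
```

The gap grows with the number of features, since the CSV pays for every zero in every row while CSR only pays for stored entries.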
M.
On Sun, Aug 31, 2014 at 6:39 PM, Eustache DIEMERT <eusta...@diemert.fr>
wrote:
> Well yes, CSV is not particularly suited to sparse data, but the technique
> shown by Lars could be applied to any row-oriented format, be it text or
> data read from the network.
>
>
> 2014-08-31 10:56 GMT+02:00 Mathieu Blondel <math...@mblondel.org>:
>
>> Do you store zero entries explicitly in your CSV format? CSV doesn't
>> strike me as the best choice for representing sparse data...
>>
>> M.
>>
>>
>> On Sun, Aug 31, 2014 at 5:21 PM, Eustache DIEMERT <eusta...@diemert.fr>
>> wrote:
>>
>>> @Lars, shouldn't the last line of the for loop be
>>>
>>> indptr.append(indptr[-1]+len(nonzero))
>>>
>>> rather than
>>>
>>> indptr.append(i)
>>>
>>> ?
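
For reference, a minimal sketch of the loop with the cumulative `indptr` update proposed here. The in-memory CSV (`io.StringIO`) is a hypothetical stand-in for the open file `f`, and the explicit `.tolist()` / `np.asarray` conversions are additions for safety rather than part of the original snippet.

```python
import array
import csv
import io

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical in-memory CSV standing in for the open file `f`.
f = io.StringIO("0,1.5,0,0\n0,0,0,0\n2,0,0,3\n")

data = array.array("f")
indices = array.array("i")
indptr = array.array("i", [0])
n_features = 0
for row in csv.reader(f):
    row = np.array([float(v) for v in row], dtype=np.float32)
    n_features = len(row)
    nonzero = np.where(row)[0]
    data.extend(row[nonzero].tolist())
    indices.extend(nonzero.tolist())
    # indptr holds the running count of stored entries, not the row number.
    indptr.append(indptr[-1] + len(nonzero))

X = csr_matrix(
    (np.asarray(data), np.asarray(indices), np.asarray(indptr)),
    dtype=float,
    shape=(len(indptr) - 1, n_features),
)
print(X.toarray())
```

With `indptr.append(i)` the result is only correct when every row happens to contain exactly one nonzero; the all-zero second row in this toy input is enough to expose the difference.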
>>>
>>> FYI, here is the PR to include your snippet into the doc:
>>>
>>> https://github.com/scikit-learn/scikit-learn/pull/3610
>>>
>>> Eustache
>>>
>>>
>>> 2014-07-29 11:24 GMT+02:00 Lars Buitinck <larsm...@gmail.com>:
>>>
>>>> 2014-07-29 10:22 GMT+02:00 Eustache DIEMERT <eusta...@diemert.fr>:
>>>> > So my question is: is there some utility or snippet to load a CSV into
>>>> > CSR that I overlooked?
>>>>
>>>> No, but it's not that hard to write [1].
>>>>
>>>>
>>>> >>> import array
>>>> >>> import csv
>>>> >>> import numpy as np
>>>> >>> from scipy.sparse import csr_matrix
>>>> >>> data = array.array("f")
>>>> >>> indices = array.array("i")
>>>> >>> indptr = array.array("i", [0])
>>>> >>> for i, row in enumerate(csv.reader(f), 1):  # f: an open CSV file
>>>> ...     row = np.array(list(map(float, row)))
>>>> ...     n_features = len(row)
>>>> ...     nonzero = np.where(row)[0]
>>>> ...     data.extend(row[nonzero])
>>>> ...     indices.extend(nonzero)
>>>> ...     indptr.append(i)  # as posted; the reply above proposes indptr[-1] + len(nonzero)
>>>> ...
>>>> >>> X = csr_matrix((data, indices, indptr), dtype=float,
>>>> ...                shape=(i, n_features))
>>>>
>>>>
>>>> Instead of arrays, you can also use plain lists. Arrays take less
>>>> space, but they can be a tiny bit slower than lists.
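
The space difference mentioned above can be sketched as follows. This is a rough CPython-specific estimate, and the element count `n` is an arbitrary value chosen for illustration; it does not measure speed.

```python
import array
import sys

n = 100000  # arbitrary size for illustration
as_list = [float(i) for i in range(n)]
as_array = array.array("f", as_list)

# A list stores one pointer per element, and each float is a separate object.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)

# An array stores packed C floats; count only the payload here.
array_bytes = as_array.itemsize * len(as_array)

print("list  bytes (approx.):", list_bytes)
print("array bytes (payload):", array_bytes)
```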
>>>>
>>>>
>>>> [1] https://gist.github.com/larsmans/fe2a289818299dcb094a
>>>>
>>>>
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>