> We should not encourage users to store sparse data in CSV format.

+1

> the technique shown by Lars could be applied to any row-oriented format,
> be it text or data read from the network.

Perhaps, but then they can build a sparse representation directly, such as
one dict per row that is passed to DictVectorizer.
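
To make that concrete, here is a minimal sketch of the dict route. The feature
names and values are made up for illustration; the only assumption is that
each incoming row can be turned into a dict holding just its nonzero entries.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows: each dict lists only the nonzero features of one sample.
rows = [
    {"f0": 1.0, "f3": 2.5},
    {"f1": 4.0},
    {},                       # an all-zero row is simply an empty dict
    {"f0": 3.0, "f2": 1.5},
]

vec = DictVectorizer(sparse=True)   # sparse output is the default
X = vec.fit_transform(rows)         # SciPy sparse matrix, one row per dict

print(X.shape, X.nnz)
```

The zeros are never materialized, so neither the sender nor the reader pays
for them. The column order is available afterwards in vec.feature_names_.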


On 31 August 2014 20:03, Mathieu Blondel <math...@mblondel.org> wrote:

> I am not convinced we need this, even if only in the docs. We should not
> encourage users to store sparse data in CSV format. Storing a large
> high-dimensional dataset in CSV format could easily consume an entire disk
> (if not compressed). Reading from the network row by row is even worse as
> it would transfer megabytes just for the explicitly stored zeros.
>
> M.
>
>
> On Sun, Aug 31, 2014 at 6:39 PM, Eustache DIEMERT <eusta...@diemert.fr>
> wrote:
>
>> Well yes, CSV is not particularly suited to sparse data, but the technique
>> shown by Lars could be applied to any row-oriented format, be it text or
>> data read from the network.
>>
>>
>> 2014-08-31 10:56 GMT+02:00 Mathieu Blondel <math...@mblondel.org>:
>>
>>> Do you store zero entries explicitly in your CSV format? CSV doesn't
>>> strike me as the best choice for representing sparse data...
>>>
>>> M.
>>>
>>>
>>>  On Sun, Aug 31, 2014 at 5:21 PM, Eustache DIEMERT <eusta...@diemert.fr>
>>> wrote:
>>>
>>>>  @Lars, shouldn't the last line of the for loop be
>>>>
>>>>   indptr.append(indptr[-1]+len(nonzero))
>>>>
>>>> rather than
>>>>
>>>>    indptr.append(i)
>>>>
>>>> ?
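
For reference, a sketch of the loop with that correction applied, written for
Python 3. In CSR, indptr[k] must hold the number of nonzeros stored before
row k, so each iteration appends the running count rather than the row
number. The function name csv_to_csr and the sample data are hypothetical,
and the typecode "d" (double) is my choice here, not part of the original
snippet.

```python
import array
import csv
import io

import numpy as np
from scipy.sparse import csr_matrix


def csv_to_csr(f):
    """Read a dense numeric CSV from file object f into a CSR matrix."""
    data = array.array("d")          # nonzero values
    indices = array.array("i")       # column index of each nonzero value
    indptr = array.array("i", [0])   # indptr[k] = nonzeros stored before row k
    n_rows = n_features = 0
    for row in csv.reader(f):
        if not row:                  # skip blank lines
            continue
        values = np.array([float(x) for x in row])
        n_features = len(values)
        nonzero = np.flatnonzero(values)
        data.extend(values[nonzero].tolist())
        indices.extend(nonzero.tolist())
        # Running total of nonzeros, i.e. indptr[-1] + len(nonzero).
        indptr.append(len(indices))
        n_rows += 1
    return csr_matrix((data, indices, indptr), dtype=float,
                      shape=(n_rows, n_features))


# Tiny in-memory example; the third row is all zeros.
sample = io.StringIO("0,0,1.5\n2,0,0\n0,0,0\n0,3,4\n")
X = csv_to_csr(sample)
print(X.indptr)
print(X.toarray())
```

With indptr.append(i) the result would only be right when every row has
exactly one nonzero; the all-zero row above is the case that exposes it.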
>>>>
>>>> FYI, here is the PR to include your snippet into the doc:
>>>>
>>>> https://github.com/scikit-learn/scikit-learn/pull/3610
>>>>
>>>> Eustache
>>>>
>>>>
>>>> 2014-07-29 11:24 GMT+02:00 Lars Buitinck <larsm...@gmail.com>:
>>>>
>>>>> 2014-07-29 10:22 GMT+02:00 Eustache DIEMERT <eusta...@diemert.fr>:
>>>>> > So my question is: is there some utility or snippet to load a CSV into
>>>>> > CSR that I overlooked?
>>>>>
>>>>> No, but it's not that hard to write [1].
>>>>>
>>>>>
>>>>> >>> import array
>>>>> >>> data = array.array("f")
>>>>> >>> indices = array.array("i")
>>>>> >>> indptr = array.array("i", [0])
>>>>> >>> for i, row in enumerate(csv.reader(f), 1):
>>>>> ...     row = np.array(map(float, row))
>>>>> ...     n_features = len(row)
>>>>> ...     nonzero = np.where(row)[0]
>>>>> ...     data.extend(row[nonzero])
>>>>> ...     indices.extend(nonzero)
>>>>> ...     indptr.append(i)
>>>>> ...
>>>>> >>> X = csr_matrix((data, indices, indptr), dtype=float, shape=(i, n_features))
>>>>>
>>>>>
>>>>> Instead of arrays, you can also use plain lists. Arrays take less
>>>>> space, but they can be a tiny bit slower than lists.
>>>>>
>>>>>
>>>>> [1] https://gist.github.com/larsmans/fe2a289818299dcb094a
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general