Yes, of course. What I mean is that I start out with about 19 Gigs (initial matrix size), it balloons to 100 Gigs *within the encoder function*, and it returns 28 Gigs (sparse one-hot matrix size). These numbers aren't exact, but you can see my point.
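As a rough sanity check on those sizes, here is a back-of-the-envelope sketch of how large the returned matrix should be relative to the input. It assumes 8-byte input values, a CSR result with 8-byte data and 4-byte indices, and exactly one nonzero per input entry; those dtypes are assumptions for illustration, not something confirmed in this thread, and both function names are hypothetical.

```python
def dense_input_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols input (itemsize is assumed)."""
    return n_rows * n_cols * itemsize


def csr_one_hot_bytes(n_rows, n_cols, data_itemsize=8, index_itemsize=4):
    """Bytes for a CSR one-hot result under the assumed dtypes."""
    nnz = n_rows * n_cols                    # one nonzero per input entry
    data = nnz * data_itemsize               # stored values (all ones)
    indices = nnz * index_itemsize           # column index of each nonzero
    indptr = (n_rows + 1) * index_itemsize   # row pointer array
    return data + indices + indptr


if __name__ == "__main__":
    n_rows, n_cols = 1000, 10
    ratio = csr_one_hot_bytes(n_rows, n_cols) / dense_input_bytes(n_rows, n_cols)
    print(f"output/input size ratio: {ratio:.2f}")
```

Under these assumptions the ratio approaches 1.5 as the number of columns grows, which is roughly in line with going from 19 Gigs to 28 Gigs. It says nothing about temporary arrays created inside the encoder, which is where the 100 Gigs peak would have to come from.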
Cheers,
Sarah

On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman <joel.noth...@gmail.com> wrote:

> OneHotEncoder will not magically reduce the size of your input. It will
> necessarily increase the memory of the input data as long as we are storing
> the results in scipy.sparse matrices. The sparse representation will be
> less expensive than the dense representation, but it won't be less
> expensive than the input.
>
> On 6 February 2018 at 13:24, Sarah Wait Zaranek <sarah.zara...@gmail.com>
> wrote:
>
>> Hi Joel -
>>
>> I am also seeing a huge overhead in memory for calling the
>> onehot-encoder. I have hacked it by splitting my matrix into 4-5 smaller
>> matrices (by columns), running it on each, and then concatenating the
>> results. But I am seeing upwards of 100 Gigs overhead. Should I file a
>> bug report? Or is this to be expected?
>>
>> Cheers,
>> Sarah
>>
>> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <
>> sarah.zara...@gmail.com> wrote:
>>
>>> Great. Thank you for all your help.
>>>
>>> Cheers,
>>> Sarah
>>>
>>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.noth...@gmail.com>
>>> wrote:
>>>
>>>> If you specify n_values=[list_of_vals_for_column1,
>>>> list_of_vals_for_column2], you should be able to engineer it to how you
>>>> want.
>>>>
>>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek <
>>>> sarah.zara...@gmail.com> wrote:
>>>>
>>>>> If I use the n+1 approach, then I get the correct matrix, except with
>>>>> the columns of zeros:
>>>>>
>>>>> >>> test
>>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>>>>>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>>>>>
>>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <
>>>>> sarah.zara...@gmail.com> wrote:
>>>>>
>>>>>> Hi Joel -
>>>>>>
>>>>>> Conceptually, that makes sense. But when I assign n_values, I can't
>>>>>> make it match the result when you don't specify them. See below. I used
>>>>>> the number of unique levels per column.
>>>>>>
>>>>>> >>> enc = OneHotEncoder(sparse=False)
>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>> >>> test
>>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>>>>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>>>>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4])
>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>> >>> test
>>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>>>>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>>>>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>>
>>>>>> Cheers,
>>>>>> Sarah
>>>>>>
>>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.noth...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> If each input column is encoded as a value from 0 to the (number of
>>>>>>> possible values for that column - 1) then n_values for that column
>>>>>>> should be the highest value + 1, which is also the number of levels
>>>>>>> per column. Does that make sense?
>>>>>>>
>>>>>>> Actually, I've realised there's a somewhat slow and unnecessary bit
>>>>>>> of code in the one-hot encoder: where the COO matrix is converted to
>>>>>>> CSR. I suspect this was done because most of our ML algorithms perform
>>>>>>> better on CSR, or else to maintain backwards compatibility with an
>>>>>>> earlier implementation.
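
The n_values outputs quoted above can be reproduced with a small pure-Python sketch. It assumes that the encoder places value v of input column j at output column offsets[j] + v, where offsets is the running sum of n_values, and that entries landing on the same output column are added together. That model is an assumption inferred from the posted arrays, not taken from the scikit-learn source, and both function names are hypothetical.

```python
from itertools import accumulate


def one_hot_with_n_values(rows, n_values):
    """Dense one-hot sketch: value v in input column j maps to offsets[j] + v."""
    offsets = [0] + list(accumulate(n_values))[:-1]
    width = sum(n_values)
    encoded_rows = []
    for row in rows:
        encoded = [0.0] * width
        for j, value in enumerate(row):
            # Colliding entries are summed, which is how a 2. can appear.
            encoded[offsets[j] + value] += 1.0
        encoded_rows.append(encoded)
    return encoded_rows


def drop_empty_columns(matrix):
    """Keep only the columns that contain at least one nonzero entry."""
    keep = [c for c in range(len(matrix[0])) if any(r[c] for r in matrix)]
    return [[r[c] for c in keep] for r in matrix]


if __name__ == "__main__":
    X = [[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]
    for row in one_hot_with_n_values(X, [3, 2, 4]):   # unique levels per column
        print(row)
    for row in drop_empty_columns(one_hot_with_n_values(X, [8, 3, 4])):
        print(row)
```

With n_values=[3, 2, 4] the offsets are [0, 3, 5], so the 7 in the first column lands in another column's block, and the 2 in the second column collides with the 0 in the third column at output column 5. With the highest-value-plus-one setting [8, 3, 4] nothing collides, and dropping the all-zero columns gives the same 9-column layout as the run without n_values.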
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn