Hi Joel -

I am also seeing a huge memory overhead when calling the one-hot encoder. I have worked around it by splitting my matrix into 4-5 smaller matrices (by columns), encoding each, and then concatenating the results. But I am seeing upwards of 100 GB of overhead. Should I file a bug report, or is this to be expected?
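For concreteness, here is a minimal sketch of the splitting workaround described above. The helper name and the block count of 4 are illustrative assumptions, and it uses the pre-0.20 OneHotEncoder API from this thread (where the flag is sparse, later renamed sparse_output). Splitting by columns does not change the result, because one-hot encoding treats each input column independently.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

def onehot_in_blocks(X, n_blocks=4):
    # Hypothetical helper: encode X block-by-block (by columns) and
    # concatenate, to limit peak memory during fit_transform.
    # Column-splitting is safe here because the encoding of each
    # column is independent of the others.
    X = np.asarray(X)
    blocks = np.array_split(X, min(n_blocks, X.shape[1]), axis=1)
    encoded = [OneHotEncoder(sparse=False).fit_transform(b) for b in blocks]
    return np.hstack(encoded)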
Cheers,
Sarah

On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:

> Great. Thank you for all your help.
>
> Cheers,
> Sarah
>
> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> If you specify n_values=[list_of_vals_for_column1,
>> list_of_vals_for_column2], you should be able to engineer it to how
>> you want.
>>
>> On 5 February 2018 at 16:31, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>
>>> If I use the n+1 approach, then I get the correct matrix, except with
>>> the columns of zeros:
>>>
>>> >>> test
>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>>>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>>>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>>>
>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>
>>>> Hi Joel -
>>>>
>>>> Conceptually, that makes sense. But when I assign n_values, I can't
>>>> make it match the result when you don't specify them. See below. I
>>>> used the number of unique levels per column.
>>>>
>>>> >>> enc = OneHotEncoder(sparse=False)
>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>> >>> test
>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>> >>> enc = OneHotEncoder(sparse=False, n_values=[3, 2, 4])
>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>> >>> test
>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>
>>>> Cheers,
>>>> Sarah
>>>>
>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>
>>>>> If each input column is encoded as a value from 0 to (the number of
>>>>> possible values for that column - 1), then n_values for that column
>>>>> should be the highest value + 1, which is also the number of levels
>>>>> per column. Does that make sense?
>>>>>
>>>>> Actually, I've realised there's a somewhat slow and unnecessary bit
>>>>> of code in the one-hot encoder: where the COO matrix is converted to
>>>>> CSR. I suspect this was done because most of our ML algorithms
>>>>> perform better on CSR, or else to maintain backwards compatibility
>>>>> with an earlier implementation.
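To make Joel's "highest value + 1" suggestion concrete, here is a small sketch using the same pre-0.20 API as the thread (n_values was deprecated in 0.20 in favour of categories and removed in 0.22, so this only runs on old scikit-learn). Setting n_values to the per-column maximum plus one reproduces the 15-column "n+1" matrix quoted above, with all-zero columns for levels that never occur:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[7, 0, 3],
              [1, 2, 0],
              [0, 2, 1],
              [1, 0, 2]])

# Highest value + 1 per column: [8, 3, 4], giving 8 + 3 + 4 = 15 output columns.
enc = OneHotEncoder(sparse=False, n_values=[8, 3, 4])
test = enc.fit_transform(X)
# Levels that never occur (e.g. 2 through 6 in the first column) come
# out as all-zero columns, matching the matrix quoted above.

By contrast, with n_values smaller than a column's actual maximum (as in n_values=[3, 2, 4] above), fit_transform maps out-of-range values into the output positions of neighbouring columns, and colliding entries are summed, which explains the stray 2.0 in that quoted output.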