Thanks, this makes sense. I will try using the CategoricalEncoder to see the difference. It wouldn't be such a big deal if my input matrix weren't so large. Thanks again for all your help.

Cheers,
Sarah
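For reference, a minimal sketch of what that swap might look like, assuming the development-version API of early 2018 (as noted below, CategoricalEncoder may yet be merged into OneHotEncoder; `X` here stands for the integer input matrix):

from sklearn.preprocessing import CategoricalEncoder  # dev version only

# Unlike OneHotEncoder, the categories per column are learned from the
# data itself rather than assumed to be the integers 0..max, so no
# all-zero columns appear for levels that never occur.
enc = CategoricalEncoder(encoding='onehot')  # returns a sparse matrix
X_onehot = enc.fit_transform(X)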
On Mon, Feb 5, 2018 at 10:33 PM, Joel Nothman <joel.noth...@gmail.com> wrote:

> Yes, the output CSR representation requires:
> 1 (dtype) value per entry
> 1 int32 per entry
> 1 int32 per row
>
> The intermediate COO representation requires:
> 1 (dtype) value per entry
> 2 int32 per entry
>
> So as long as the transformation from COO to CSR is done over the whole
> data, it will occupy roughly 5x the input size, which is exactly what you
> are experiencing.
>
> The CategoricalEncoder currently available in the development version of
> scikit-learn does not have this problem, but might be slower due to
> handling non-integer categories. It will also possibly disappear and be
> merged into OneHotEncoder soon (see PR #10523).
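For concreteness, that accounting can be checked directly with scipy (the matrix size here is made up; only the ratio matters):

import scipy.sparse as sp

# Any COO matrix will do; OneHotEncoder builds one internally.
coo = sp.random(100000, 1000, density=0.01, format='coo', random_state=0)

# COO holds one value plus two index entries per stored element.
coo_bytes = coo.data.nbytes + coo.row.nbytes + coo.col.nbytes

# CSR holds one value plus one index entry per stored element,
# plus one row pointer per row (and one extra).
csr = coo.tocsr()
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(coo_bytes, csr_bytes)

While tocsr() runs over the whole matrix, both copies are alive at once, which is where the transient peak described above comes from.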
> On 6 February 2018 at 13:53, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>
>> Yes, of course. What I meant is that I start out with 19 GB (the initial
>> matrix size) or so; it balloons to 100 GB *within the encoder function*
>> and returns 28 GB (the sparse one-hot matrix size). These numbers aren't
>> exact, but you can see my point.
>>
>> Cheers,
>> Sarah
>>
>> On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>
>>> OneHotEncoder will not magically reduce the size of your input. As long
>>> as we are storing the results in scipy.sparse matrices, it will
>>> necessarily take more memory than the input data. The sparse
>>> representation will be less expensive than the dense representation,
>>> but it won't be less expensive than the input.
>>>
>>> On 6 February 2018 at 13:24, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>
>>>> Hi Joel -
>>>>
>>>> I am also seeing a huge overhead in memory when calling the
>>>> OneHotEncoder. I have worked around it by splitting my matrix into
>>>> 4-5 smaller matrices (by columns), encoding each, and then
>>>> concatenating the results (a sketch of this workaround appears at the
>>>> end of the thread). But I am still seeing upwards of 100 GB of
>>>> overhead. Should I file a bug report, or is this to be expected?
>>>>
>>>> Cheers,
>>>> Sarah
>>>>
>>>> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>>
>>>>> Great. Thank you for all your help.
>>>>>
>>>>> Cheers,
>>>>> Sarah
>>>>>
>>>>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>
>>>>>> If you specify n_values=[list_of_vals_for_column1,
>>>>>> list_of_vals_for_column2], you should be able to engineer it to
>>>>>> what you want.
>>>>>>
>>>>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>>>>
>>>>>>> If I use the n+1 approach, then I get the correct matrix, except
>>>>>>> with the extra columns of zeros:
>>>>>>>
>>>>>>> >>> test
>>>>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>>>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>>>>>>>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>>>>>>>
>>>>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Joel -
>>>>>>>>
>>>>>>>> Conceptually, that makes sense. But when I assign n_values, I
>>>>>>>> can't make it match the result when they aren't specified. See
>>>>>>>> below. I used the number of unique levels per column.
>>>>>>>>
>>>>>>>> >>> enc = OneHotEncoder(sparse=False)
>>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>>>> >>> test
>>>>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>>>>>>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>>>>>>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>>>> >>> enc = OneHotEncoder(sparse=False, n_values=[3, 2, 4])
>>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>>>> >>> test
>>>>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>>>>>>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>>>>>>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Sarah
>>>>>>>>
>>>>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> If each input column is encoded as a value from 0 to (the number
>>>>>>>>> of possible values for that column - 1), then n_values for that
>>>>>>>>> column should be the highest value + 1, which is also the number
>>>>>>>>> of levels per column. Does that make sense?
>>>>>>>>>
>>>>>>>>> Actually, I've realised there's a somewhat slow and unnecessary
>>>>>>>>> bit of code in the one-hot encoder: where the COO matrix is
>>>>>>>>> converted to CSR. I suspect this was done because most of our ML
>>>>>>>>> algorithms perform better on CSR, or else to maintain backwards
>>>>>>>>> compatibility with an earlier implementation.
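That rule ties the two transcripts above together: the per-column maxima of the sample data are 7, 2 and 3, so "highest value + 1" gives n_values=[8, 3, 4]. A minimal sketch against the same API:

from sklearn.preprocessing import OneHotEncoder

X = [[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]

# One output column per possible level: 8 + 3 + 4 = 15 columns.
enc = OneHotEncoder(sparse=False, n_values=[8, 3, 4])
test = enc.fit_transform(X)

This reproduces the 15-column matrix shown earlier. With n_values='auto', the encoder additionally drops the all-zero columns for levels that never occur (such as 2 through 6 in the first column), leaving the 9-column matrix.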
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
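Finally, a rough sketch of the column-splitting workaround mentioned mid-thread (the block count and function name are illustrative). Because OneHotEncoder treats columns independently, encoding blocks of columns and stacking the results gives the same matrix as encoding everything at once, while the transient COO-to-CSR overhead is only paid for one block at a time:

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

def onehot_in_column_blocks(X, n_blocks=5):
    # Split the columns into n_blocks groups, encode each group on its
    # own, and stitch the sparse results back together; peak memory is
    # bounded by the largest block rather than by the whole matrix.
    blocks = np.array_split(np.asarray(X), n_blocks, axis=1)
    encoded = [OneHotEncoder().fit_transform(block) for block in blocks]
    return sp.hstack(encoded, format='csr')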