Hi Satoshi,

Array output for *create_indicator_variables* would be quite helpful when the number of categories is large, and the svec representation would be ideal for it. There might be similar implications for *pivoting*, but we can keep that as a future discussion.
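To make the array-output idea concrete, here is a minimal sketch in plain Python. It is not MADlib code: the function names encode_indicator_array and to_run_length are hypothetical, and the run-length pairs are only loosely modeled on how svec compresses repeated values. It encodes each value as a single indicator array rather than one column per category, so the column count no longer depends on the number of categories.

```python
def encode_indicator_array(values):
    """Return (categories, rows): one 0/1 indicator array per input value.

    Hypothetical helper for illustration only; not part of MADlib.
    """
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # exactly one indicator is set per row
        rows.append(row)
    return categories, rows


def to_run_length(row):
    """Compress a row into (count, value) runs, similar in spirit to svec."""
    runs = []
    for x in row:
        if runs and runs[-1][1] == x:
            runs[-1] = (runs[-1][0] + 1, x)  # extend the current run
        else:
            runs.append((1, x))  # start a new run
    return runs


categories, rows = encode_indicator_array(["b", "a", "c", "a"])
compressed = [to_run_length(row) for row in rows]
```

With thousands of categories, each row is almost entirely zeros, so it collapses to at most three runs regardless of how many categories there are.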
I'm curious about how you're using the indicator variables - svec is not widely supported in MADlib (yet) and might not give much benefit after the encoding is complete.

Best,
Rahul

On Sun, Aug 7, 2016 at 1:50 AM, Satoshi Nagayasu <[email protected]> wrote:

> Hi,
>
> I'm trying create_indicator_variables() to encode categorical variables.
>
> https://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>
> And I found that PostgreSQL had a limitation of maximum number of variables
> in SELECT list (called target list in PostgreSQL), up to 1664.
>
> You may see this error when you have more than 1664 categories in your
> variable.
>
> spiexceptions.ProgramLimitExceeded: target lists can have at most 1664
> entries
>
> Now, I'm considering using PostgreSQL arrays to contain indicators instead
> of allocating single column per category.
>
> If create_indicator_variables() supports arrays as its output, it allows us
> to deal with categorical variables which have more than 1664 categories.
> And of course, I would like to use the sparse vector for it to compress
> them.
>
> https://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
>
> Seems good to you? Any comments?
>
> Regards,
> --
> Satoshi Nagayasu <[email protected]>
>

--
Rahul Iyer
Principal software engineer | Predictive Analytics
*Pivotal* *A new platform for a new era*
