Here is the JIRA with attached requirements doc. https://issues.apache.org/jira/browse/MADLIB-1038
Please put your comments in the JIRA. There are still some outstanding questions to be puzzled out.

Frank

On Fri, Oct 28, 2016 at 3:04 PM, Frank McQuillan <[email protected]> wrote:

> Yes thanks Vatsan we have been looking at that.
>
> On Fri, Oct 28, 2016 at 2:39 PM, Srivatsan R <[email protected]> wrote:
>
>> You guys may have already seen this, but linking just in case:
>> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
>>
>> On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung <[email protected]> wrote:
>>
>>> +Vatsan for his thoughts as well!
>>>
>>> On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <[email protected]> wrote:
>>>
>>>> Also agree that double-quoted column names are not ideal. In addition to
>>>> the net-new features described in this thread, it'd be nice to see
>>>> non-double-quoted output as default behavior in the
>>>> existing create_indicator_variables() function.
>>>>
>>>> Thanks,
>>>> Woo
>>>>
>>>> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <[email protected]> wrote:
>>>>
>>>>> I like the one-hot encoded feature. Another variant of this idea would
>>>>> be an "all other" variable (distinct from the reference class) that
>>>>> contains occurrences of the less frequent category types. In both of these
>>>>> scenarios, the threshold for 'less frequent' could be user-supplied.
>>>>>
>>>>> Thanks,
>>>>> Woo
>>>>>
>>>>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <[email protected]> wrote:
>>>>>
>>>>>> An alternative to dropping is to assign the less frequent values to the
>>>>>> reference i.e. all one-hot encoded features will be 0.
>>>>>> Also important to note: total runtime will increase with this option since
>>>>>> we'll have to compute the exact frequency distribution.
>>>>>>
>>>>>> Another suggested change is to call this function 'one_hot_encoding' since
>>>>>> that is the output here (similar to sklearn's OneHotEncoder
>>>>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>>>>>> We can keep the current name as a deprecated alias till 2.0 is released.
>>>>>>
>>>>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <[email protected]> wrote:
>>>>>>
>>>>>>> Jarrod,
>>>>>>>
>>>>>>> Just trying to write up detailed requirements. How would you see this one
>>>>>>> working?
>>>>>>>
>>>>>>> "2) Option to dummy code only the top n most frequently occurring values in
>>>>>>> any column"
>>>>>>>
>>>>>>> With 1 column I can picture it, you would drop the rows with the less
>>>>>>> frequently occurring values and end up with a smaller table. But what if
>>>>>>> you are encoding multiple rows? Would you want a per row specification
>>>>>>> of n? i.e., top 3 values for column x, top 10 values for column y? If you
>>>>>>> did this then your result set might include low frequency values for column
>>>>>>> x (not in top 3) because they are in the top 10 for column y - this might
>>>>>>> be confusing.
>>>>>>>
>>>>>>> Frank
>>>>>>>
>>>>>>> On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <[email protected]> wrote:
>>>>>>>
>>>>>>>> great, thanks for the additional information
>>>>>>>>
>>>>>>>> Frank
>>>>>>>>
>>>>>>>> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> IMO
>>>>>>>>>
>>>>>>>>> 1) Option to define resulting column names. Please see pdltools
>>>>>>>>> implementation - the ability to pass in a function is especially useful
>>>>>>>>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>>>>>>>> 2) Option to dummy code only the top n most frequently occurring values
>>>>>>>>> in any column
>>>>>>>>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>>>>>>>> pivotcol_val2 ...) instead of values in column names + secondary mapping table
>>>>>>>>> 4) Option to exclude original column from results table
>>>>>>>>>
>>>>>>>>> (1) & (2) are much higher priority than (3) & (4).
>>>>>>>>>
>>>>>>>>> Agreed that these could also be applied to Pivoting (especially 1).
>>>>>>>>>
>>>>>>>>> Jarrod Vawdrey
>>>>>>>>> Sr. Data Scientist
>>>>>>>>> Data Science & Engineering | Pivotal
>>>>>>>>> (650) 315-8905
>>>>>>>>> https://pivotal.io/
>>>>>>>>>
>>>>>>>>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for those suggestions, Jarrod. They all sound pretty useful -
>>>>>>>>>> would you mind taking a crack at numbering them 1,2,3... etc, in the
>>>>>>>>>> order of priority as you see it?
>>>>>>>>>>
>>>>>>>>>> Also it seems like some of these could be applied to the Pivot
>>>>>>>>>> function as well, e.g., UDF for column naming.
>>>>>>>>>>
>>>>>>>>>> Frank
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Frank,
>>>>>>>>>>>
>>>>>>>>>>> How are special character values handled today? It is often not ideal
>>>>>>>>>>> to end up with column names that require double quotes to call due to
>>>>>>>>>>> downstream scripts.
>>>>>>>>>>>
>>>>>>>>>>> A couple of features that would be useful
>>>>>>>>>>>
>>>>>>>>>>> * Option to define resulting column names. Please see pdltools
>>>>>>>>>>> implementation - the ability to pass in a function is especially useful
>>>>>>>>>>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>>>>>>>>>> * Option to dummy code only the top n most frequently occurring
>>>>>>>>>>> values in any column
>>>>>>>>>>> * Option to exclude original column from results table
>>>>>>>>>>> * Option to create numeric column names (E.g. pivotcol_val1,
>>>>>>>>>>> pivotcol_val2 ...) instead of values in column names + secondary
>>>>>>>>>>> mapping table
>>>>>>>>>>>
>>>>>>>>>>> Thank you
>>>>>>>>>>>
>>>>>>>>>>> Jarrod Vawdrey
>>>>>>>>>>> Sr. Data Scientist
>>>>>>>>>>> Data Science & Engineering | Pivotal
>>>>>>>>>>> (650) 315-8905
>>>>>>>>>>> https://pivotal.io/
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> For the module encoding categorical variables
>>>>>>>>>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>>>>>>>>>>> does anyone have any suggestions on improvements that we could make?
>>>>>>>>>>>>
>>>>>>>>>>>> Here is a video on how encoding categorical variables works for
>>>>>>>>>>>> those not familiar with it
>>>>>>>>>>>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ

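For anyone following along, here is a rough sketch of the "top n" idea discussed in this thread, written in plain Python rather than SQL. It is not MADlib code and not the proposed design: the function names `one_hot_top_n` and `safe_column_name` and the `other_label` parameter are made up for illustration. With `other_label=None`, less frequent values are folded into the reference (all zeros, as Rahul suggested); with a label, they go into an "all other" column (as Woo suggested). The naming helper assumes that lowercase letters, digits, and underscores are enough to avoid double quoting.

```python
import re
from collections import Counter


def safe_column_name(column, value):
    """Hypothetical helper: build a column name that needs no double quotes.

    Assumes `column` is itself a valid unquoted identifier. Every character
    of the value outside [a-z0-9_] is replaced by an underscore.
    """
    cleaned = re.sub(r"[^a-z0-9_]", "_", str(value).lower())
    return f"{column}_{cleaned}"


def one_hot_top_n(values, top_n, other_label=None):
    """One-hot encode only the top_n most frequent values of one column.

    Returns (columns, rows). Values outside the top_n become all-zero rows
    when other_label is None, or set a trailing other_label column otherwise.
    Assumes other_label does not collide with a kept value.
    """
    # Exact frequency distribution: this is the extra pass mentioned above.
    counts = Counter(values)
    # Most frequent first; ties broken by string value so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    kept = [value for value, _ in ranked[:top_n]]
    kept_set = set(kept)

    columns = list(kept)
    if other_label is not None:
        columns.append(other_label)

    rows = []
    for value in values:
        if value in kept_set:
            rows.append([1 if c == value else 0 for c in columns])
        elif other_label is not None:
            rows.append([1 if c == other_label else 0 for c in columns])
        else:
            rows.append([0] * len(columns))  # folded into the reference
    return columns, rows


if __name__ == "__main__":
    data = ["a", "b", "a", "c", "a", "b", "d"]
    print(one_hot_top_n(data, top_n=2))
    print(one_hot_top_n(data, top_n=2, other_label="other"))
    print(safe_column_name("color", "Dark Blue!"))
```

Calling this once per column with its own `top_n` keeps every input row in the output, which sidesteps the multi-column confusion Frank raised about dropping rows.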