You guys may have already seen this, but linking just in case: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
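
For anyone skimming the quoted thread, here is a rough sketch of the "top n most frequent values" idea, with the two fallbacks that were proposed for the remaining values: an "all other" indicator column, or folding them into the reference class so that every indicator is 0. It is plain Python for illustration only, not MADlib or pandas code. The function name `one_hot_top_n`, its parameters, and the `all_other` label are all made up for this example.

```python
from collections import Counter


def one_hot_top_n(values, top_n, other_label="all_other"):
    """One-hot encode only the top_n most frequent values (hypothetical helper).

    Rare values go to a single `other_label` column. Passing
    other_label=None instead folds them into the reference class,
    i.e. every indicator for that row is 0.
    """
    # Computing the exact frequency distribution costs a full pass over the data.
    counts = Counter(values)
    kept = [value for value, _ in counts.most_common(top_n)]

    columns = list(kept)
    if other_label is not None:
        columns.append(other_label)

    rows = []
    for value in values:
        # Rare values map to other_label (or to no column at all when it is None).
        key = value if value in kept else other_label
        rows.append({col: int(col == key) for col in columns})
    return columns, rows


colors = ["red", "blue", "red", "green", "red", "blue", "teal"]

# Variant 1: rare values collected in an "all other" column.
columns, encoded = one_hot_top_n(colors, top_n=2)

# Variant 2: rare values assigned to the reference class (all zeros).
ref_columns, ref_encoded = one_hot_top_n(colors, top_n=2, other_label=None)

print(columns)
print(encoded[3])
print(ref_encoded[3])
```

Note that no rows are dropped in either variant, which sidesteps the multi-column question raised in the thread: each column can take its own n without affecting which rows appear in the output.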

On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung <[email protected]> wrote:

> +Vatsan for his thoughts as well!
>
> On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <[email protected]> wrote:
>
>> Also agree that double-quoted column names are not ideal. In addition to
>> the net-new features described in this thread, it'd be nice to see
>> non-double-quoted output as default behavior in the existing
>> create_indicator_variables() function.
>>
>> Thanks,
>> Woo
>>
>> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <[email protected]> wrote:
>>
>>> I like the one-hot encoded feature. Another variant of this idea would
>>> be an "all other" variable (distinct from the reference class) that
>>> contains occurrences of the less frequent category types. In both of
>>> these scenarios, the threshold for 'less frequent' could be
>>> user-supplied.
>>>
>>> Thanks,
>>> Woo
>>>
>>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <[email protected]> wrote:
>>>
>>>> An alternative to dropping is to assign the less frequent values to
>>>> the reference, i.e., all one-hot encoded features will be 0.
>>>> Also important to note: total runtime will increase with this option
>>>> since we'll have to compute the exact frequency distribution.
>>>>
>>>> Another suggested change is to call this function 'one_hot_encoding'
>>>> since that is the output here (similar to sklearn's OneHotEncoder
>>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>>>> We can keep the current name as a deprecated alias till 2.0 is released.
>>>>
>>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <[email protected]>
>>>> wrote:
>>>>
>>>> > Jarrod,
>>>> >
>>>> > Just trying to write up detailed requirements. How would you see
>>>> > this one working?
>>>> >
>>>> > "2) Option to dummy code only the top n most frequently occurring
>>>> > values in any column"
>>>> >
>>>> > With 1 column I can picture it: you would drop the rows with the less
>>>> > frequently occurring values and end up with a smaller table. But what
>>>> > if you are encoding multiple columns? Would you want a per-column
>>>> > specification of n, i.e., top 3 values for column x, top 10 values for
>>>> > column y? If you did this then your result set might include low
>>>> > frequency values for column x (not in top 3) because they are in the
>>>> > top 10 for column y - this might be confusing.
>>>> >
>>>> > Frank
>>>> >
>>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <[email protected]>
>>>> > wrote:
>>>> >
>>>> >> great, thanks for the additional information
>>>> >>
>>>> >> Frank
>>>> >>
>>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <[email protected]>
>>>> >> wrote:
>>>> >>
>>>> >>> IMO
>>>> >>>
>>>> >>> 1) Option to define resulting column names. Please see pdltools
>>>> >>> implementation - the ability to pass in a function is especially
>>>> >>> useful
>>>> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>>> >>> 2) Option to dummy code only the top n most frequently occurring
>>>> >>> values in any column
>>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>>> >>> pivotcol_val2 ...) instead of values in column names + secondary
>>>> >>> mapping table
>>>> >>> 4) Option to exclude original column from results table
>>>> >>>
>>>> >>> (1) & (2) are much higher priority than (3) & (4).
>>>> >>>
>>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>>>> >>>
>>>> >>> Jarrod Vawdrey
>>>> >>> Sr. Data Scientist
>>>> >>> Data Science & Engineering | Pivotal
>>>> >>> (650) 315-8905
>>>> >>> https://pivotal.io/
>>>> >>>
>>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <[email protected]>
>>>> >>> wrote:
>>>> >>>
>>>> >>> > Thanks for those suggestions, Jarrod. They all sound pretty
>>>> >>> > useful - would you mind taking a crack at numbering them
>>>> >>> > 1,2,3... etc, in the order of priority as you see it?
>>>> >>> >
>>>> >>> > Also it seems like some of these could be applied to the Pivot
>>>> >>> > function as well, e.g., UDF for column naming.
>>>> >>> >
>>>> >>> > Frank
>>>> >>> >
>>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <[email protected]>
>>>> >>> > wrote:
>>>> >>> >
>>>> >>> >> Hey Frank,
>>>> >>> >>
>>>> >>> >> How are special character values handled today? It is often not
>>>> >>> >> ideal to end up with column names that require double quotes to
>>>> >>> >> call due to downstream scripts.
>>>> >>> >>
>>>> >>> >> A couple of features that would be useful
>>>> >>> >>
>>>> >>> >> * Option to define resulting column names. Please see pdltools
>>>> >>> >> implementation - the ability to pass in a function is especially
>>>> >>> >> useful
>>>> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>>> >>> >> * Option to dummy code only the top n most frequently occurring
>>>> >>> >> values in any column
>>>> >>> >> * Option to exclude original column from results table
>>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>>> >>> >> mapping table
>>>> >>> >>
>>>> >>> >> Thank you
>>>> >>> >>
>>>> >>> >> Jarrod Vawdrey
>>>> >>> >> Sr. Data Scientist
>>>> >>> >> Data Science & Engineering | Pivotal
>>>> >>> >> (650) 315-8905
>>>> >>> >> https://pivotal.io/
>>>> >>> >>
>>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan
>>>> >>> >> <[email protected]> wrote:
>>>> >>> >>
>>>> >>> >>> For the module encoding categorical variables
>>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>>> >>> >>> does anyone have any suggestions on improvements that we could
>>>> >>> >>> make?
>>>> >>> >>>
>>>> >>> >>> Here is a video on how encoding categorical variables works for
>>>> >>> >>> those not familiar with it
>>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
