Jarrod,

Just trying to write up detailed requirements. How would you see this one working?

"2) Option to dummy code only the top n most frequently occurring values in any column"

With 1 column I can picture it: you would drop the rows with the less frequently occurring values and end up with a smaller table.

But what if you are encoding multiple columns? Would you want a per-column specification of n, i.e., top 3 values for column x, top 10 values for column y? If you did this, then your result set might include low-frequency values for column x (not in its top 3) because those rows hold a top-10 value for column y. This might be confusing.

Frank

On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <[email protected]> wrote:

> great, thanks for the additional information
>
> Frank
>
> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <[email protected]> wrote:
>
>> IMO
>>
>> 1) Option to define resulting column names. Please see pdltools
>> implementation - the ability to pass in a function is especially useful
>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> 2) Option to dummy code only the top n most frequently occurring values
>> in any column
>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> pivotcol_val2 ...) instead of values in column names + secondary
>> mapping table
>> 4) Option to exclude original column from results table
>>
>> (1) & (2) are much higher priority than (3) & (4).
>>
>> Agreed that these could also be applied to Pivoting (especially 1).
>>
>> Jarrod Vawdrey
>> Sr. Data Scientist
>> Data Science & Engineering | Pivotal
>> (650) 315-8905
>> https://pivotal.io/
>>
>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <[email protected]> wrote:
>>
>> > Thanks for those suggestions, Jarrod. They all sound pretty useful -
>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>> > order of priority as you see it?
>> >
>> > Also it seems like some of these could be applied to the Pivot
>> > function as well, e.g., UDF for column naming.
>> >
>> > Frank
>> >
>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <[email protected]> wrote:
>> >
>> >> Hey Frank,
>> >>
>> >> How are special character values handled today? It is often not ideal
>> >> to end up with column names that require double quotes to call due to
>> >> downstream scripts.
>> >>
>> >> A couple of features that would be useful
>> >>
>> >> * Option to define resulting column names. Please see pdltools
>> >> implementation - the ability to pass in a function is especially
>> >> useful
>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >> * Option to dummy code only the top n most frequently occurring values
>> >> in any column
>> >> * Option to exclude original column from results table
>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >> pivotcol_val2 ...) instead of values in column names + secondary
>> >> mapping table
>> >>
>> >> Thank you
>> >>
>> >> Jarrod Vawdrey
>> >> Sr. Data Scientist
>> >> Data Science & Engineering | Pivotal
>> >> (650) 315-8905
>> >> https://pivotal.io/
>> >>
>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <[email protected]> wrote:
>> >>
>> >>> For the module encoding categorical variables
>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>> does anyone have any suggestions on improvements that we could make?
>> >>>
>> >>> Here is a video on how encoding categorical variables works for those
>> >>> not familiar with it
>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
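
To make the per-column top-n question in this thread concrete, here is a minimal Python sketch of one possible reading, not of anything MADlib does today. It assumes each column gets its own n and that rows are never dropped: a value outside its column's top n simply gets 0 in every indicator for that column. The function name `top_n_dummy_code`, the list-of-dicts input, and the `column_value` output naming are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def top_n_dummy_code(rows, top_n):
    """Dummy code the columns named in top_n, keeping only each column's
    n most frequent values.

    rows  : list of dicts, one per table row (hypothetical input format)
    top_n : dict mapping column name -> n for that column

    Assumption: rows are never dropped. A value outside its column's
    top n gets 0 in all of that column's indicator columns.
    """
    # Find the n most frequent values of each encoded column.
    # Counter.most_common orders equal counts by first appearance.
    keep = {
        col: [value for value, _ in Counter(r[col] for r in rows).most_common(n)]
        for col, n in top_n.items()
    }

    encoded = []
    for r in rows:
        # Carry over untouched columns; the encoded originals are left out.
        out = {c: v for c, v in r.items() if c not in top_n}
        for col, values in keep.items():
            for value in values:
                out[f"{col}_{value}"] = 1 if r[col] == value else 0
        encoded.append(out)
    return encoded


rows = [
    {"id": 1, "x": "a", "y": "p"},
    {"id": 2, "x": "a", "y": "q"},
    {"id": 3, "x": "b", "y": "p"},
    {"id": 4, "x": "c", "y": "r"},
]

# Top 1 value for column x, top 2 values for column y.
result = top_n_dummy_code(rows, {"x": 1, "y": 2})
for row in result:
    print(row)
```

Under this reading the output keeps all four rows, and the row with id 4 has zeros in every indicator column, so the result never mixes "kept" and "dropped" values within a column. The row-dropping interpretation described in the message above would instead need a rule for rows that qualify on one column but not another.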
