[ https://issues.apache.org/jira/browse/MADLIB-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1303:
------------------------------------
    Priority: Minor  (was: Major)

> Add 1-hot encoding to dependent variable in mini-batch preprocessor for images
> ------------------------------------------------------------------------------
>
>                 Key: MADLIB-1303
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1303
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v1.16
>
>
> Story
> As a data scientist, I want the mini-batch preprocessor to 1-hot encode
> the dependent variable so that I don't need to do it myself. This applies to
> all types: boolean; character types such as text, char and varchar; and
> integers and floats.
> If the dependent variable is already an array, then we assume it is already
> 1-hot encoded, so we just cast it to int[] and pass it along.
> We can remove the param `dependent_offset (optional)` from the current
> interface since 1-hot encoding is the more general solution.
>
> Open questions
> 1) Q: Can we just use the exact same 1-hot encoding as in
> http://madlib.apache.org/docs/latest/group__grp__minibatch__preprocessing.html
> i.e., add the param `one_hot_encode_int_dep_var (optional)`?
> Then we could use the same code that is already written and tested.
> A: We can reuse the code to the extent possible, but we do not need this
> param.
> 2) Q: In the case where the dependent variable is already 1-hot encoded, we
> need to support array input for the dependent variable. Also, should we
> just pass it through, or check that the array contains only 1's and 0's?
> A: We will check the first row, but that does not guarantee all rows are
> correct.
> 3) Q: How to handle float? If the user wants to encode float values for some
> reason, they could cast them to text first. Or do we just pass them along?
> A: If scalar float, we 1-hot encode (could be a valid case). If float[], we
> cast to int[].
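
The rules in the story and the answers above can be illustrated with a small Python sketch. This is not MADlib code: the function name `one_hot_encode_dep_var` is hypothetical, the ordering of classes (sorted distinct values) is an assumption, and the first-row check is assumed to mean "contains only 0's and 1's", which the issue does not state explicitly.

```python
from typing import List, Sequence, Union

Scalar = Union[bool, int, float, str]


def one_hot_encode_dep_var(
    values: Sequence[Union[Scalar, Sequence[float]]]
) -> List[List[int]]:
    """Hypothetical helper sketching the encoding rules from this issue."""
    if not values:
        return []

    # Array input: assume it is already 1-hot encoded and cast to int.
    # Only the first row is checked, so later rows may still be invalid.
    if isinstance(values[0], (list, tuple)):
        if any(v not in (0, 1) for v in values[0]):
            raise ValueError("first row is not an array of 0's and 1's")
        return [[int(v) for v in row] for row in values]

    # Scalar input (boolean, text, integer or float): one class per
    # distinct value. Sorted order is an assumption made for this sketch.
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [
        [1 if index[v] == i else 0 for i in range(len(classes))]
        for v in values
    ]
```

For example, `one_hot_encode_dep_var(["cat", "dog", "cat"])` yields `[[1, 0], [0, 1], [1, 0]]`, while a float array such as `[[0.0, 1.0], [1.0, 0.0]]` is only cast to integers.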
-- This message was sent by Atlassian JIRA (v7.6.3#76005)