[ https://issues.apache.org/jira/browse/MADLIB-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762365#comment-16762365 ]
Himanshu Pandey commented on MADLIB-1294:
-----------------------------------------

[~fmcquillan] [~dvaldano]

independent_varname in the minibatch preprocessor can accept a column name or an expression list:
{code:java}
Supported expressions are:
'ARRAY[x1,x2,x3]', where x1, x2, and x3 are columns in the source table containing scalar values.
A single column in the source table containing an array like ARRAY[1,2,3] or {1,2,3}.
{code}
If we have to change *independent_varname* to the actual column name, that is fine for a single column, but in the case of an expression like this:
{code:java}
ARRAY[diameter,height,whole,shucked,viscera,shell]
{code}
what should we change independent_varname to?

For example:
{code}
SELECT minibatch_preprocessor('minibatch_preprocessing_input',
                              'minibatch_preprocessing_out',
                              'rings',
                              'ARRAY[diameter,height,whole,shucked,viscera,shell]',
                              NULL,
                              4,
                              TRUE);
{code}
1. Instead of independent_varname, put all the columns of the expression in *{}*, like this: *{diameter,height,whole,shucked,viscera,shell}*. But this won't be meaningful if the list grows.
2. Give different names based on the number of columns? For a single column it would be the column name, and for a list something like independent_varnames?

Thoughts?

Please note that *dependent_varname* takes only one column, so we are good there.

Thanks

> Field names in output table for minibatch preprocessor
> ------------------------------------------------------
>
>                 Key: MADLIB-1294
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1294
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Domino Valdano
>            Assignee: Domino Valdano
>            Priority: Minor
>             Fix For: v1.16
>
>
> The minibatch preprocessor utility used for preparing input tables before
> training accepts "independent_varname" and "dependent_varname" as parameters.
> I believe the original intention was to have these refer to the names of the
> columns in the input table as well as the output table generated from it.
> However, there is a bug in the implementation where instead of writing out
> the output table columns as {independent_varname} and {dependent_varname}
> the curly braces were omitted, meaning whatever names were in the original
> table get wiped out and replaced by the literal strings 'independent_varname'
> and 'dependent_varname'.
>
> This makes little sense for several reasons:
> 1.) The contents of these columns are data, not variable names, so they end
> up misnamed in the output.
> 2.) This forces you to pass the argument strings 'independent_varname' and
> 'dependent_varname' as the column names of the resulting batched table to the
> fit/train function it's going to be fed into. In other words, if you're
> using the minibatch preprocessor, then these arguments to fit/train serve no
> purpose, since you always have to pass the same strings rather than a custom
> name.
> 3.) You can't pick your own names for these variables, unless you want to
> manually rename them every time after you run the minibatch preprocessor.
>
> Presently, we just finished making a similar minibatch preprocessing utility
> for deep learning support in madlib 1.16. I'd like to avoid reproducing this
> bug in the new utility, but we don't want them to be incompatible so that
> means we need to either fix both the old and new or neither. The only issue
> with fixing the old is that it's already been released that way. So I'm
> opening this bug report as a way of soliciting community feedback on the
> issue.
>
> If there is anyone who knows of a reason why this should be viewed as a
> feature rather than a bug, or has a need for the functionality to remain the
> same going forward, please comment. Thanks!
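
For reference, the manual renaming mentioned in point 3 of the quoted description could look roughly like the sketch below. It assumes the output table from the example call in the comment (minibatch_preprocessing_out) and the two literal column names the description reports. The target name features is purely illustrative, since an ARRAY[...] expression has no single source column name to fall back on.

```sql
-- Assumption: minibatch_preprocessing_out was produced by the example call
-- in the comment and has columns literally named dependent_varname and
-- independent_varname, as the issue description reports.
ALTER TABLE minibatch_preprocessing_out
    RENAME COLUMN dependent_varname TO rings;

-- 'features' is a hypothetical name chosen for illustration only.
ALTER TABLE minibatch_preprocessing_out
    RENAME COLUMN independent_varname TO features;
```

These statements would have to be repeated after every run of the preprocessor, which is the inconvenience the issue describes.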