[jira] [Commented] (MADLIB-1294) Field names in output table for minibatch preprocessor

Himanshu Pandey (JIRA) Wed, 06 Feb 2019 20:57:29 -0800


    [ 
https://issues.apache.org/jira/browse/MADLIB-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762365#comment-16762365
 ]


Himanshu Pandey commented on MADLIB-1294:
-----------------------------------------

[~fmcquillan] [~dvaldano]

independent_varname in the minibatch preprocessor accepts either a column name or an expression list:
{code:java}
Supported expressions are:

'ARRAY[x1,x2,x3]', where x1, x2, and x3 are columns in the source table 
containing scalar values.
A single column in the source table containing an array like ARRAY[1,2,3] or 
{1,2,3}.
{code}
If we change the output column *independent_varname* to the actual column name, 
the single-column case is fine, 
but in the case of an expression like this:
{code:java}
ARRAY[diameter,height,whole,shucked,viscera,shell]
{code}
what should we change the independent_varname to?

 

For example:

{code}

SELECT minibatch_preprocessor('minibatch_preprocessing_input',
                              'minibatch_preprocessing_out',
                              'rings',
                              'ARRAY[diameter,height,whole,shucked,viscera,shell]',
                              NULL,
                              4,
                              TRUE);

{code}

 

1. Instead of independent_varname, put all the columns in the 
expression in *{}*, like this:

*{diameter,height,whole,shucked,viscera,shell}*

but this becomes unwieldy as the list grows.

 

2. Give different names based on the number of columns? For a single 
column, use the column name; for a list, use something like 
independent_varnames?
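
To make the two options concrete, here is a minimal Python sketch of how each naming rule could work. The function names, the regular expressions, and the fallback name independent_varnames are assumptions for illustration only; none of this is taken from the MADlib source.

```python
import re

# Hypothetical helpers illustrating the two naming options; not MADlib code.

def brace_list_name(independent_varname):
    """Option 1: turn ARRAY[a,b,c] into {a,b,c}; leave anything else unchanged."""
    expr = independent_varname.strip()
    match = re.fullmatch(r'ARRAY\[(.*)\]', expr, flags=re.IGNORECASE)
    if match is None:
        return expr
    columns = [col.strip() for col in match.group(1).split(',')]
    return '{' + ','.join(columns) + '}'


def count_based_name(independent_varname):
    """Option 2: keep a single column's name; use a generic name for expressions."""
    expr = independent_varname.strip()
    # A bare identifier is treated as a single (array-valued) column.
    if re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', expr):
        return expr
    # Anything else, such as ARRAY[x1,x2,x3], is treated as an expression list.
    return 'independent_varnames'


expr = 'ARRAY[diameter,height,whole,shucked,viscera,shell]'
print(brace_list_name(expr))
print(count_based_name(expr))
print(count_based_name('features'))
```

With option 1 the name length grows with the number of columns, while option 2 always yields a short name but no longer records which columns were used.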

 

Thoughts?

 

Please note that *dependent_varname* takes only one column, so we are good there. 

 

 

Thanks

 

> Field names in output table for minibatch preprocessor
> ------------------------------------------------------
>
>                 Key: MADLIB-1294
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1294
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Domino Valdano
>            Assignee: Domino Valdano
>            Priority: Minor
>             Fix For: v1.16
>
>
> The minibatch preprocessor utility used for preparing input tables before 
> training accepts  "independent_varname" and "dependent_varname" as parameters.
> I believe the original intention was to have these refer to the names of the 
> columns in the input table as well as the output table generated from it.  
> However, there is a bug in the implementation where instead of writing out 
> the output table columns as \{independent_varname} and \{dependent_varname} 
> the curly braces were omitted, meaning whatever names were in the original 
> table get wiped out and replaced by the literal strings 'independent_varname' 
> and 'dependent_varname'.  
> This makes little sense for several reasons:
> 1.) The contents of these columns are data, not variable names, so they end 
> up misnamed in the output.
> 2.) This forces you to pass the argument strings 'independent_varname' and 
> 'dependent_varname' as the column names of the resulting batched table to the 
> fit/train function it's going to be fed into.  In other words, if you're 
> using the minibatch preprocessor, then these arguments to fit/train serve no 
> purpose, since you always have to pass the same strings rather than a custom 
> name.
> 3.) You can't pick your own names for these variables, unless you want to 
> manually rename them every time after you run the minibatch preprocessor.
> Presently, we just finished making a similar minibatch preprocessing utility 
> for deep learning support in madlib 1.16.  I'd like to avoid reproducing this 
> bug in the new utility, but we don't want them to be incompatible so that 
> means we need to either fix both the old and new or neither.  The only issue 
> with fixing the old is that it's already been released that way.  So I'm 
> opening this bug report as a way of soliciting community feedback on the 
> issue.
> If there is anyone who knows of a reason why this should be viewed as a 
> feature rather than a bug, or has a need for the functionality to remain the 
> same going forward, please comment. Thanks!
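
The omitted-curly-braces problem described in the quoted report can be illustrated with a small Python sketch. The SQL templates and the render helper below are hypothetical and are not copied from the MADlib source; they only show how a missing placeholder in a format string turns a user-supplied name into a literal one.

```python
# Hypothetical templates; the real MADlib query text may differ.
BUGGY_TEMPLATE = (
    "SELECT {dependent_varname} AS dependent_varname FROM {source_table}"
)
FIXED_TEMPLATE = (
    "SELECT {dependent_varname} AS {dependent_varname} FROM {source_table}"
)


def render(template, dependent_varname, source_table):
    """Fill in the placeholders of a query template with str.format."""
    return template.format(
        dependent_varname=dependent_varname,
        source_table=source_table,
    )


# Without braces in the alias, the output column is literally 'dependent_varname'.
buggy = render(BUGGY_TEMPLATE, 'rings', 'minibatch_preprocessing_input')
# With braces in the alias, the output column keeps the user's name.
fixed = render(FIXED_TEMPLATE, 'rings', 'minibatch_preprocessing_input')

print(buggy)
print(fixed)
```

The same substitution works for a single independent column, but not directly for an ARRAY[...] expression, which is the open question in the comment above.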



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
