[ 
https://issues.apache.org/jira/browse/MADLIB-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Domino Valdano updated MADLIB-1294:
-----------------------------------
    Description: 
The minibatch preprocessor utility, used to prepare input tables before 
training, accepts the arguments "independent_varname" and "dependent_varname".

I believe the original intention was for these to name the columns in both the 
input table and the output table generated from it.  However, there is a bug in 
the implementation: the output table columns should have been written out as 
{independent_varname} and {dependent_varname}, but the curly braces were 
omitted.  As a result, whatever names were in the original table get wiped out 
and replaced by the literal strings 'independent_varname' and 
'dependent_varname'.
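
To illustrate the kind of mistake described above, here is a minimal sketch in 
Python.  It is not the actual MADlib source: the template strings, the render() 
helper, and the table and column names are all hypothetical, and it only 
assumes the output query is built from a format-style template.

```python
# Hypothetical templates, not MADlib's real SQL.
# Buggy: the output aliases lack curly braces, so they are emitted literally.
BUGGY_TEMPLATE = (
    "SELECT {independent_varname} AS independent_varname, "
    "{dependent_varname} AS dependent_varname "
    "FROM {source_table}"
)

# Intended: the aliases are placeholders too, so the caller's names survive.
FIXED_TEMPLATE = (
    "SELECT {independent_varname} AS {independent_varname}, "
    "{dependent_varname} AS {dependent_varname} "
    "FROM {source_table}"
)


def render(template, independent_varname, dependent_varname, source_table):
    """Substitute the caller-supplied names into a query template."""
    return template.format(
        independent_varname=independent_varname,
        dependent_varname=dependent_varname,
        source_table=source_table,
    )


if __name__ == "__main__":
    # With the buggy template the output columns are always named
    # independent_varname / dependent_varname, whatever the caller passed.
    print(render(BUGGY_TEMPLATE, "pixels", "label", "mnist_train"))
    print(render(FIXED_TEMPLATE, "pixels", "label", "mnist_train"))
```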

This makes little sense for several reasons:

1.) The contents of these columns are data, not variable names, so they end up 
misnamed in the output.

2.) This forces you to pass the literal strings 'independent_varname' and 
'dependent_varname' as the column names of the batched table to whichever 
fit/train function it is fed into.  In other words, if you're using the 
minibatch preprocessor, these arguments to fit/train serve no purpose, since 
you always have to pass the same strings rather than a custom name.

3.) You can't pick your own names for these variables, unless you want to 
manually rename them every time after you run the minibatch preprocessor.
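
For reference, the manual renaming mentioned in point 3 could look roughly like 
the sketch below.  The helper name and the example table and column names are 
hypothetical.  It only generates standard PostgreSQL ALTER TABLE ... RENAME 
COLUMN statements and does no identifier quoting, so it assumes plain 
lower-case names.

```python
def rename_statements(batched_table, independent_varname, dependent_varname):
    """Build SQL that restores custom column names on a batched table.

    Hypothetical workaround helper: assumes the preprocessor wrote its output
    columns as the literal names independent_varname and dependent_varname.
    """
    literal_to_custom = [
        ("independent_varname", independent_varname),
        ("dependent_varname", dependent_varname),
    ]
    return [
        "ALTER TABLE {0} RENAME COLUMN {1} TO {2};".format(batched_table, old, new)
        for old, new in literal_to_custom
    ]


if __name__ == "__main__":
    for statement in rename_statements("mnist_batched", "pixels", "label"):
        print(statement)
```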

We have just finished building a similar minibatch preprocessing utility for 
deep learning support in MADlib 1.16.  I'd like to avoid reproducing this bug 
in the new utility, but we don't want the two utilities to be incompatible, so 
we need to fix either both the old and the new or neither.  The only issue with 
fixing the old one is that it has already been released with this behavior.  So 
I'm opening this bug report to solicit community feedback on the issue.

If there is anyone who knows of a reason why this should be viewed as a feature 
rather than a bug, or has a need for the functionality to remain the same going 
forward, please comment. Thanks!

  was:
The minibatch preprocessor utility used for preparing input tables before 
training accepts  "independent_varname" and "dependent_varname".

I believe the original intention was to have these refer to the names of the 
columns in the input table as well as the output table generated from it.  
However, there is a bug in the implementation where instead of writing out the 
output table columns as \{independent_varname} and \{dependent_varname} the 
curly braces were omitted, meaning whatever names were in the original table 
get wiped out and replaced by the literal strings 'independent_varname' and 
'dependent_varname'.  

This makes no sense for several reasons:

1.) The contents of these columns are data, not variable names, so they end up 
misnamed in the output.

2.) This forces you to pass the argument strings 'independent_varname' and 
'dependent_varname' as the column names of the resulting batched table to the 
fit/train function it's going to be fed into.  In other words, if you're using 
the minibatch preprocessor, then these arguments to fit/train serve no purpose, 
since you always have to pass the same strings rather than a custom name.

3.) You can't pick your own names for these variables, unless you want to 
manually rename them every time after you run the minibatch preprocessor.

Presently, we just finished making a similar minibatch preprocessing utility 
for deep learning support in madlib 1.16.  I'd like to avoid reproducing this 
bug in the new utility, but we don't want them to be incompatible so that means 
we need to either fix both the old and new or neither.  The only issue with 
fixing the old is that it's already been released that way.  So I'm opening 
this bug report as a way of soliciting community feedback on the issue.

If there is anyone who knows of a reason why this should be viewed as a feature 
rather than a bug, or has a need for the functionality to remain the same going 
forward, please comment. Thanks!


> Field names in output table for minibatch preprocessor
> ------------------------------------------------------
>
>                 Key: MADLIB-1294
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1294
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Domino Valdano
>            Assignee: Domino Valdano
>            Priority: Major
>             Fix For: v1.16
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
