[
https://issues.apache.org/jira/browse/MADLIB-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Domino Valdano updated MADLIB-1294:
-----------------------------------
Description:
The minibatch preprocessor utility used for preparing input tables before
training accepts "independent_varname" and "dependent_varname" as parameters.
I believe the original intention was to have these refer to the names of the
columns in the input table as well as the output table generated from it.
However, there is a bug in the implementation: the output table columns were
meant to be written as {independent_varname} and {dependent_varname}, but the
curly braces were omitted, so whatever names were in the original table get
wiped out and replaced by the literal strings 'independent_varname' and
'dependent_varname'.
This makes little sense for several reasons:
1.) The contents of these columns are data, not variable names, so they end up
misnamed in the output.
2.) This forces you to pass the literal strings 'independent_varname' and
'dependent_varname' as the column names of the batched table when calling the
fit/train function that consumes it. In other words, if you're using the
minibatch preprocessor, these arguments to fit/train serve no purpose, since
you always have to pass the same strings rather than a custom name.
3.) You can't pick your own names for these variables, unless you want to
manually rename them every time after you run the minibatch preprocessor.
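
To illustrate the omitted-braces problem described above, here is a minimal
sketch of how it could arise in a Python format string. The function names, the
query shape, and the AS aliases are hypothetical and are not taken from
MADlib's source; only the two parameter names come from this report.

```python
# Hypothetical sketch, NOT MADlib's actual code: it only shows how leaving
# the braces off a format placeholder turns a parameter into a literal string.

def build_select_buggy(independent_varname, dependent_varname, source_table):
    # The aliases after AS have no braces, so str.format() leaves them alone
    # and the output columns are always named by the literal strings.
    template = (
        "SELECT {independent_varname} AS independent_varname, "
        "{dependent_varname} AS dependent_varname "
        "FROM {source_table}"
    )
    return template.format(
        independent_varname=independent_varname,
        dependent_varname=dependent_varname,
        source_table=source_table,
    )


def build_select_fixed(independent_varname, dependent_varname, source_table):
    # With braces, the aliases are substituted too, so the output columns
    # keep the names the caller passed in.
    template = (
        "SELECT {independent_varname} AS {independent_varname}, "
        "{dependent_varname} AS {dependent_varname} "
        "FROM {source_table}"
    )
    return template.format(
        independent_varname=independent_varname,
        dependent_varname=dependent_varname,
        source_table=source_table,
    )


if __name__ == "__main__":
    # 'pixels', 'label', and 'images' are made-up example names.
    print(build_select_buggy("pixels", "label", "images"))
    print(build_select_fixed("pixels", "label", "images"))
```

In the buggy variant the caller's column names survive only on the input side
of the query, which matches the behavior reported here; the fixed variant
carries them through to the output table.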
We have just finished building a similar minibatch preprocessing utility for
deep learning support in MADlib 1.16. I'd like to avoid reproducing this bug in
the new utility, but the two utilities should not be incompatible, which means
we need to either fix both the old and the new one or neither. The only issue
with fixing the old one is that it has already been released with this
behavior. So I'm opening this bug report to solicit community feedback on the
issue.
If there is anyone who knows of a reason why this should be viewed as a feature
rather than a bug, or has a need for the functionality to remain the same going
forward, please comment. Thanks!
was:
The minibatch preprocessor utility used for preparing input tables before
training accepts "independent_varname" and "dependent_varname".
I believe the original intention was to have these refer to the names of the
columns in the input table as well as the output table generated from it.
However, there is a bug in the implementation where instead of writing out the
output table columns as {independent_varname} and {dependent_varname} the
curly braces were omitted, meaning whatever names were in the original table
get wiped out and replaced by the literal strings 'independent_varname' and
'dependent_varname'.
This makes little sense for several reasons:
1.) The contents of these columns are data, not variable names, so they end up
misnamed in the output.
2.) This forces you to pass the argument strings 'independent_varname' and
'dependent_varname' as the column names of the resulting batched table to the
fit/train function it's going to be fed into. In other words, if you're using
the minibatch preprocessor, then these arguments to fit/train serve no purpose,
since you always have to pass the same strings rather than a custom name.
3.) You can't pick your own names for these variables, unless you want to
manually rename them every time after you run the minibatch preprocessor.
Presently, we just finished making a similar minibatch preprocessing utility
for deep learning support in madlib 1.16. I'd like to avoid reproducing this
bug in the new utility, but we don't want them to be incompatible so that means
we need to either fix both the old and new or neither. The only issue with
fixing the old is that it's already been released that way. So I'm opening
this bug report as a way of soliciting community feedback on the issue.
If there is anyone who knows of a reason why this should be viewed as a feature
rather than a bug, or has a need for the functionality to remain the same going
forward, please comment. Thanks!
> Field names in output table for minibatch preprocessor
> ------------------------------------------------------
>
> Key: MADLIB-1294
> URL: https://issues.apache.org/jira/browse/MADLIB-1294
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Utilities
> Reporter: Domino Valdano
> Assignee: Domino Valdano
> Priority: Major
> Fix For: v1.16