Nandish Jayaram created MADLIB-1314:
---------------------------------------
Summary: Add optional num_classes param for minibatch preprocessor
for DL
Key: MADLIB-1314
URL: https://issues.apache.org/jira/browse/MADLIB-1314
Project: Apache MADlib
Issue Type: New Feature
Components: Deep Learning, Module: Utilities
Reporter: Nandish Jayaram
Fix For: v1.16
The current `minibatch_preprocessor_dl` module looks at the input table to find
the number of distinct categories (class values) for the dependent variable,
and uses that number as the size of the one-hot encoded array. This could lead
to a failure in the madlib_keras fit function if the `num_classes` defined in
the model architecture differs from (typically, is greater than) the size of
the one-hot encoded array.
This could be a fairly common scenario. For example:
Say the original data set is places 350, but we decide to sample a subset. That
subset may not contain all 350 classes (assume it has only 10), but the model
we have already defined is for places 350, so `num_classes` there would be
specified as 350 and the final layer would have that many units. Without this
feature, which creates a one-hot encoded vector of size 350 despite finding
only 10 class values in the input dataset, we would have to change the model
architecture to work with the sampled dataset.
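The mismatch described above can be reproduced with a minimal sketch in plain Python. This is not MADlib code: `one_hot_from_data`, `MODEL_NUM_CLASSES`, and the sample labels are hypothetical names and data chosen only to mirror the scenario, under the assumption that the encoder sizes its output from the distinct values it sees.

```python
def one_hot_from_data(labels):
    """Encode labels using only the distinct values present in `labels`.

    Hypothetical stand-in for an encoder that derives the vector width
    from the data, as the issue describes for the current behaviour.
    """
    class_values = sorted(set(labels))
    index = {value: i for i, value in enumerate(class_values)}
    encoded = []
    for label in labels:
        row = [0] * len(class_values)
        row[index[label]] = 1
        encoded.append(row)
    return class_values, encoded


# Assumed for illustration: the model's final layer has 350 units,
# but the sampled subset only contains 10 distinct class values.
MODEL_NUM_CLASSES = 350
sampled_labels = [i % 10 for i in range(40)]

class_values, encoded = one_hot_from_data(sampled_labels)
encoded_width = len(encoded[0])

# The encoded width follows the data, not the model definition.
shapes_match = encoded_width == MODEL_NUM_CLASSES
```

Here `encoded_width` is derived from the sample alone, so it disagrees with the width the model was defined for; that disagreement is the condition the issue expects to make the fit function fail.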
Acceptance:
1. Add an optional `num_classes` param of type integer.
1. The one-hot encoded array must be of size `num_classes` if specified;
otherwise use the number of distinct class values.
1. Fail if `num_classes` is less than the number of distinct class values found
in the dataset.
1. The `class_values` column in the summary table must have `NULL` as the entry
for class values that do not exist in the input table.
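The acceptance criteria above could be sketched as follows. This is an illustrative plain-Python sketch, not the MADlib implementation: the function name `one_hot_encode` is hypothetical, Python's `None` stands in for SQL `NULL`, and labels are assumed to be non-NULL and mutually comparable so they can be sorted.

```python
def one_hot_encode(labels, num_classes=None):
    """One-hot encode `labels`, optionally padding to `num_classes`.

    Returns (class_values, encoded), where `class_values` is padded with
    None (standing in for NULL) for classes absent from the input.
    """
    distinct = sorted(set(labels))

    if num_classes is None:
        # Criterion 2 (fallback): size follows the distinct class values.
        num_classes = len(distinct)
    elif num_classes < len(distinct):
        # Criterion 3: fail when num_classes is too small for the data.
        raise ValueError(
            "num_classes (%d) is less than the number of distinct class "
            "values found in the dataset (%d)" % (num_classes, len(distinct))
        )

    # Criterion 4: pad the class value list with None for missing classes.
    class_values = distinct + [None] * (num_classes - len(distinct))

    index = {value: i for i, value in enumerate(distinct)}
    encoded = []
    for label in labels:
        row = [0] * num_classes  # Criterion 2: width is num_classes.
        row[index[label]] = 1
        encoded.append(row)
    return class_values, encoded


class_values, encoded = one_hot_encode(["cat", "dog", "cat"], num_classes=4)
```

In this sketch the padding entries are placed after the observed class values; the issue does not state where the `NULL` entries should appear, so that ordering is an assumption.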
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)