[ https://issues.apache.org/jira/browse/MADLIB-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808958#comment-16808958 ]
Frank McQuillan commented on MADLIB-1314:
-----------------------------------------
https://github.com/apache/madlib/pull/361
LGTM
> Add optional num_classes param for minibatch preprocessor for DL
> ----------------------------------------------------------------
>
> Key: MADLIB-1314
> URL: https://issues.apache.org/jira/browse/MADLIB-1314
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Deep Learning, Module: Utilities
> Reporter: Nandish Jayaram
> Priority: Major
> Fix For: v1.16
>
>
> The current `minibatch_preprocessor_dl` module looks at the input table to
> find the number of distinct categories (class values) for the dependent
> variable, and uses that number as the size of the one-hot encoded array. This
> can cause the madlib_keras fit function to fail if the `num_classes`
> defined in the model architecture differs from (for example, is greater
> than) the size of the one-hot encoded array.
> This could be a fairly common scenario. For example, say the original data
> set is places 350, but we decide to sample a subset. That subset may not
> contain all 350 classes (assume it has only 10), but the model we have
> already defined is for places 350 (so `num_classes` there is specified as
> 350, and the final layer has that many units). Without this feature, which
> creates a one-hot encoded vector of size 350 despite finding only 10 class
> values in the input dataset, we would have to change the model architecture
> to work with the sampled dataset.
> Acceptance:
> 1. Add an optional `num_classes` param of type integer.
> 2. The one-hot encoded array must be of size `num_classes` if specified;
> otherwise use the number of distinct class values.
> 3. Fail if `num_classes < distinct class values found in dataset`.
> 4. The `class_values` column in the summary table must have `NULL` as the
> entry for class values that do not exist in the input table.
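
The acceptance criteria above can be sketched in plain Python. This is only an illustration of the intended behavior, not the MADlib implementation: the function name `one_hot_encode`, its signature, the sorted ordering of class values, and the choice to place the `None` padding (standing in for SQL `NULL`) at the end of `class_values` are all assumptions made for the example.

```python
from typing import Hashable, List, Optional, Sequence, Tuple


def one_hot_encode(
    labels: Sequence[Hashable],
    num_classes: Optional[int] = None,
) -> Tuple[List[List[int]], List[Optional[Hashable]]]:
    """Hypothetical helper: one-hot encode labels into vectors of size num_classes."""
    # Distinct class values actually present in the input, in sorted order.
    distinct = sorted(set(labels))

    # If num_classes is not given, fall back to the number of distinct values.
    if num_classes is None:
        num_classes = len(distinct)

    # Fail if the requested size cannot hold every class found in the data.
    if num_classes < len(distinct):
        raise ValueError(
            "num_classes (%d) is less than the %d distinct class values found"
            % (num_classes, len(distinct))
        )

    # Pad class_values with None (standing in for SQL NULL) for classes that
    # do not appear in the input. Padding at the end is an assumption.
    class_values = distinct + [None] * (num_classes - len(distinct))

    # Map each observed class value to its position in the encoded vector.
    index = {value: i for i, value in enumerate(distinct)}
    encoded = []
    for label in labels:
        row = [0] * num_classes
        row[index[label]] = 1
        encoded.append(row)
    return encoded, class_values


# Usage: a sampled subset with 2 classes, encoded for a 4-class model.
encoded, class_values = one_hot_encode(["cat", "dog", "cat"], num_classes=4)
```

With `num_classes=4`, every encoded vector has length 4 even though only two classes appear in the sample, so a model whose final layer has 4 units would not need to change.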
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)