There are two cases to consider here:
1. Batched model creation: an algorithm that needs multiple passes over
the same dataset, and
2. Online model creation, where just one training record is processed at
a time.
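
To make the two cases concrete, here is a minimal sketch in plain Python. It is not H2O or Apex code: the one-weight linear model, the function and class names, the learning rate, and the toy records are all assumptions chosen for illustration. The point is only that the batched trainer re-reads the whole dataset on every pass, while the online model sees each record once.

```python
# Hypothetical illustration only; none of these names come from H2O or Apex.

def train_batched(records, passes=200, lr=0.05):
    """Case 1: every pass re-reads the full dataset, so the data must be
    stored somewhere (for example on HDFS) before training can start."""
    w = 0.0
    for _ in range(passes):
        # Average gradient of the squared error over all records.
        grad = sum((w * x - y) * x for x, y in records) / len(records)
        w -= lr * grad
    return w


class OnlineModel:
    """Case 2: each record updates the model once and can then be dropped."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.lr = lr

    def update(self, x, y):
        # Single-record gradient step.
        self.w -= self.lr * (self.w * x - y) * x


# Toy feature data following y = 2x (assumed for the example).
records = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

batched_w = train_batched(records)

online = OnlineModel()
for i in range(300):  # stands in for an unbounded stream of records
    x, y = records[i % len(records)]
    online.update(x, y)
```

In the batched case the full `records` list has to exist before `train_batched` is called, which is where the external medium comes in. In the online case `update` could be called from inside a streaming operator as each record arrives.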

The second case seems a good fit for Apex. I think the proposal is for
the first case, batched model creation. In this case, there may be two
time-consuming processes involved. The first is the preparation of training
data by Apex, which might take a fair amount of time and without which H2O
may not be able to proceed. The second is the H2O training itself. Passing
the data through some external medium (HDFS) may be the most obvious thing
to do. What Siyuan is proposing is a bridge that bypasses the external
medium.

However, what are we achieving by doing this? Is it that we want a cleaner
approach than passing data via HDFS?
On 21-Oct-2015 12:16 am, "Sandesh Hegde" <[email protected]> wrote:

> This feature will be useful only if training can be done at scale. There
> may be some models which can be built incrementally. Do you know of any?
>
> On Tue, Oct 20, 2015 at 11:37 AM Siyuan Hua <[email protected]>
> wrote:
>
> > Hi Sandesh,
> >
> > This is not meant to scale up H2O itself. It is just a bridge
> > between H2O and Apex. Today, if you want to use Apex to prepare the data
> > for H2O, you have to write the data to some file (e.g. on HDFS) and then
> > manually start H2O to build the model.
> > With this bridge you can build one pipeline to do the whole thing.
> >
> >
> > Siyuan
> >
> > On Tue, Oct 20, 2015 at 10:56 AM, Sandesh Hegde <[email protected]
> >
> > wrote:
> >
> > > How do you propose to handle the scalability required for H2O model
> > > creation?
> > >
> > > On Tue, Oct 20, 2015 at 9:58 AM Siyuan Hua <[email protected]> wrote:
> > >
> > > > > In ML model training, we discovered a pattern: Apex can be used to
> > > > > process raw data into feature data, and H2O then takes the feature
> > > > > data into its model training engine to train the model.
> > > > >
> > > > > But there is a gap between the two pipelines. I propose that we
> > > > > create an operator that feeds the processed data directly into H2O,
> > > > > or maybe starts a container for H2O and sends data into it. That way,
> > > > > we could build a continuous online model training pipeline.
> > > >
> > > > > I've created a JIRA here:
> > > > > https://malhar.atlassian.net/browse/MLHR-1875
> > > > >
> > > > > Feel free to share any thoughts.
> > > >
> > > > Best,
> > > > Siyuan
> > > >
> > >
> >
>
