Re: Problem with ML pipeline

Till Rohrmann Mon, 08 Jun 2015 01:02:58 -0700

You're right Felix. You need to provide the `FitOperation` and
`PredictOperation` for the `Predictor` you want to use and the
`FitOperation` and `TransformOperation` for all `Transformer`s you want to
chain in front of the `Predictor`.
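
As a rough illustration of how such an implicit operation gets picked up, here is a minimal sketch of the type-class pattern. It assumes simplified stand-ins rather than the real FlinkML classes: `Seq` replaces `DataSet`, Scala's `Vector[Double]` replaces the Flink vector type, and `LinearModel`, `predictVectors` and `predictWithId` are hypothetical names made up for this example.

```scala
// Simplified stand-ins, NOT the real FlinkML API: Seq replaces DataSet and
// Scala's Vector[Double] replaces the Flink vector type.
final case class LabeledVector(label: Double, features: Vector[Double])

// Hypothetical type class: how `Model` maps `In` elements to `Out` elements.
trait PredictOperation[Model, In, Out] {
  def predict(model: Model, input: Seq[In]): Seq[Out]
}

// Hypothetical linear model standing in for a trained Predictor such as SVM.
final case class LinearModel(weights: Vector[Double]) {
  // The matching operation is resolved implicitly from the element type.
  def predict[In, Out](input: Seq[In])(
      implicit op: PredictOperation[LinearModel, In, Out]): Seq[Out] =
    op.predict(this, input)
}

object LinearModel {
  // Dot product of the weights and the features.
  private def score(m: LinearModel, x: Vector[Double]): Double =
    m.weights.zip(x).map { case (w, v) => w * v }.sum

  // Default operation: plain vectors in, labeled vectors out.
  implicit val predictVectors
      : PredictOperation[LinearModel, Vector[Double], LabeledVector] =
    new PredictOperation[LinearModel, Vector[Double], LabeledVector] {
      def predict(m: LinearModel, input: Seq[Vector[Double]]): Seq[LabeledVector] =
        input.map(x => LabeledVector(score(m, x), x))
    }

  // Special operation: (vector, id) in, (labeled vector, id) out.
  implicit val predictWithId
      : PredictOperation[LinearModel, (Vector[Double], Int), (LabeledVector, Int)] =
    new PredictOperation[LinearModel, (Vector[Double], Int), (LabeledVector, Int)] {
      def predict(m: LinearModel,
                  input: Seq[(Vector[Double], Int)]): Seq[(LabeledVector, Int)] =
        input.map { case (x, id) => (LabeledVector(score(m, x), x), id) }
    }
}

object PredictWithIdExample {
  def main(args: Array[String]): Unit = {
    val model = LinearModel(Vector(2.0, -1.0))
    // The tuple element type selects predictWithId, so the ids are carried along.
    val withIds = model.predict(Seq((Vector(1.0, 1.0), 7), (Vector(0.0, 3.0), 8)))
    withIds.foreach(println)
  }
}
```

The point of the pattern is that the `Predictor` itself stays unchanged; only callers that need the ID pay for carrying it.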


Specifying which features to take could be a solution. However, then you're
always carrying data along which is not needed. Especially for large-scale
data, this might be prohibitively expensive. I guess the more efficient
solution would be to assign an ID and later join with the removed feature
elements.
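
To make the ID-and-join idea concrete, here is a small sketch. It assumes plain Scala collections as a stand-in for Flink DataSets, and `Row`, `predictLabel` and `predictAndRejoin` are hypothetical names used only for illustration; the summing "model" is a placeholder for any trained predictor.

```scala
// Plain Scala collections stand in for Flink DataSets in this sketch.
// `dropped` represents whatever non-feature fields were removed.
final case class Row(id: Long, features: Vector[Double], dropped: String)

object JoinByIdExample {
  // Hypothetical placeholder for a trained model's prediction.
  def predictLabel(x: Vector[Double]): Double = x.sum

  def predictAndRejoin(rows: Seq[Row]): Seq[(String, Double)] = {
    // Only (id, features) travels through the prediction step.
    val predictions: Seq[(Long, Double)] =
      rows.map(r => (r.id, predictLabel(r.features)))

    // The removed fields are kept aside, keyed by the same ID.
    val removed: Map[Long, String] = rows.map(r => (r.id, r.dropped)).toMap

    // Join the predictions back with the removed fields on the ID.
    predictions.flatMap { case (id, y) => removed.get(id).map(d => (d, y)) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row(1L, Vector(0.5, 1.5), "2015-06-04"),
      Row(2L, Vector(2.0, 1.0), "2015-06-05"))
    predictAndRejoin(rows).foreach(println)
  }
}
```

With real DataSets the last step would be a join on the ID field, so only the small ID travels with the features instead of the removed columns.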

Cheers,
Till

On Mon, Jun 8, 2015 at 7:11 AM Sachin Goel <[email protected]> wrote:

> A more general approach would be to take as input which indices of the
> vector to consider as features. After that, the vector can be returned as
> such and the user can do what they wish with the non-feature values. This
> wouldn't need extending the predict operation, instead this can be
> specified in the model itself using a set parameter function. Or perhaps a
> better approach is to just take this input in the predict operation.
>
> Cheers!
> Sachin
> On Jun 8, 2015 10:17 AM, "Felix Neutatz" <[email protected]> wrote:
>
> > Probably we also need it for the other classes of the pipeline as well, in
> > order to be able to pass the ID through the whole pipeline.
> >
> > Best regards,
> > Felix
> > On 06.06.2015 at 9:46 AM, "Till Rohrmann" <[email protected]> wrote:
> >
> > > Then you only have to provide an implicit PredictOperation[SVM, (T, Int),
> > > (LabeledVector, Int)] value with T <: Vector in the scope where you call
> > > the predict operation.
> > > On Jun 6, 2015 8:14 AM, "Felix Neutatz" <[email protected]> wrote:
> > >
> > > > That would be great. I like the special predict operation better because
> > > > it is only necessary to return the id in some cases. The special predict
> > > > operation would save this overhead.
> > > >
> > > > Best regards,
> > > > Felix
> > > > On 04.06.2015 at 7:56 PM, "Till Rohrmann" <[email protected]> wrote:
> > > >
> > > > > I see your problem. One way to solve the problem is to implement a special
> > > > > PredictOperation which takes a tuple (id, vector) and returns a tuple
> > > > > (id, labeledVector). You can take a look at the implementation for the
> > > > > vector prediction operation.
> > > > >
> > > > > But we can also discuss adding an ID field to the Vector type.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > > On Jun 4, 2015 7:30 PM, "Felix Neutatz" <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have the following use case: I want to do regression for a time series
> > > > > > dataset like:
> > > > > >
> > > > > > id, x1, x2, ..., xn, y
> > > > > >
> > > > > > id = point in time
> > > > > > x = features
> > > > > > y = target value
> > > > > >
> > > > > > In the Flink framework I would map this to a LabeledVector (y,
> > > > > > DenseVector(x)). (I don't want to use the id as a feature.)
> > > > > >
> > > > > > When I finally apply the predict() method I get a LabeledVector
> > > > > > (y_predicted, DenseVector(x)).
> > > > > >
> > > > > > Now my problem is that I would like to plot the predicted target value
> > > > > > against its point in time.
> > > > > >
> > > > > > What I have to do now is:
> > > > > >
> > > > > > a = predictedDataSet.map ( LabeledVector => Tuple2(x,y_p))
> > > > > > b = originalDataSet.map("id, x1, x2, ..., xn, y" => Tuple2(x,id))
> > > > > >
> > > > > > a.join(b).where("x").equalTo("x") { (a,b) => (id, y_p) }
> > > > > >
> > > > > > This is really a cumbersome process for such a simple thing. Is there any
> > > > > > approach which makes this simpler? If not, can we extend the ML API to
> > > > > > allow ids?
> > > > > >
> > > > > > Best regards,
> > > > > > Felix
> > > > > >
> > > > >
> > > >
> > >
> >
>
