Hi Fabian, thanks!
1) Is it a strict requirement that a ML pipeline must be able to handle different input types?
I understand that it makes sense to have different models for different instances of the same type, i.e., same data type but different keys. Hence, the key-based joins make sense to me. However, couldn't completely different types be handled by different ML pipelines or would there be major drawbacks?
Could you elaborate more on this? Right now we only use keys when we do the join. A given pipeline can handle only a well-defined type (the type can be a simple string with a custom value, no need to be a class type), which serves as the key.
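To make this concrete, here is a minimal sketch of the wiring. The record shapes and names (DataRecord, ModelToServe, ModelScorer, dataStream, modelStream) are made up for illustration, they are not the FLIP code:

import org.apache.flink.streaming.api.scala._

// Hypothetical record shapes; the dataType string serves as the key.
case class DataRecord(dataType: String, payload: Array[Byte])
case class ModelToServe(dataType: String, modelBytes: Array[Byte])

// dataStream: DataStream[DataRecord], modelStream: DataStream[ModelToServe].
// Both streams are keyed by the same type string, so a single pipeline can
// serve several models, one per type/key, through a keyed two-input operator.
val scored: DataStream[Double] =
  dataStream.keyBy(_.dataType)
    .connect(modelStream.keyBy(_.dataType))
    .flatMap(new ModelScorer) // keyed CoFlatMapFunction, sketched under point 4 below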
2) I think from an API point of view it would be better to not require input records to be encoded as ProtoBuf messages. Instead, the model server could accept strongly-typed objects (Java/Scala) and (if necessary) convert them to ProtoBuf messages internally. In case we need to support different types of records (see my first point), we can introduce a Union type (i.e., an n-ary Either type). I see that we need some kind of binary encoding format for the models but maybe also this can be designed to be pluggable such that later other encodings can be added.
We do use Scala classes (strongly typed classes); protobuf is only used on the wire. For on-the-wire encoding we prefer protobufs for their size, expressiveness, and ability to represent different data types.
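Roughly like the sketch below: protobuf stays at the ingestion boundary and everything downstream works with Scala case classes. RecordProto stands in for a hypothetical generated protobuf class (with the usual parseFrom), rawBytes for the raw input stream, and DataRecord is the case class from the sketch above:

import org.apache.flink.streaming.api.scala._
import scala.util.Try

// "RecordProto" is a hypothetical generated protobuf class with the
// standard parseFrom(Array[Byte]) method; it is only touched here.
def decode(wire: Array[Byte]): Option[DataRecord] =
  Try {
    val p = RecordProto.parseFrom(wire)
    DataRecord(p.getDataType, p.getPayload.toByteArray)
  }.toOption // malformed messages are simply dropped

// rawBytes: DataStream[Array[Byte]], e.g. from a Kafka source.
// Downstream operators only ever see strongly typed Scala objects.
val records: DataStream[DataRecord] =
  rawBytes.flatMap(bytes => decode(bytes).toList)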
3) I think the DataStream Java API should be supported as a first-class citizen for this library.
I agree. It should either be a first priority or the next thing to do.
4) For the integration with the DataStream API, we could provide an API that receives (typed) DataStream objects, internally constructs the DataStream operators, and returns one (or more) result DataStreams. The benefit is that we don't need to change the DataStream API directly, but put a library on top. The other libraries (CEP, Table, Gelly) follow this approach.
We will provide a DSL which will do just this. But even without the DSL, this is what we already do with low-level joins, roughly as in the sketch below.
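A sketch of the low-level join, i.e. the ModelScorer used in the snippet under point 1 (again only illustrative, the real code differs in details):

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

// Keyed two-input operator: one input carries data records, the other model
// updates. The current model is kept in keyed state, i.e. one model per key.
class ModelScorer extends RichCoFlatMapFunction[DataRecord, ModelToServe, Double] {

  @transient private var modelState: ValueState[Array[Byte]] = _

  override def open(parameters: Configuration): Unit =
    modelState = getRuntimeContext.getState(
      new ValueStateDescriptor[Array[Byte]]("model", classOf[Array[Byte]]))

  // Data path: score the record with the current model, if one has arrived.
  override def flatMap1(record: DataRecord, out: Collector[Double]): Unit = {
    val model = modelState.value()
    if (model != null) out.collect(score(model, record))
  }

  // Model path: swap in the new model for this key.
  override def flatMap2(update: ModelToServe, out: Collector[Double]): Unit =
    modelState.update(update.modelBytes)

  // Placeholder for the actual model evaluation.
  private def score(model: Array[Byte], record: DataRecord): Double = 0.0
}

The DSL would essentially wrap this wiring, so that a user passes in typed DataStreams and gets result DataStreams back, in the same library-on-top spirit as CEP/Table/Gelly.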
5) I'm skeptical about using queryable state to expose metrics. Did you consider using Flink's metrics system [1]? It is easily configurable and we provided several reporters that export the metrics.
This of course is an option. The choice of queryable state was mostly driven by the simplicity of real-time integration. Any reason why the metrics system is better?
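If that is the suggestion, I assume it would look roughly like this inside the scoring operator (a sketch only; CountingScorer and DataRecord are illustrative names, the metric group calls are the standard RichFunction ones):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Counter

// A counter registered via the operator's metric group is exported by
// whatever reporter is configured (JMX, Graphite, ...), no serving code needed.
class CountingScorer extends RichMapFunction[DataRecord, Double] {

  @transient private var recordsScored: Counter = _

  override def open(parameters: Configuration): Unit =
    recordsScored = getRuntimeContext.getMetricGroup.counter("recordsScored")

  override def map(record: DataRecord): Double = {
    recordsScored.inc()
    0.0 // placeholder for the actual model score
  }
}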
Best,
Stavros
On Mon, Nov 27, 2017 at 4:23 PM, Fabian Hueske <fhue...@gmail.com>
wrote:
Hi Stavros,
thanks for the detailed FLIP!
Model serving is an important use case and it's great to see efforts to add a library for this to Flink!
I've read the FLIP and would like to ask a few questions and make some
suggestions.
1) Is it a strict requirement that a ML pipeline must be able to handle different input types?
I understand that it makes sense to have different models for different instances of the same type, i.e., same data type but different keys. Hence, the key-based joins make sense to me. However, couldn't completely different types be handled by different ML pipelines or would there be major drawbacks?
2) I think from an API point of view it would be better to not require input records to be encoded as ProtoBuf messages. Instead, the model server could accept strongly-typed objects (Java/Scala) and (if necessary) convert them to ProtoBuf messages internally. In case we need to support different types of records (see my first point), we can introduce a Union type (i.e., an n-ary Either type). I see that we need some kind of binary encoding format for the models but maybe also this can be designed to be pluggable such that later other encodings can be added.
3) I think the DataStream Java API should be supported as a first-class citizen for this library.
4) For the integration with the DataStream API, we could provide an API that receives (typed) DataStream objects, internally constructs the DataStream operators, and returns one (or more) result DataStreams. The benefit is that we don't need to change the DataStream API directly, but put a library on top. The other libraries (CEP, Table, Gelly) follow this approach.
5) I'm skeptical about using queryable state to expose metrics. Did you consider using Flink's metrics system [1]? It is easily configurable and we provided several reporters that export the metrics.
What do you think?
Best, Fabian
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/metrics.html
2017-11-23 12:32 GMT+01:00 Stavros Kontopoulos <st.kontopou...@gmail.com>:
Hi guys,
Let's discuss the new FLIP proposal for model serving over Flink. The idea is to combine previous efforts there and provide a library on top of Flink for serving models.
https://cwiki.apache.org/confluence/display/FLINK/FLIP-23+-+Model+Serving
Code from previous efforts can be found here:
https://github.com/FlinkML
Best,
Stavros