Hi,

Sorry for the late follow-up.

I think I understand the motivation for choosing ProtoBuf as the
representation and serialization format and this makes sense to me.

However, it might be a good idea to provide tooling to convert Flink types
(described by TypeInformation) to ProtoBuf.
Otherwise, users of the model serving library would need to manually
convert their data types (say Scala tuples, case classes, or Avro Pojos) to
ProtoBuf messages.
I don't think that this needs to be included in the first version but it
might be a good extension to make the library easier to use.
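
To make this concrete, here is a rough sketch of the conversion boilerplate
such tooling could remove. WineRecord and WineRecordProto are made-up names
(the latter standing in for a hypothetical protoc-generated message class);
nothing here is taken from the FLIP:

  import org.apache.flink.streaming.api.scala._

  // Example case class as it would appear in a user's DataStream program.
  case class WineRecord(fixedAcidity: Double, alcohol: Double)

  object RecordConversion {
    // Hand-written mapping to a hypothetical protoc-generated message class.
    // The suggested tooling would derive this from the record's
    // TypeInformation instead of requiring the user to write it.
    def toProto(r: WineRecord): WineRecordProto =
      WineRecordProto.newBuilder()
        .setFixedAcidity(r.fixedAcidity)
        .setAlcohol(r.alcohol)
        .build()

    // The conversion step users would need before handing records to a
    // model serving operator.
    def prepare(records: DataStream[WineRecord]): DataStream[WineRecordProto] =
      records.map(toProto _)
  }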

Best,
Fabian



2017-11-28 17:22 GMT+01:00 Boris Lublinsky <boris.lublin...@lightbend.com>:

> Thanks Fabian,
> More below
>
>
>
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com
> https://www.lightbend.com/
>
> On Nov 28, 2017, at 8:21 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
> Hi Boris and Stavros,
>
> Thanks for the responses.
>
> Ad 1) Thanks for the clarification. I think I misunderstood this part of
> the proposal.
> I interpreted the argument for choosing ProtoBuf for network encoding
> ("ability to represent different data types") to mean that a single model
> pipeline should work on different data types.
> I agree that it should be possible to give records of the same type (but
> with different keys) to different models. The key-based join approach looks
> good to me.
>
> Ad 2) I understand that ProtoBuf is a good choice to serialize models for
> the given reasons.
> However, the choice of ProtoBuf serialization for the records might make
> the integration with existing libraries and also regular DataStream
> programs more difficult.
> They all use Flink's TypeSerializer system to serialize and deserialize
> records by default. Hence, we would need to add a conversion step before
> records can be passed to a model serving operator.
> Are you expecting some common format that all records follow (such as a
> Row or Vector type) or do you plan to support arbitrary records such as
> Pojos?
> If you plan for a specific type, you could add a TypeInformation for this
> type with a TypeSerializer that is based on ProtoBuf.
>
> The way I look at it is slightly different. The common format for records,
> supported by Flink, is a byte array with a small header that describes the
> data type and is used for routing. The actual unmarshalling is done by the
> model implementation itself. This provides maximum flexibility and gives
> users the freedom to create their own types without breaking the underlying
> framework.
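
For illustration only, the envelope described above might look roughly like
this; the names (DataRecord, ModelToServe) are made up for the sketch and not
taken from the FLIP:

  import org.apache.flink.streaming.api.scala._

  // Illustrative wire envelope: a small header naming the data type (used as
  // the routing key) plus the ProtoBuf-encoded payload, which only the model
  // implementation unmarshals.
  case class DataRecord(dataType: String, payload: Array[Byte])
  case class ModelToServe(dataType: String, modelBytes: Array[Byte])

  object Routing {
    // Records and models with the same data type meet via a key-based join;
    // how the payload is decoded is left to the model implementation.
    def route(records: DataStream[DataRecord],
              models: DataStream[ModelToServe]) =
      records.keyBy(_.dataType).connect(models.keyBy(_.dataType))
  }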
>
> Ad 4) @Boris: I made this point not about the serialization format but
> about how the library would integrate with Flink's DataStream API.
> I thought I had seen a code snippet that showed a new method on the
> DataStream object but cannot find this anymore.
> So, I just wanted to make the point that we should not change the
> DataStream API (unless it lacks support for some features) and build the
> model serving library on top of it.
> But I gather from Stavros' answer that this is your design anyway.
>
> Ad 5) The metrics system is the default way to expose system and job
> metrics in Flink. Due to the pluggable reporter interface and various
> reporters, they can be easily integrated in many production environments.
> A solution based on queryable state will always need custom code to access
> the information. Of course this can be an optional feature.
>
> What do others think about this proposal?
>
> We had agreement among the working group (Eron, Bas, Andrea, etc.), but you
> are the first one outside of it. My book
> https://www.lightbend.com/blog/serving-machine-learning-models-free-oreilly-ebook-from-lightbend
> has reasonably good reviews, so we are hoping this will work.
>
>
> Best, Fabian
>
>
> 2017-11-28 13:53 GMT+01:00 Stavros Kontopoulos <st.kontopou...@gmail.com>:
>
>> Hi Fabian, thanks!
>>
>>
>> > 1) Is it a strict requirement that a ML pipeline must be able to handle
>> > different input types?
>> > I understand that it makes sense to have different models for different
>> > instances of the same type, i.e., same data type but different keys.
>> Hence,
>> > the key-based joins make sense to me. However, couldn't completely
>> > different types be handled by different ML pipelines or would there be
>> > major drawbacks?
>>
>>
>> Could you elaborate more on this? Right now we only use keys when we do
>> the join. A given pipeline can handle only a well-defined type (the type
>> can be a simple string with a custom value, it does not need to be a
>> class type), which serves as the key.
>>
>> 2)
>>
>> I think from an API point of view it would be better to not require
>> > input records to be encoded as ProtoBuf messages. Instead, the model
>> server
>> > could accept strongly-typed objects (Java/Scala) and (if necessary)
>> convert
>> > them to ProtoBuf messages internally. In case we need to support
>> different
>> > types of records (see my first point), we can introduce a Union type
>> (i.e.,
>> > an n-ary Either type). I see that we need some kind of binary encoding
>> > format for the models but maybe also this can be designed to be
>> pluggable
>> > such that later other encodings can be added.
>> >
>>  We do use Scala classes (strongly typed classes); ProtoBuf is only used
>> on the wire. For on-the-wire encoding we prefer ProtoBuf for its size,
>> expressiveness, and ability to represent different data types.
>>
>> 3)
>>
>> > I think the DataStream Java API should be supported as a first-class
>> > citizen for this library.
>>
>>
>> I agree. It should be either the first priority or the next thing to do.
>>
>>
>> 4)
>>
>> For the integration with the DataStream API, we could provide an API that
>> > receives (typed) DataStream objects, internally constructs the
>> DataStream
>> > operators, and returns one (or more) result DataStreams. The benefit is
>> > that we don't need to change the DataStream API directly, but put a
>> library
>> > on top. The other libraries (CEP, Table, Gelly) follow this approach.
>>
>>
>>  We will provide a DSL which will do just this. But even without the DSL,
>> this is what we do with the low-level joins.
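
As a purely hypothetical sketch, such a library-style entry point (in the
spirit of CEP or Gelly) might look like the following, reusing the
illustrative DataRecord and ModelToServe types from the sketch further up;
ServingResult and ModelServingFunction are placeholders as well:

  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
  import org.apache.flink.util.Collector

  // Placeholder result type; the real library would define its own.
  case class ServingResult(dataType: String, score: Double)

  class ModelServingFunction
      extends RichCoFlatMapFunction[DataRecord, ModelToServe, ServingResult] {
    // Placeholder bodies: the real function would keep the current model in
    // keyed state and score incoming records against it.
    override def flatMap1(record: DataRecord, out: Collector[ServingResult]): Unit = ???
    override def flatMap2(model: ModelToServe, out: Collector[ServingResult]): Unit = ???
  }

  object ModelServing {
    // Consumes and produces DataStreams only; the DataStream API itself
    // stays untouched.
    def score(records: DataStream[DataRecord],
              models: DataStream[ModelToServe]): DataStream[ServingResult] =
      records
        .keyBy(_.dataType)
        .connect(models.keyBy(_.dataType))
        .flatMap(new ModelServingFunction)
  }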
>>
>>
>> 5)
>>
>> > I'm skeptical about using queryable state to expose metrics. Did you
>> > consider using Flink's metrics system [1]? It is easily configurable
>> and we
>> > provided several reporters that export the metrics.
>> >
>> This of course is an option. The choice of queryable state was mostly
>> driven by the simplicity of real-time integration. Is there any reason why
>> the metrics system is better?
>>
>>
>> Best,
>> Stavros
>>
>> On Mon, Nov 27, 2017 at 4:23 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>
>> > Hi Stavros,
>> >
>> > thanks for the detailed FLIP!
>> > Model serving is an important use case and it's great to see efforts to
>> add
>> > a library for this to Flink!
>> >
>> > I've read the FLIP and would like to ask a few questions and make some
>> > suggestions.
>> >
>> > 1) Is it a strict requirement that a ML pipeline must be able to handle
>> > different input types?
>> > I understand that it makes sense to have different models for different
>> > instances of the same type, i.e., same data type but different keys.
>> Hence,
>> > the key-based joins make sense to me. However, couldn't completely
>> > different types be handled by different ML pipelines or would there be
>> > major drawbacks?
>> >
>> > 2) I think from an API point of view it would be better to not require
>> > input records to be encoded as ProtoBuf messages. Instead, the model
>> server
>> > could accept strongly-typed objects (Java/Scala) and (if necessary)
>> convert
>> > them to ProtoBuf messages internally. In case we need to support
>> different
>> > types of records (see my first point), we can introduce a Union type
>> (i.e.,
>> > an n-ary Either type). I see that we need some kind of binary encoding
>> > format for the models but maybe also this can be designed to be
>> pluggable
>> > such that later other encodings can be added.
>> >
>> > 3) I think the DataStream Java API should be supported as a first-class
>> > citizen for this library.
>> >
>> > 4) For the integration with the DataStream API, we could provide an API
>> > that receives (typed) DataStream objects, internally constructs the
>> > DataStream operators, and returns one (or more) result DataStreams. The
>> > benefit is that we don't need to change the DataStream API directly, but
>> > put a library on top. The other libraries (CEP, Table, Gelly) follow
>> this
>> > approach.
>> >
>> > 5) I'm skeptical about using queryable state to expose metrics. Did you
>> > consider using Flink's metrics system [1]? It is easily configurable
>> and we
>> > provided several reporters that export the metrics.
>> >
>> > What do you think?
>> > Best, Fabian
>> >
>> > [1]
>> > https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/metrics.html
>> >
>> > 2017-11-23 12:32 GMT+01:00 Stavros Kontopoulos <
>> st.kontopou...@gmail.com>:
>> >
>> > > Hi guys,
>> > >
>> > > Let's discuss the new FLIP proposal for model serving over Flink. The
>> > idea
>> > > is to combine previous efforts there and provide a library on top of
>> > Flink
>> > > for serving models.
>> > >
>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-23+-+Model+Serving
>> > >
>> > > Code from previous efforts can be found here: https://github.com/FlinkML
>> > >
>> > > Best,
>> > > Stavros
>> > >
>> >
>>
>
>
>
