Re: Metron-265 Model as a Service

Debo Dutta (dedutta) Thu, 07 Jul 2016 09:46:55 -0700

IMO thrift >> rest. Another option is good old RPC/dRPC :)




On 7/7/16, 9:17 AM, "Casey Stella" <[email protected]> wrote:

>Yeah, I am slowly getting convinced that REST may be too much overhead and
>tending closer to using Thrift and communicating to the model handler
>(possibly in non-java) via some IPC.
>
>On Thu, Jul 7, 2016 at 9:15 AM, Simon Ball <[email protected]> wrote:
>
>> Hi Casey,
>>
>> Just to clarify, my thought was web sockets, not raw sockets, language
>> agnostic, though thrift or protobuf would be much better. Even with a non
>> JSON payload, rest is very heavy over http. You'd be looking at probably
>> 1-2kb header overhead per packet scored just on transport headers. Web
>> socket frames carry slightly less overhead per message.
>>
>> Simon
>>
>>
>> > On 7 Jul 2016, at 16:51, Casey Stella <[email protected]> wrote:
>> >
>> > Regarding the performance of REST:
>> >
>> > Yep, so everyone seems to be worried about the performance implications
>> for
>> > REST.  I made this comment on the JIRA, but I'll repeat it here for
>> broader
>> > discussion:
>> >
>> > My choice of REST was mostly due to the fact that I want to support
>> >> multi-language (I think that's a very important requirement) and there
>> are
>> >> REST libraries for pretty much everything. I do agree, however, that
>> JSON
>> >> transport can get chunky. How about a compromise and use REST, but the
>> >> input and output payloads for scoring are Maps encoded in msgpack rather
>> >> than JSON. There is a msgpack library for pretty much every language out
>> >> there (almost) and certainly all of the ones we'd like to target.
>> >
>> >
>> >> The other option is to just create and expose protobuf bindings (thrift
>> >> doesn't have a native client for R) for all of the languages that we
>> want
>> >> to support. I'm perfectly fine with that, but I had some worries about
>> the
>> >> maturity of the bindings.
>> >
>> >
>> >> The final option, as you suggest, is to just use raw sockets. I think if
>> >> we went that route, we might have to create a layer for each language
>> >> rather than relying on model creators to create a TCP server. I thought
>> >> that might be a bit onerous for a MVP.
>> >
>> >
>> >> Given the discussion, though, what it has made me aware of is that we
>> >> might not want to dictate a transport mechanism at all, but rather allow
>> >> that to be pluggable and extensible (so each model would be associated
>> with
>> >> a transport mechanism handler that would know how to communicate to it.
>> We
>> >> would provide default mechanisms for msgpack over REST, JSON over REST
>> and
>> >> maybe msgpack over raw TCP.) Thoughts?
>> >
>> >
>> > Regarding PMML:
>> >
>> > I tend to agree with James that PMML is too restrictive as to models it
>> can
>> > represent and I have not had great experiences with it in production.
>> > Also, the open source libraries for PMML have licensing issues (jpmml
>> > requires an older version to accommodate our licensing requirements).
>> >
>> > Regarding workflow:
>> >
>> > At the moment, I'd like to focus on getting a generalized infrastructure
>> > for model scoring and updating put in place.   This means, this
>> > architecture takes up the baton from the point when a model is
>> > trained/created.  Also, I have attempted to be generic in terms of output
>> > of the model (a map of results) so it can fit any type of model that I
>> can
>> > think of.  If that's not the case, let me know, though.
>> >
>> > For instance, for clustering, you would probably emit the cluster id
>> > associated with the input and that would be added to the message as it
>> > passes through the storm topology.  The model is responsible for
>> processing
>> > the input and constructing properly formed output.
>> >
>> > Casey
>> >
>> >
>> > On Tue, Jul 5, 2016 at 3:45 PM, Debo Dutta (dedutta) <[email protected]>
>> > wrote:
>> >
>> >> Following up on the thread a little late …. Awesome start Casey. Some
>> >> comments:
>> >> * Model execution
>> >> ** I am guessing the model execution will be on YARN only for now. This
>> is
>> >> fine, but the REST call could have an overhead - depends on the speed.
>> >> * PMML: won’t we have to choose some DSL for describing models?
>> >> * Model:
>> >> ** workflow vs a model -  do we care about the "workflow" that leads to
>> >> the models or just the "model"? For example, we might start with n
>> features
>> >> —> do feature selection to choose k (or apply a transform function) —>
>> >> apply a model etc
>> >> * Use cases - I can see this working for n-ary classification style
>> models
>> >> easily. Will the same mechanism be used for stuff like clustering (or
>> >> intermediate steps like feature selection alone).
>> >>
>> >> Thx
>> >> debo
>> >>
>> >>
>> >>
>> >>
>> >>> On 7/5/16, 3:24 PM, "James Sirota" <[email protected]> wrote:
>> >>>
>> >>> Simon,
>> >>>
>> >>> There are several reasons to decouple model execution from Storm:
>> >>>
>> >>> - Reliability: It's much easier to handle a failed service than a
>> failed
>> >> bolt.  You can also troubleshoot without having to bring down the
>> topology
>> >>> - Complexity: you de-couple the model logic from Storm logic and can
>> >> manage it independently of Storm
>> >>> - Portability: you can swap the model guts (switch from Spark to Flink,
>> >> etc) and as long as you maintain the interface you are good to go
>> >>> - Consistency: since we want to expose our models the same way we
>> expose
>> >> threat intel then it makes sense to expose them as a service
>> >>>
>> >>> In our vision for Metron we want to make it easy to uptake and share
>> >> models.  I think well-defined interfaces and programmatic ways of
>> >> deployment, lifecycle management, and scoring via well-defined REST
>> >> interfaces will make this task easier.  We can do a few things to
>> >>>
>> >>> With respect to PMML I personally have not had much luck with it in
>> >> production.  I would prefer models as POJOs.
>> >>>
>> >>> Thanks,
>> >>> James
>> >>>
>> >>> 04.07.2016, 16:07, "Simon Ball" <[email protected]>:
>> >>>> Since the models' parameters and execution algorithm are likely to be
>> >> small, why not have the model store push the model changes and scoring
>> >> direct to the bolts and execute within storm. This negates the overhead
>> of
>> >> a rest call to the model server, and the need for discovery of the model
>> >> server in zookeeper.
>> >>>>
>> >>>> Something like the way ranger policies are updated / cached in plugins
>> >> would seem to make sense, so that we're distributing the model execution
>> >> directly into the enrichment pipeline rather than collecting in a
>> central
>> >> service.
>> >>>>
>> >>>> This would work with simple models on single events, but may struggle
>> >> with correlation based models. However, those could be handled in storm
>> by
>> >> pushing into a windowing trident topology or something of the sort, or
>> even
>> >> with a parallel spark streaming job using the same method of
>> distributing
>> >> models.
>> >>>>
>> >>>> The real challenge here would be stateful online models, which seem
>> >> like a minority case which could be handled by a shared state store
>> such as
>> >> HBase.
>> >>>>
>> >>>> You still keep the ability to run different languages, and platforms,
>> >> but wrap managing the parallelism in storm bolts rather than yarn
>> >> containers.
>> >>>>
>> >>>> We could also consider basing the model protocol on a common model
>> >> language like pmml, though that is likely to be highly limiting.
>> >>>>
>> >>>> Simon
>> >>>>
>> >>>>> On 4 Jul 2016, at 22:35, Casey Stella <[email protected]> wrote:
>> >>>>>
>> >>>>> This is great! I'll capture any requirements that anyone wants to
>> >>>>> contribute and ensure that the proposed architecture accommodates
>> >> them. I
>> >>>>> think we should focus on a minimal set of requirements and an
>> >> architecture
>> >>>>> that does not preclude a larger set. I have found that the best
>> >> driver of
>> >>>>> requirements are installed users. :)
>> >>>>>
>> >>>>> For instance, I think a lot of questions about how often to update a
>> >> model
>> >>>>> and such should be represented in the architecture by the ability to
>> >>>>> manually update a model, so as long as we have the ability to update,
>> >>>>> people can choose when and where to do it (i.e. time based or some
>> >> other
>> >>>>> trigger). That being said, we don't want to cause too much effort for
>> >> the
>> >>>>> user if we can avoid it with features.
>> >>>>>
>> >>>>> In terms of the questions laid out, here are the constraints from the
>> >>>>> proposed architecture as I see them. It'd be great to get a sense of
>> >>>>> whether these constraints are too onerous or where they're not
>> >> opinionated
>> >>>>> enough :
>> >>>>>
>> >>>>>   - Model versioning and retention
>> >>>>>   - We do have the ability to update models, but the training and
>> >> decision
>> >>>>>      of when to update the model is left up to the user. We may want
>> >> to think
>> >>>>>      deeply about when and where automated model updates can fit
>> >>>>>      - Also, retention is currently manual. It might be an easier win
>> >> to
>> >>>>>      set up policies around when to sunset models (after newer
>> >> versions are
>> >>>>>      added, for instance).
>> >>>>>   - Model access controls management
>> >>>>>   - The architecture proposes no constraints around this. As it
>> stands
>> >>>>>      now, models are held in HDFS, so it would inherit the same
>> >> security
>> >>>>>      capabilities from that (user/group permissions + Ranger, etc)
>> >>>>>   - Requirements around concept drift
>> >>>>>   - I'd love to hear user requirements around how we could
>> >> automatically
>> >>>>>      address concept drift. The architecture as it's proposed lets
>> >> the user
>> >>>>>      decide when to update models.
>> >>>>>   - Requirements around model output
>> >>>>>   - The architecture as it stands just mandates a JSON map input and
>> >> JSON
>> >>>>>      map output, so it's up to the model what they want to pass back.
>> >>>>>      - It's also up to the model to document its own output.
>> >>>>>   - Any model audit and logging requirements
>> >>>>>   - The architecture proposes no constraints around this. I'd love to
>> >> see
>> >>>>>      community guidance around this. As it stands, we just log using
>> >> the same
>> >>>>>      mechanism as any YARN application.
>> >>>>>   - What model metrics need to be exposed
>> >>>>>   - The architecture proposes no constraints around this. I'd love to
>> >> see
>> >>>>>      community guidance around this.
>> >>>>>   - Requirements around failure modes
>> >>>>>   - We briefly touch on this in the document, but it is probably not
>> >>>>>      complete. Service endpoint failure will result in blacklisting
>> >> from a
>> >>>>>      storm bolt perspective and node failure should result in a new
>> >> container
>> >>>>>      being started by the Yarn application master. Beyond that, the
>> >>>>>      architecture isn't explicit.
>> >>>>>
>> >>>>>> On Mon, Jul 4, 2016 at 1:49 PM, James Sirota <[email protected]>
>> >> wrote:
>> >>>>>>
>> >>>>>> I left a comment on the JIRA. I think your design is promising. One
>> >>>>>> other thing I would suggest is for us to crowd source requirements
>> >> around
>> >>>>>> model management. Specifically:
>> >>>>>>
>> >>>>>> Model versioning and retention
>> >>>>>> Model access controls management
>> >>>>>> Requirements around concept drift
>> >>>>>> Requirements around model output
>> >>>>>> Any model audit and logging requirements
>> >>>>>> What model metrics need to be exposed
>> >>>>>> Requirements around failure modes
>> >>>>>>
>> >>>>>> 03.07.2016, 14:00, "Casey Stella" <[email protected]>:
>> >>>>>>> Hi all,
>> >>>>>>>
>> >>>>>>> I think we are at the point where we should try to tackle Model as
>> a
>> >>>>>>> service for Metron. As such, I created a JIRA and proposed an
>> >>>>>> architecture
>> >>>>>>> for accomplishing this within Metron.
>> >>>>>>>
>> >>>>>>> My inclination is to be data science language/library agnostic and
>> >> to
>> >>>>>>> provide a general purpose REST infrastructure for managing and
>> >> serving
>> >>>>>>> models trained on historical data captured from Metron. The
>> >> assumption is
>> >>>>>>> that we are within the hadoop ecosystem, so:
>> >>>>>>>
>> >>>>>>>   - Models stored on HDFS
>> >>>>>>>   - REST Model Services resource-managed via Yarn
>> >>>>>>>   - REST Model Services discovered via Zookeeper.
>> >>>>>>>
>> >>>>>>> I would really appreciate community comment on the JIRA (
>> >>>>>>> https://issues.apache.org/jira/browse/METRON-265). The proposed
>> >>>>>>> architecture is attached as a document to that JIRA.
>> >>>>>>>
>> >>>>>>> I look forward to feedback!
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>>
>> >>>>>>> Casey
>> >>>>>>
>> >>>>>> -------------------
>> >>>>>> Thank you,
>> >>>>>>
>> >>>>>> James Sirota
>> >>>>>> PPMC- Apache Metron (Incubating)
>> >>>>>> jsirota AT apache DOT org
>> >>>
>> >>> -------------------
>> >>> Thank you,
>> >>>
>> >>> James Sirota
>> >>> PPMC- Apache Metron (Incubating)
>> >>> jsirota AT apache DOT org
>> >>
>>

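The per-message overhead concern raised in the thread can be put into rough numbers. The sketch below is illustrative arithmetic only: the 1 kB header figure is the low end of the 1-2 kB estimate quoted above, the 200-byte payload is an assumed size for a small scoring request, and 14 bytes is the largest WebSocket frame header allowed by RFC 6455 (2 base bytes, 8 extended-length bytes, 4 masking-key bytes).

```python
def overhead_fraction(payload_bytes: int, header_bytes: int) -> float:
    """Fraction of bytes on the wire that is transport header, not payload."""
    return header_bytes / (header_bytes + payload_bytes)


PAYLOAD = 200            # assumed size of one encoded scoring request
HTTP_HEADERS = 1024      # low end of the 1-2 kB estimate from the thread
WS_FRAME_HEADER = 14     # maximum WebSocket frame header size (RFC 6455)

http_share = overhead_fraction(PAYLOAD, HTTP_HEADERS)
ws_share = overhead_fraction(PAYLOAD, WS_FRAME_HEADER)

print(f"HTTP header share of bytes sent:      {http_share:.1%}")
print(f"WebSocket header share of bytes sent: {ws_share:.1%}")
```

Under these assumptions most of each HTTP request is headers, which is the argument for a leaner transport when payloads are small.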