Re: Metron-265 Model as a Service

Simon Ball Thu, 07 Jul 2016 09:21:46 -0700

Hi Casey,

Just to clarify, my thought was web sockets, not raw sockets: language 
agnostic, though thrift or protobuf would be much better. Even with a non-JSON 
payload, REST is very heavy over HTTP. You'd be looking at probably 1-2 KB of 
overhead per packet scored just on transport headers. Web socket frames carry 
far less overhead per message.
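
To put rough numbers on that, here is a minimal sketch of the per-message framing cost. The WebSocket header sizes follow the RFC 6455 frame layout; the HTTP figure is only the 1-2 KB estimate above, not a measurement, and the 300-byte payload is a made-up example size.

```python
def websocket_frame_overhead(payload_len: int, masked: bool) -> int:
    """Header bytes for a single WebSocket frame (RFC 6455 framing)."""
    header = 2  # FIN/opcode byte + mask bit and 7-bit length byte
    if payload_len > 65535:
        header += 8  # 64-bit extended payload length
    elif payload_len > 125:
        header += 2  # 16-bit extended payload length
    if masked:
        header += 4  # masking key, required on client-to-server frames
    return header


# Assumption: the 1-2 KB per-request HTTP header estimate from this thread.
HTTP_HEADER_BYTES = (1024, 2048)

if __name__ == "__main__":
    payload = 300  # hypothetical size of one encoded message, in bytes
    ws = websocket_frame_overhead(payload, masked=True)
    low, high = HTTP_HEADER_BYTES
    print(f"WebSocket framing: {ws} bytes per message")
    print(f"HTTP headers (estimate): {low}-{high} bytes per message")
```

This ignores the one-time HTTP upgrade handshake that opens the web socket, which is paid once per connection rather than once per message scored.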


Simon 


> On 7 Jul 2016, at 16:51, Casey Stella <[email protected]> wrote:
> 
> Regarding the performance of REST:
> 
> Yep, so everyone seems to be worried about the performance implications for
> REST.  I made this comment on the JIRA, but I'll repeat it here for broader
> discussion:
> 
> My choice of REST was mostly due to the fact that I want to support
>> multi-language (I think that's a very important requirement) and there are
>> REST libraries for pretty much everything. I do agree, however, that JSON
>> transport can get chunky. How about a compromise and use REST, but the
>> input and output payloads for scoring are Maps encoded in msgpack rather
>> than JSON. There is a msgpack library for pretty much every language out
>> there (almost) and certainly all of the ones we'd like to target.
> 
> 
>> The other option is to just create and expose protobuf bindings (thrift
>> doesn't have a native client for R) for all of the languages that we want
>> to support. I'm perfectly fine with that, but I had some worries about the
>> maturity of the bindings.
> 
> 
>> The final option, as you suggest, is to just use raw sockets. I think if
>> we went that route, we might have to create a layer for each language
>> rather than relying on model creators to create a TCP server. I thought
>> that might be a bit onerous for an MVP.
> 
> 
>> Given the discussion, though, what it has made me aware of is that we
>> might not want to dictate a transport mechanism at all, but rather allow
>> that to be pluggable and extensible (so each model would be associated with
>> a transport mechanism handler that would know how to communicate with it. We
>> would provide default mechanisms for msgpack over REST, JSON over REST and
>> maybe msgpack over raw TCP.) Thoughts?
> 
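
A minimal sketch of what such a pluggable transport handler could look like. Everything named here (TransportHandler, JsonOverCallable, the registry, the "example_model" name and its scoring rule) is hypothetical and not taken from the proposed architecture. JSON from the standard library is used so the sketch runs as-is; a msgpack handler would plug in the same way with a different encoder.

```python
import json
from abc import ABC, abstractmethod
from typing import Callable, Dict


class TransportHandler(ABC):
    """Knows how to send a feature map to one model and return its result map."""

    @abstractmethod
    def score(self, features: dict) -> dict:
        ...


class JsonOverCallable(TransportHandler):
    """JSON-encodes the request and JSON-decodes the response.

    `send` stands in for the wire (an HTTP POST, a socket write, ...):
    it takes request bytes and returns response bytes.
    """

    def __init__(self, send: Callable[[bytes], bytes]):
        self._send = send

    def score(self, features: dict) -> dict:
        request = json.dumps(features).encode("utf-8")
        return json.loads(self._send(request).decode("utf-8"))


# Hypothetical registry: model name -> handler that can talk to it.
HANDLERS: Dict[str, TransportHandler] = {}


def register(model: str, handler: TransportHandler) -> None:
    HANDLERS[model] = handler


def score(model: str, features: dict) -> dict:
    """Route a scoring request through whichever handler the model registered."""
    return HANDLERS[model].score(features)


def fake_model_endpoint(request: bytes) -> bytes:
    """In-process stand-in for a remote model service (made-up scoring rule)."""
    features = json.loads(request.decode("utf-8"))
    result = {"is_malicious": features.get("bytes_out", 0) > 1000}
    return json.dumps(result).encode("utf-8")


register("example_model", JsonOverCallable(fake_model_endpoint))
print(score("example_model", {"bytes_out": 5000}))
```

The caller only ever sees map in, map out, so swapping the encoding or the wire protocol for a given model would not touch the code that does the scoring call.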
> 
> Regarding PMML:
> 
> I tend to agree with James that PMML is too restrictive as to the models it can
> represent and I have not had great experiences with it in production.
> Also, the open source libraries for PMML have licensing issues (jpmml
> requires an older version to accommodate our licensing requirements).
> 
> Regarding workflow:
> 
> At the moment, I'd like to focus on getting a generalized infrastructure
> for model scoring and updating put in place.   This means, this
> architecture takes up the baton from the point when a model is
> trained/created.  Also, I have attempted to be generic in terms of output
> of the model (a map of results) so it can fit any type of model that I can
> think of.  If that's not the case, let me know, though.
> 
> For instance, for clustering, you would probably emit the cluster id
> associated with the input and that would be added to the message as it
> passes through the storm topology.  The model is responsible for processing
> the input and constructing properly formed output.
> 
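
As a rough illustration of the clustering case, here is a sketch in which the model returns a result map containing a cluster id and that map is merged into the message. The centroids, the feature names and the "cluster_id" key are all assumptions made for the example; nothing here reflects actual Metron field names or topology code.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical trained model: two centroids over two made-up features.
FEATURES: Tuple[str, ...] = ("bytes_in", "bytes_out")
CENTROIDS: List[Sequence[float]] = [(0.0, 0.0), (10.0, 10.0)]


def nearest_cluster(point: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""

    def dist2(centroid: Sequence[float]) -> float:
        return sum((p - c) ** 2 for p, c in zip(point, centroid))

    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))


def score(message: Dict[str, float]) -> Dict[str, int]:
    """The model's side: map of inputs in, map of results out."""
    point = [message[name] for name in FEATURES]
    return {"cluster_id": nearest_cluster(point, CENTROIDS)}


def enrich(message: Dict[str, float]) -> Dict[str, float]:
    """The pipeline's side: add the model's result map to the message."""
    return {**message, **score(message)}


print(enrich({"bytes_in": 9.0, "bytes_out": 11.0}))
```

The pipeline side never needs to know it is talking to a clustering model; it just merges whatever map comes back.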
> Casey
> 
> 
> On Tue, Jul 5, 2016 at 3:45 PM, Debo Dutta (dedutta) <[email protected]>
> wrote:
> 
>> Following up on the thread a little late …. Awesome start Casey. Some
>> comments:
>> * Model execution
>> ** I am guessing the model execution will be on YARN only for now. This is
>> fine, but the REST call could have an overhead - depends on the speed.
>> * PMML: won’t we have to choose some DSL for describing models?
>> * Model:
>> ** workflow vs a model -  do we care about the “workflow" that leads to
>> the models or just the “model"? For example, we might start with n features
>> —> do feature selection to choose k (or apply a transform function) —>
>> apply a model etc
>> * Use cases - I can see this working for n-ary classification style models
>> easily. Will the same mechanism be used for stuff like clustering (or
>> intermediate steps like feature selection alone).
>> 
>> Thx
>> debo
>> 
>> 
>> 
>> 
>>> On 7/5/16, 3:24 PM, "James Sirota" <[email protected]> wrote:
>>> 
>>> Simon,
>>> 
>>> There are several reasons to decouple model execution from Storm:
>>> 
>>> - Reliability: It's much easier to handle a failed service than a failed
>> bolt.  You can also troubleshoot without having to bring down the topology
>>> - Complexity: you de-couple the model logic from Storm logic and can
>> manage it independently of Storm
>>> - Portability: you can swap the model guts (switch from Spark to Flink,
>> etc) and as long as you maintain the interface you are good to go
>>> - Consistency: since we want to expose our models the same way we expose
>> threat intel then it makes sense to expose them as a service
>>> 
>>> In our vision for Metron we want to make it easy to uptake and share
>> models.  I think well-defined interfaces and programmatic ways of
>> deployment, lifecycle management, and scoring via well-defined REST
>> interfaces will make this task easier.  We can do a few things to
>>> 
>>> With respect to PMML I personally have not had much luck with it in
>> production.  I would prefer models as POJOs.
>>> 
>>> Thanks,
>>> James
>>> 
>>> 04.07.2016, 16:07, "Simon Ball" <[email protected]>:
>>>> Since the models' parameters and execution algorithm are likely to be
>> small, why not have the model store push the model changes and scoring
>> direct to the bolts and execute within storm. This negates the overhead of
>> a rest call to the model server, and the need for discovery of the model
>> server in zookeeper.
>>>> 
>>>> Something like the way ranger policies are updated / cached in plugins
>> would seem to make sense, so that we're distributing the model execution
>> directly into the enrichment pipeline rather than collecting in a central
>> service.
>>>> 
>>>> This would work with simple models on single events, but may struggle
>> with correlation based models. However, those could be handled in storm by
>> pushing into a windowing trident topology or something of the sort, or even
>> with a parallel spark streaming job using the same method of distributing
>> models.
>>>> 
>>>> The real challenge here would be stateful online models, which seem
>> like a minority case which could be handled by a shared state store such as
>> HBase.
>>>> 
>>>> You still keep the ability to run different languages, and platforms,
>> but wrap managing the parallelism in storm bolts rather than yarn
>> containers.
>>>> 
>>>> We could also consider basing the model protocol on a common model
>> language like PMML, though that is likely to be highly limiting.
>>>> 
>>>> Simon
>>>> 
>>>>> On 4 Jul 2016, at 22:35, Casey Stella <[email protected]> wrote:
>>>>> 
>>>>> This is great! I'll capture any requirements that anyone wants to
>>>>> contribute and ensure that the proposed architecture accommodates
>> them. I
>>>>> think we should focus on a minimal set of requirements and an
>> architecture
>>>>> that does not preclude a larger set. I have found that the best
>> drivers of
>>>>> requirements are installed users. :)
>>>>> 
>>>>> For instance, I think a lot of questions about how often to update a
>> model
>>>>> and such should be represented in the architecture by the ability to
>>>>> manually update a model, so as long as we have the ability to update,
>>>>> people can choose when and where to do it (i.e. time based or some
>> other
>>>>> trigger). That being said, we don't want to cause too much effort for
>> the
>>>>> user if we can avoid it with features.
>>>>> 
>>>>> In terms of the questions laid out, here are the constraints from the
>>>>> proposed architecture as I see them. It'd be great to get a sense of
>>>>> whether these constraints are too onerous or where they're not
>> opinionated
>>>>> enough :
>>>>> 
>>>>>   - Model versioning and retention
>>>>>   - We do have the ability to update models, but the training and
>> decision
>>>>>      of when to update the model is left up to the user. We may want
>> to think
>>>>>      deeply about when and where automated model updates can fit
>>>>>      - Also, retention is currently manual. It might be an easier win
>> to
>>>>>      set up policies around when to sunset models (after newer
>> versions are
>>>>>      added, for instance).
>>>>>   - Model access controls management
>>>>>   - The architecture proposes no constraints around this. As it stands
>>>>>      now, models are held in HDFS, so it would inherit the same
>> security
>>>>>      capabilities from that (user/group permissions + Ranger, etc)
>>>>>   - Requirements around concept drift
>>>>>   - I'd love to hear user requirements around how we could
>> automatically
>>>>>      address concept drift. The architecture as it's proposed lets
>> the user
>>>>>      decide when to update models.
>>>>>   - Requirements around model output
>>>>>   - The architecture as it stands just mandates a JSON map input and
>> JSON
>>>>>      map output, so it's up to the model what they want to pass back.
>>>>>      - It's also up to the model to document its own output.
>>>>>   - Any model audit and logging requirements
>>>>>   - The architecture proposes no constraints around this. I'd love to
>> see
>>>>>      community guidance around this. As it stands, we just log using
>> the same
>>>>>      mechanism as any YARN application.
>>>>>   - What model metrics need to be exposed
>>>>>   - The architecture proposes no constraints around this. I'd love to
>> see
>>>>>      community guidance around this.
>>>>>   - Requirements around failure modes
>>>>>   - We briefly touch on this in the document, but it is probably not
>>>>>      complete. Service endpoint failure will result in blacklisting
>> from a
>>>>>      storm bolt perspective and node failure should result in a new
>> container
>>>>>      being started by the Yarn application master. Beyond that, the
>>>>>      architecture isn't explicit.
>>>>> 
>>>>>> On Mon, Jul 4, 2016 at 1:49 PM, James Sirota <[email protected]>
>> wrote:
>>>>>> 
>>>>>> I left a comment on the JIRA. I think your design is promising. One
>>>>>> other thing I would suggest is for us to crowd source requirements
>> around
>>>>>> model management. Specifically:
>>>>>> 
>>>>>> Model versioning and retention
>>>>>> Model access controls management
>>>>>> Requirements around concept drift
>>>>>> Requirements around model output
>>>>>> Any model audit and logging requirements
>>>>>> What model metrics need to be exposed
>>>>>> Requirements around failure modes
>>>>>> 
>>>>>> 03.07.2016, 14:00, "Casey Stella" <[email protected]>:
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I think we are at the point where we should try to tackle Model as a
>>>>>>> service for Metron. As such, I created a JIRA and proposed an
>>>>>> architecture
>>>>>>> for accomplishing this within Metron.
>>>>>>> 
>>>>>>> My inclination is to be data science language/library agnostic and
>> to
>>>>>>> provide a general purpose REST infrastructure for managing and
>> serving
>>>>>>> models trained on historical data captured from Metron. The
>> assumption is
>>>>>>> that we are within the hadoop ecosystem, so:
>>>>>>> 
>>>>>>>   - Models stored on HDFS
>>>>>>>   - REST Model Services resource-managed via Yarn
>>>>>>>   - REST Model Services discovered via Zookeeper.
>>>>>>> 
>>>>>>> I would really appreciate community comment on the JIRA (
>>>>>>> https://issues.apache.org/jira/browse/METRON-265). The proposed
>>>>>>> architecture is attached as a document to that JIRA.
>>>>>>> 
>>>>>>> I look forward to feedback!
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> Casey
>>>>>> 
>>>>>> -------------------
>>>>>> Thank you,
>>>>>> 
>>>>>> James Sirota
>>>>>> PPMC- Apache Metron (Incubating)
>>>>>> jsirota AT apache DOT org
>>> 
>>> -------------------
>>> Thank you,
>>> 
>>> James Sirota
>>> PPMC- Apache Metron (Incubating)
>>> jsirota AT apache DOT org
>>
