Yeah, I am slowly getting convinced that REST may be too much overhead, and I am leaning toward using Thrift and communicating with the model handler (possibly non-Java) via some IPC.
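To make the per-message overhead argument concrete, here is a rough sketch in stdlib-only Python. It compares the bytes added by a bare-minimum HTTP POST against a simple 4-byte length-prefixed frame around the same payload. The host, path, and message fields are made up for illustration, and the length-prefixed frame is only a stand-in for a binary transport; it is not Thrift's actual wire format. Real HTTP clients send more headers than this (and the response carries its own), so the HTTP figure here is a lower bound.

```python
import json
import struct


def http_post_bytes(payload: bytes, host: str = "model-host:8080",
                    path: str = "/score") -> bytes:
    """Build a bare-minimum HTTP/1.1 POST request carrying `payload`.

    The host and path are hypothetical placeholders.
    """
    headers = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/json",
        f"Content-Length: {len(payload)}",
        "Accept: application/json",
        "Connection: keep-alive",
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + payload


def framed_bytes(payload: bytes) -> bytes:
    """Wrap `payload` in a frame: 4-byte big-endian length, then the payload."""
    return struct.pack(">I", len(payload)) + payload


def unframe(frame: bytes) -> bytes:
    """Recover the payload from a frame produced by `framed_bytes`."""
    (length,) = struct.unpack(">I", frame[:4])
    return frame[4:4 + length]


# A made-up message to score; the field names are illustrative only.
message = {"ip_src_addr": "10.0.0.1", "ip_dst_addr": "10.0.0.2", "bytes": 1432}
payload = json.dumps(message, separators=(",", ":")).encode("utf-8")

# Transport overhead is whatever each wrapper adds on top of the payload.
http_overhead = len(http_post_bytes(payload)) - len(payload)
frame_overhead = len(framed_bytes(payload)) - len(payload)

print(f"payload={len(payload)}B "
      f"http_overhead={http_overhead}B "
      f"frame_overhead={frame_overhead}B")
```

The payload is identical in both cases, so the difference is purely transport framing, and it is paid on every message scored.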
On Thu, Jul 7, 2016 at 9:15 AM, Simon Ball <[email protected]> wrote:
> Hi Casey,
>
> Just to clarify, my thought was web sockets, not raw sockets, language
> agnostic, though thrift or proton would be much better. Even with a
> non-JSON payload, rest is very heavy over http. You'd be looking at
> probably 1-2kb header overhead per packet scored just on transport
> headers. Web socket frames carry slightly less overhead per message.
>
> Simon
>
> > On 7 Jul 2016, at 16:51, Casey Stella <[email protected]> wrote:
> >
> > Regarding the performance of REST:
> >
> > Yep, so everyone seems to be worried about the performance
> > implications for REST. I made this comment on the JIRA, but I'll
> > repeat it here for broader discussion:
> >
> >> My choice of REST was mostly due to the fact that I want to support
> >> multi-language (I think that's a very important requirement) and
> >> there are REST libraries for pretty much everything. I do agree,
> >> however, that JSON transport can get chunky. How about a compromise
> >> and use REST, but the input and output payloads for scoring are Maps
> >> encoded in msgpack rather than JSON? There is a msgpack library for
> >> pretty much every language out there (almost) and certainly all of
> >> the ones we'd like to target.
> >
> >> The other option is to just create and expose protobuf bindings
> >> (thrift doesn't have a native client for R) for all of the languages
> >> that we want to support. I'm perfectly fine with that, but I had
> >> some worries about the maturity of the bindings.
> >
> >> The final option, as you suggest, is to just use raw sockets. I
> >> think if we went that route, we might have to create a layer for
> >> each language rather than relying on model creators to create a TCP
> >> server. I thought that might be a bit onerous for an MVP.
> >
> >> Given the discussion, though, what it has made me aware of is that
> >> we might not want to dictate a transport mechanism at all, but
> >> rather allow that to be pluggable and extensible (so each model
> >> would be associated with a transport mechanism handler that would
> >> know how to communicate to it. We would provide default mechanisms
> >> for msgpack over REST, JSON over REST and maybe msgpack over raw
> >> TCP.) Thoughts?
> >
> > Regarding PMML:
> >
> > I tend to agree with James that PMML is too restrictive as to models
> > it can represent and I have not had great experiences with it in
> > production. Also, the open source libraries for PMML have licensing
> > issues (jpmml requires an older version to accommodate our licensing
> > requirements).
> >
> > Regarding workflow:
> >
> > At the moment, I'd like to focus on getting a generalized
> > infrastructure for model scoring and updating put in place. This
> > means, this architecture takes up the baton from the point when a
> > model is trained/created. Also, I have attempted to be generic in
> > terms of output of the model (a map of results) so it can fit any
> > type of model that I can think of. If that's not the case, let me
> > know, though.
> >
> > For instance, for clustering, you would probably emit the cluster id
> > associated with the input and that would be added to the message as
> > it passes through the storm topology. The model is responsible for
> > processing the input and constructing properly formed output.
> >
> > Casey
> >
> > On Tue, Jul 5, 2016 at 3:45 PM, Debo Dutta (dedutta) <[email protected]>
> > wrote:
> >
> >> Following up on the thread a little late …. Awesome start Casey.
> >> Some comments:
> >> * Model execution
> >> ** I am guessing the model execution will be on YARN only for now.
> >>    This is fine, but the REST call could have an overhead - depends
> >>    on the speed.
> >> * PMML: won’t we have to choose some DSL for describing models?
> >> * Model:
> >> ** workflow vs a model - do we care about the "workflow" that leads
> >>    to the models or just the "model"? For example, we might start
> >>    with n features --> do feature selection to choose k (or apply a
> >>    transform function) --> apply a model etc
> >> * Use cases - I can see this working for n-ary classification style
> >>   models easily. Will the same mechanism be used for stuff like
> >>   clustering (or intermediate steps like feature selection alone).
> >>
> >> Thx
> >> debo
> >>
> >>> On 7/5/16, 3:24 PM, "James Sirota" <[email protected]> wrote:
> >>>
> >>> Simon,
> >>>
> >>> There are several reasons to decouple model execution from Storm:
> >>>
> >>> - Reliability: It's much easier to handle a failed service than a
> >>>   failed bolt. You can also troubleshoot without having to bring
> >>>   down the topology
> >>> - Complexity: you de-couple the model logic from Storm logic and
> >>>   can manage it independently of Storm
> >>> - Portability: you can swap the model guts (switch from Spark to
> >>>   Flink, etc) and as long as you maintain the interface you are
> >>>   good to go
> >>> - Consistency: since we want to expose our models the same way we
> >>>   expose threat intel then it makes sense to expose them as a
> >>>   service
> >>>
> >>> In our vision for Metron we want to make it easy to uptake and
> >>> share models. I think well-defined interfaces and programmatic ways
> >>> of deployment, lifecycle management, and scoring via well-defined
> >>> REST interfaces will make this task easier. We can do a few things
> >>> to
> >>>
> >>> With respect to PMML I personally had not had much luck with it in
> >>> production. I would prefer models as POJOs.
> >>>
> >>> Thanks,
> >>> James
> >>>
> >>> 04.07.2016, 16:07, "Simon Ball" <[email protected]>:
> >>>> Since the models' parameters and execution algorithm are likely to
> >>>> be small, why not have the model store push the model changes and
> >>>> scoring direct to the bolts and execute within storm.
> >>>> This negates the overhead of a rest call to the model server, and
> >>>> the need for discovery of the model server in zookeeper.
> >>>>
> >>>> Something like the way ranger policies are updated / cached in
> >>>> plugins would seem to make sense, so that we're distributing the
> >>>> model execution directly into the enrichment pipeline rather than
> >>>> collecting in a central service.
> >>>>
> >>>> This would work with simple models on single events, but may
> >>>> struggle with correlation based models. However, those could be
> >>>> handled in storm by pushing into a windowing trident topology or
> >>>> something of the sort, or even with a parallel spark streaming job
> >>>> using the same method of distributing models.
> >>>>
> >>>> The real challenge here would be stateful online models, which
> >>>> seem like a minority case which could be handled by a shared state
> >>>> store such as HBase.
> >>>>
> >>>> You still keep the ability to run different languages, and
> >>>> platforms, but wrap managing the parallelism in storm bolts rather
> >>>> than yarn containers.
> >>>>
> >>>> We could also consider basing the model protocol on a common model
> >>>> language like pmml, though that is likely to be highly limiting.
> >>>>
> >>>> Simon
> >>>>
> >>>>> On 4 Jul 2016, at 22:35, Casey Stella <[email protected]> wrote:
> >>>>>
> >>>>> This is great! I'll capture any requirements that anyone wants to
> >>>>> contribute and ensure that the proposed architecture accommodates
> >>>>> them. I think we should focus on a minimal set of requirements
> >>>>> and an architecture that does not preclude a larger set. I have
> >>>>> found that the best driver of requirements are installed users.
> >>>>> :)
> >>>>>
> >>>>> For instance, I think a lot of questions about how often to
> >>>>> update a model and such should be represented in the architecture
> >>>>> by the ability to manually update a model, so as long as we have
> >>>>> the ability to update, people can choose when and where to do it
> >>>>> (i.e. time based or some other trigger). That being said, we
> >>>>> don't want to cause too much effort for the user if we can avoid
> >>>>> it with features.
> >>>>>
> >>>>> In terms of the questions laid out, here are the constraints from
> >>>>> the proposed architecture as I see them. It'd be great to get a
> >>>>> sense of whether these constraints are too onerous or where
> >>>>> they're not opinionated enough:
> >>>>>
> >>>>> - Model versioning and retention
> >>>>>    - We do have the ability to update models, but the training
> >>>>>      and decision of when to update the model is left up to the
> >>>>>      user. We may want to think deeply about when and where
> >>>>>      automated model updates can fit
> >>>>>    - Also, retention is currently manual. It might be an easier
> >>>>>      win to set up policies around when to sunset models (after
> >>>>>      newer versions are added, for instance).
> >>>>> - Model access controls management
> >>>>>    - The architecture proposes no constraints around this. As it
> >>>>>      stands now, models are held in HDFS, so it would inherit the
> >>>>>      same security capabilities from that (user/group permissions
> >>>>>      + Ranger, etc)
> >>>>> - Requirements around concept drift
> >>>>>    - I'd love to hear user requirements around how we could
> >>>>>      automatically address concept drift. The architecture as
> >>>>>      it's proposed lets the user decide when to update models.
> >>>>> - Requirements around model output
> >>>>>    - The architecture as it stands just mandates a JSON map input
> >>>>>      and JSON map output, so it's up to the model what they want
> >>>>>      to pass back.
> >>>>>    - It's also up to the model to document its own output.
> >>>>> - Any model audit and logging requirements
> >>>>>    - The architecture proposes no constraints around this. I'd
> >>>>>      love to see community guidance around this. As it stands, we
> >>>>>      just log using the same mechanism as any YARN application.
> >>>>> - What model metrics need to be exposed
> >>>>>    - The architecture proposes no constraints around this. I'd
> >>>>>      love to see community guidance around this.
> >>>>> - Requirements around failure modes
> >>>>>    - We briefly touch on this in the document, but it is probably
> >>>>>      not complete. Service endpoint failure will result in
> >>>>>      blacklisting from a storm bolt perspective and node failure
> >>>>>      should result in a new container being started by the Yarn
> >>>>>      application master. Beyond that, the architecture isn't
> >>>>>      explicit.
> >>>>>
> >>>>>> On Mon, Jul 4, 2016 at 1:49 PM, James Sirota <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>> I left a comment on the JIRA. I think your design is promising.
> >>>>>> One other thing I would suggest is for us to crowd source
> >>>>>> requirements around model management. Specifically:
> >>>>>>
> >>>>>> Model versioning and retention
> >>>>>> Model access controls management
> >>>>>> Requirements around concept drift
> >>>>>> Requirements around model output
> >>>>>> Any model audit and logging requirements
> >>>>>> What model metrics need to be exposed
> >>>>>> Requirements around failure modes
> >>>>>>
> >>>>>> 03.07.2016, 14:00, "Casey Stella" <[email protected]>:
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> I think we are at the point where we should try to tackle Model
> >>>>>>> as a service for Metron. As such, I created a JIRA and proposed
> >>>>>>> an architecture for accomplishing this within Metron.
> >>>>>>>
> >>>>>>> My inclination is to be data science language/library agnostic
> >>>>>>> and to provide a general purpose REST infrastructure for
> >>>>>>> managing and serving models trained on historical data captured
> >>>>>>> from Metron. The assumption is that we are within the hadoop
> >>>>>>> ecosystem, so:
> >>>>>>>
> >>>>>>> - Models stored on HDFS
> >>>>>>> - REST Model Services resource-managed via Yarn
> >>>>>>> - REST Model Services discovered via Zookeeper.
> >>>>>>>
> >>>>>>> I would really appreciate community comment on the JIRA (
> >>>>>>> https://issues.apache.org/jira/browse/METRON-265). The proposed
> >>>>>>> architecture is attached as a document to that JIRA.
> >>>>>>>
> >>>>>>> I look forward to feedback!
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Casey
> >>>>>>
> >>>>>> -------------------
> >>>>>> Thank you,
> >>>>>>
> >>>>>> James Sirota
> >>>>>> PPMC- Apache Metron (Incubating)
> >>>>>> jsirota AT apache DOT org
> >>>
> >>> -------------------
> >>> Thank you,
> >>>
> >>> James Sirota
> >>> PPMC- Apache Metron (Incubating)
> >>> jsirota AT apache DOT org
> >>
