@Casey -- RE: "Regarding a doc to describe the model service... do you mean what models are supported?" Nope. The doc refers to a REST service that serves up a model, but this thread also seems to talk about a separate service that is responsible for executing the model.
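
To make the two roles concrete, here is a rough sketch of how I read them. Every name in it (ModelStore, ModelExecutor, the "exfil" threshold model) is made up for illustration and is not taken from the METRON-265 document; the only thing borrowed from the thread is the "map of inputs in, map of results out" scoring contract.

```python
import json
from typing import Any, Callable, Dict, Tuple

Scorer = Callable[[Dict[str, Any]], Dict[str, Any]]


class ModelStore:
    """Role 1 (hypothetical): serves up a model artifact; it never scores."""

    def __init__(self) -> None:
        self._models: Dict[Tuple[str, int], bytes] = {}

    def put(self, name: str, version: int, artifact: bytes) -> None:
        self._models[(name, version)] = artifact

    def get(self, name: str, version: int) -> bytes:
        return self._models[(name, version)]


class ModelExecutor:
    """Role 2 (hypothetical): executes a model, map of inputs -> map of results."""

    def __init__(self, score_fn: Scorer) -> None:
        self._score_fn = score_fn

    def score(self, features: Dict[str, Any]) -> Dict[str, Any]:
        return self._score_fn(features)


def load_threshold_model(artifact: bytes) -> Scorer:
    """Toy loader: the artifact is just a JSON document holding one threshold."""
    threshold = json.loads(artifact.decode("utf-8"))["threshold"]

    def score(features: Dict[str, Any]) -> Dict[str, Any]:
        return {"is_alert": features["bytes_out"] > threshold}

    return score


# The store hands out the artifact; a separate executor does the scoring.
store = ModelStore()
store.put("exfil", 1, json.dumps({"threshold": 1000}).encode("utf-8"))
executor = ModelExecutor(load_threshold_model(store.get("exfil", 1)))
result = executor.score({"bytes_out": 4096})
```

Whether these two roles end up in one service or two, and what transport sits in front of either, is exactly the part I would like the doc to spell out.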

On Thu, Jul 7, 2016 at 12:46 PM, Debo Dutta (dedutta) <[email protected]> wrote:

> IMO thrift >> rest. Another option is good old RPC/dRPC :)
>
> On 7/7/16, 9:17 AM, "Casey Stella" <[email protected]> wrote:
>
> >Yeah, I am slowly getting convinced that REST may be too much overhead and tending closer to using Thrift and communicating to the model handler (possibly in non-java) via some IPC.
> >
> >On Thu, Jul 7, 2016 at 9:15 AM, Simon Ball <[email protected]> wrote:
> >
> >> Hi Casey,
> >>
> >> Just to clarify, my thought was web sockets, not raw sockets, language agnostic, though thrift or proton would be much better. Even with a non JSON payload, rest is very heavy over http. You'd be looking at probably 1-2kb header overhead per packet scored just on transport headers. Web socket frames carry slightly less overhead per message.
> >>
> >> Simon
> >>
> >> > On 7 Jul 2016, at 16:51, Casey Stella <[email protected]> wrote:
> >> >
> >> > Regarding the performance of REST:
> >> >
> >> > Yep, so everyone seems to be worried about the performance implications for REST. I made this comment on the JIRA, but I'll repeat it here for broader discussion:
> >> >
> >> >> My choice of REST was mostly due to the fact that I want to support multi-language (I think that's a very important requirement) and there are REST libraries for pretty much everything. I do agree, however, that JSON transport can get chunky. How about a compromise and use REST, but the input and output payloads for scoring are Maps encoded in msgpack rather than JSON. There is a msgpack library for pretty much every language out there (almost) and certainly all of the ones we'd like to target.
> >> >>
> >> >> The other option is to just create and expose protobuf bindings (thrift doesn't have a native client for R) for all of the languages that we want to support. I'm perfectly fine with that, but I had some worries about the maturity of the bindings.
> >> >>
> >> >> The final option, as you suggest, is to just use raw sockets. I think if we went that route, we might have to create a layer for each language rather than relying on model creators to create a TCP server. I thought that might be a bit onerous for a MVP.
> >> >>
> >> >> Given the discussion, though, what it has made me aware of is that we might not want to dictate a transport mechanism at all, but rather allow that to be pluggable and extensible (so each model would be associated with a transport mechanism handler that would know how to communicate to it. We would provide default mechanisms for msgpack over REST, JSON over REST and maybe msgpack over raw TCP.) Thoughts?
> >> >
> >> > Regarding PMML:
> >> >
> >> > I tend to agree with James that PMML is too restrictive as to models it can represent and I have not had great experiences with it in production. Also, the open source libraries for PMML have licensing issues (jpmml requires an older version to accommodate our licensing requirements).
> >> >
> >> > Regarding workflow:
> >> >
> >> > At the moment, I'd like to focus on getting a generalized infrastructure for model scoring and updating put in place. This means this architecture takes up the baton from the point when a model is trained/created. Also, I have attempted to be generic in terms of output of the model (a map of results) so it can fit any type of model that I can think of. If that's not the case, let me know, though.
> >> >
> >> > For instance, for clustering, you would probably emit the cluster id associated with the input and that would be added to the message as it passes through the storm topology. The model is responsible for processing the input and constructing properly formed output.
> >> >
> >> > Casey
> >> >
> >> > On Tue, Jul 5, 2016 at 3:45 PM, Debo Dutta (dedutta) <[email protected]> wrote:
> >> >
> >> >> Following up on the thread a little late ... Awesome start Casey. Some comments:
> >> >> * Model execution
> >> >> ** I am guessing the model execution will be on YARN only for now. This is fine, but the REST call could have an overhead - depends on the speed.
> >> >> * PMML: won't we have to choose some DSL for describing models?
> >> >> * Model:
> >> >> ** workflow vs a model - do we care about the "workflow" that leads to the models or just the "model"? For example, we might start with n features -> do feature selection to choose k (or apply a transform function) -> apply a model etc
> >> >> * Use cases - I can see this working for n-ary classification style models easily. Will the same mechanism be used for stuff like clustering (or intermediate steps like feature selection alone).
> >> >>
> >> >> Thx
> >> >> debo
> >> >>
> >> >>> On 7/5/16, 3:24 PM, "James Sirota" <[email protected]> wrote:
> >> >>>
> >> >>> Simon,
> >> >>>
> >> >>> There are several reasons to decouple model execution from Storm:
> >> >>>
> >> >>> - Reliability: It's much easier to handle a failed service than a failed bolt. You can also troubleshoot without having to bring down the topology
> >> >>> - Complexity: you de-couple the model logic from Storm logic and can manage it independently of Storm
> >> >>> - Portability: you can swap the model guts (switch from Spark to Flink, etc) and as long as you maintain the interface you are good to go
> >> >>> - Consistency: since we want to expose our models the same way we expose threat intel then it makes sense to expose them as a service
> >> >>>
> >> >>> In our vision for Metron we want to make it easy to uptake and share models. I think well-defined interfaces and programmatic ways of deployment, lifecycle management, and scoring via well-defined REST interfaces will make this task easier. We can do a few things to
> >> >>>
> >> >>> With respect to PMML I personally had not had much luck with it in production. I would prefer models as POJOs.
> >> >>>
> >> >>> Thanks,
> >> >>> James
> >> >>>
> >> >>> 04.07.2016, 16:07, "Simon Ball" <[email protected]>:
> >> >>>> Since the models' parameters and execution algorithm are likely to be small, why not have the model store push the model changes and scoring direct to the bolts and execute within storm. This negates the overhead of a rest call to the model server, and the need for discovery of the model server in zookeeper.
> >> >>>>
> >> >>>> Something like the way ranger policies are updated / cached in plugins would seem to make sense, so that we're distributing the model execution directly into the enrichment pipeline rather than collecting in a central service.
> >> >>>>
> >> >>>> This would work with simple models on single events, but may struggle with correlation based models. However, those could be handled in storm by pushing into a windowing trident topology or something of the sort, or even with a parallel spark streaming job using the same method of distributing models.
> >> >>>>
> >> >>>> The real challenge here would be stateful online models, which seem like a minority case which could be handled by a shared state store such as HBase.
> >> >>>>
> >> >>>> You still keep the ability to run different languages, and platforms, but wrap managing the parallelism in storm bolts rather than yarn containers.
> >> >>>>
> >> >>>> We could also consider basing the model protocol on a common model language like pmml, though that is likely to be highly limiting.
> >> >>>>
> >> >>>> Simon
> >> >>>>
> >> >>>>> On 4 Jul 2016, at 22:35, Casey Stella <[email protected]> wrote:
> >> >>>>>
> >> >>>>> This is great! I'll capture any requirements that anyone wants to contribute and ensure that the proposed architecture accommodates them. I think we should focus on a minimal set of requirements and an architecture that does not preclude a larger set. I have found that the best driver of requirements are installed users. :)
> >> >>>>>
> >> >>>>> For instance, I think a lot of questions about how often to update a model and such should be represented in the architecture by the ability to manually update a model, so as long as we have the ability to update, people can choose when and where to do it (i.e. time based or some other trigger). That being said, we don't want to cause too much effort for the user if we can avoid it with features.
> >> >>>>>
> >> >>>>> In terms of the questions laid out, here are the constraints from the proposed architecture as I see them. It'd be great to get a sense of whether these constraints are too onerous or where they're not opinionated enough:
> >> >>>>>
> >> >>>>> - Model versioning and retention
> >> >>>>>    - We do have the ability to update models, but the training and decision of when to update the model is left up to the user. We may want to think deeply about when and where automated model updates can fit
> >> >>>>>    - Also, retention is currently manual. It might be an easier win to set up policies around when to sunset models (after newer versions are added, for instance).
> >> >>>>> - Model access controls management
> >> >>>>>    - The architecture proposes no constraints around this. As it stands now, models are held in HDFS, so it would inherit the same security capabilities from that (user/group permissions + Ranger, etc)
> >> >>>>> - Requirements around concept drift
> >> >>>>>    - I'd love to hear user requirements around how we could automatically address concept drift. The architecture as it's proposed lets the user decide when to update models.
> >> >>>>> - Requirements around model output
> >> >>>>>    - The architecture as it stands just mandates a JSON map input and JSON map output, so it's up to the model what they want to pass back.
> >> >>>>>    - It's also up to the model to document its own output.
> >> >>>>> - Any model audit and logging requirements
> >> >>>>>    - The architecture proposes no constraints around this. I'd love to see community guidance around this. As it stands, we just log using the same mechanism as any YARN application.
> >> >>>>> - What model metrics need to be exposed
> >> >>>>>    - The architecture proposes no constraints around this. I'd love to see community guidance around this.
> >> >>>>> - Requirements around failure modes
> >> >>>>>    - We briefly touch on this in the document, but it is probably not complete. Service endpoint failure will result in blacklisting from a storm bolt perspective and node failure should result in a new container being started by the Yarn application master. Beyond that, the architecture isn't explicit.
> >> >>>>>
> >> >>>>>> On Mon, Jul 4, 2016 at 1:49 PM, James Sirota <[email protected]> wrote:
> >> >>>>>>
> >> >>>>>> I left a comment on the JIRA. I think your design is promising. One other thing I would suggest is for us to crowd source requirements around model management. Specifically:
> >> >>>>>>
> >> >>>>>> Model versioning and retention
> >> >>>>>> Model access controls management
> >> >>>>>> Requirements around concept drift
> >> >>>>>> Requirements around model output
> >> >>>>>> Any model audit and logging requirements
> >> >>>>>> What model metrics need to be exposed
> >> >>>>>> Requirements around failure modes
> >> >>>>>>
> >> >>>>>> 03.07.2016, 14:00, "Casey Stella" <[email protected]>:
> >> >>>>>>> Hi all,
> >> >>>>>>>
> >> >>>>>>> I think we are at the point where we should try to tackle Model as a service for Metron. As such, I created a JIRA and proposed an architecture for accomplishing this within Metron.
> >> >>>>>>>
> >> >>>>>>> My inclination is to be data science language/library agnostic and to provide a general purpose REST infrastructure for managing and serving models trained on historical data captured from Metron. The assumption is that we are within the hadoop ecosystem, so:
> >> >>>>>>>
> >> >>>>>>> - Models stored on HDFS
> >> >>>>>>> - REST Model Services resource-managed via Yarn
> >> >>>>>>> - REST Model Services discovered via Zookeeper.
> >> >>>>>>>
> >> >>>>>>> I would really appreciate community comment on the JIRA (https://issues.apache.org/jira/browse/METRON-265). The proposed architecture is attached as a document to that JIRA.
> >> >>>>>>>
> >> >>>>>>> I look forward to feedback!
> >> >>>>>>>
> >> >>>>>>> Best,
> >> >>>>>>>
> >> >>>>>>> Casey
> >> >>>>>>
> >> >>>>>> -------------------
> >> >>>>>> Thank you,
> >> >>>>>>
> >> >>>>>> James Sirota
> >> >>>>>> PPMC- Apache Metron (Incubating)
> >> >>>>>> jsirota AT apache DOT org
> >> >>>
> >> >>> -------------------
> >> >>> Thank you,
> >> >>>
> >> >>> James Sirota
> >> >>> PPMC- Apache Metron (Incubating)
> >> >>> jsirota AT apache DOT org

--
Thanks,
Andrew

Subscribe to my book: Streaming Data <http://manning.com/psaltis>
<https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
