Following up on the thread a little late.... Awesome start, Casey. Some comments:

* Model execution
** I am guessing the model execution will be on YARN only for now. This is fine, but the REST call could have an overhead - depends on the speed.
* PMML: won't we have to choose some DSL for describing models?
* Model:
** Workflow vs. a model - do we care about the "workflow" that leads to the model or just the "model"? For example, we might start with n features -> do feature selection to choose k (or apply a transform function) -> apply a model, etc.
* Use cases - I can see this working for n-ary classification style models easily. Will the same mechanism be used for stuff like clustering (or intermediate steps like feature selection alone)?

Thx
debo
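One way to read the "workflow vs. model" question against the proposal later in the thread is that a multi-step workflow can simply be packaged behind the same single scoring interface as a plain model. A minimal sketch, assuming a map-in/map-out contract like the one Casey describes below; every class, method, and feature name here is hypothetical and not part of the proposal:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a multi-step "workflow" (feature selection followed by a
// classifier) packaged behind a single map-in/map-out model interface.
public class WorkflowAsModel {

  /** The only contract the serving layer would need to know about. */
  public interface Model {
    Map<String, Object> score(Map<String, Object> input);
  }

  /** Toy feature-selection step: keep only the features the "training" chose. */
  static Map<String, Object> selectFeatures(Map<String, Object> input) {
    Map<String, Object> selected = new HashMap<>();
    for (String feature : new String[] {"length", "entropy"}) {
      if (input.containsKey(feature)) {
        selected.put(feature, input.get(feature));
      }
    }
    return selected;
  }

  /** Toy classifier: threshold on a single selected feature. */
  static Map<String, Object> classify(Map<String, Object> features) {
    double entropy = ((Number) features.getOrDefault("entropy", 0.0)).doubleValue();
    Map<String, Object> output = new HashMap<>();
    output.put("is_malicious", entropy > 3.5);
    output.put("score", entropy);
    return output;
  }

  public static void main(String[] args) {
    // The whole workflow is exposed as one Model; the caller never sees the steps.
    Model workflow = input -> classify(selectFeatures(input));

    Map<String, Object> message = new HashMap<>();
    message.put("length", 23);
    message.put("entropy", 4.2);
    System.out.println(workflow.score(message)); // e.g. {score=4.2, is_malicious=true}
  }
}

Whether the intermediate steps (feature selection, transforms) also need to be individually addressable is exactly the open question Debo raises above.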
On 7/5/16, 3:24 PM, "James Sirota" <[email protected]> wrote:

>Simon,
>
>There are several reasons to decouple model execution from Storm:
>
>- Reliability: It's much easier to handle a failed service than a failed bolt.
>  You can also troubleshoot without having to bring down the topology.
>- Complexity: You decouple the model logic from Storm logic and can manage it
>  independently of Storm.
>- Portability: You can swap the model guts (switch from Spark to Flink, etc.)
>  and as long as you maintain the interface you are good to go.
>- Consistency: Since we want to expose our models the same way we expose
>  threat intel, it makes sense to expose them as a service.
>
>In our vision for Metron we want to make it easy to uptake and share models.
>I think well-defined interfaces and programmatic ways of deployment, lifecycle
>management, and scoring via well-defined REST interfaces will make this task
>easier. We can do a few things to
>
>With respect to PMML, I personally have not had much luck with it in
>production. I would prefer models as POJOs.
>
>Thanks,
>James
>
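James's combination of a well-defined REST interface with models as POJOs could be as small as the sketch below: a plain-Java model fronted by a tiny HTTP endpoint, independent of Storm. This is only an illustration, not the proposed Metron implementation; it uses the JDK's built-in HttpServer (Java 9+ for readAllBytes) to stay self-contained, and the port, /score path, and response fields are invented:

import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical "model as a service" sketch: a POJO model exposed over REST.
public class ModelServiceSketch {

  // Stand-in for a real model: returns a fixed JSON score so the example
  // stays dependency-free (no JSON library needed).
  static String score(String requestJson) {
    return "{\"is_malicious\": false, \"score\": 0.12}";
  }

  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8765), 0);
    server.createContext("/score", exchange -> {
      try (InputStream in = exchange.getRequestBody()) {
        String request = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        byte[] response = score(request).getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "application/json");
        exchange.sendResponseHeaders(200, response.length);
        try (OutputStream out = exchange.getResponseBody()) {
          out.write(response);
        }
      }
    });
    server.start();
    System.out.println("Model service listening on http://localhost:8765/score");
  }
}

A caller (a Storm bolt, a script, a test) would then POST a JSON map and get a JSON map back, e.g. curl -X POST -d '{"domain":"example.com"}' http://localhost:8765/score.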
>04.07.2016, 16:07, "Simon Ball" <[email protected]>:
>> Since the models' parameters and execution algorithm are likely to be small,
>> why not have the model store push the model changes and scoring directly to
>> the bolts and execute within Storm? This negates the overhead of a REST call
>> to the model server, and the need for discovery of the model server in
>> ZooKeeper.
>>
>> Something like the way Ranger policies are updated / cached in plugins would
>> seem to make sense, so that we're distributing the model execution directly
>> into the enrichment pipeline rather than collecting in a central service.
>>
>> This would work with simple models on single events, but may struggle with
>> correlation-based models. However, those could be handled in Storm by
>> pushing into a windowing Trident topology or something of the sort, or even
>> with a parallel Spark Streaming job using the same method of distributing
>> models.
>>
>> The real challenge here would be stateful online models, which seem like a
>> minority case that could be handled by a shared state store such as HBase.
>>
>> You still keep the ability to run different languages and platforms, but
>> wrap managing the parallelism in Storm bolts rather than YARN containers.
>>
>> We could also consider basing the model protocol on a common model
>> language like PMML, though that is likely to be highly limiting.
>>
>> Simon
>>
>>> On 4 Jul 2016, at 22:35, Casey Stella <[email protected]> wrote:
>>>
>>> This is great! I'll capture any requirements that anyone wants to
>>> contribute and ensure that the proposed architecture accommodates them. I
>>> think we should focus on a minimal set of requirements and an architecture
>>> that does not preclude a larger set. I have found that the best driver of
>>> requirements is installed users. :)
>>>
>>> For instance, I think a lot of questions about how often to update a model
>>> and such should be represented in the architecture by the ability to
>>> manually update a model, so, as long as we have the ability to update,
>>> people can choose when and where to do it (i.e. time-based or some other
>>> trigger). That being said, we don't want to cause too much effort for the
>>> user if we can avoid it with features.
>>>
>>> In terms of the questions laid out, here are the constraints from the
>>> proposed architecture as I see them. It'd be great to get a sense of
>>> whether these constraints are too onerous or where they're not opinionated
>>> enough:
>>>
>>> - Model versioning and retention
>>>   - We do have the ability to update models, but the training and decision
>>>     of when to update the model is left up to the user. We may want to
>>>     think deeply about when and where automated model updates can fit.
>>>   - Also, retention is currently manual. It might be an easier win to
>>>     set up policies around when to sunset models (after newer versions are
>>>     added, for instance).
>>> - Model access controls management
>>>   - The architecture proposes no constraints around this. As it stands
>>>     now, models are held in HDFS, so it would inherit the same security
>>>     capabilities from that (user/group permissions + Ranger, etc.).
>>> - Requirements around concept drift
>>>   - I'd love to hear user requirements around how we could automatically
>>>     address concept drift. The architecture as it's proposed lets the
>>>     user decide when to update models.
>>> - Requirements around model output
>>>   - The architecture as it stands just mandates a JSON map input and JSON
>>>     map output, so it's up to the model what it wants to pass back.
>>>   - It's also up to the model to document its own output.
>>> - Any model audit and logging requirements
>>>   - The architecture proposes no constraints around this. I'd love to see
>>>     community guidance around this. As it stands, we just log using the
>>>     same mechanism as any YARN application.
>>> - What model metrics need to be exposed
>>>   - The architecture proposes no constraints around this. I'd love to see
>>>     community guidance around this.
>>> - Requirements around failure modes
>>>   - We briefly touch on this in the document, but it is probably not
>>>     complete. Service endpoint failure will result in blacklisting from a
>>>     Storm bolt perspective, and node failure should result in a new
>>>     container being started by the YARN application master. Beyond that,
>>>     the architecture isn't explicit.
>>>
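On the failure-mode point above, the caller-side blacklisting Casey mentions could look something like the sketch below. The cool-down period, class shape, and endpoint URLs are invented for illustration; the document only states that failed endpoints are blacklisted from the bolt's perspective and that the YARN application master restarts failed containers:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical caller-side blacklist: failed endpoints are skipped for a
// cool-down period and the caller rotates to the next known instance.
public class EndpointBlacklistSketch {

  private final List<String> endpoints = new ArrayList<>();
  private final Map<String, Long> blacklistedUntil = new HashMap<>();
  private final long blacklistMillis;

  public EndpointBlacklistSketch(List<String> endpoints, long blacklistMillis) {
    this.endpoints.addAll(endpoints);
    this.blacklistMillis = blacklistMillis;
  }

  /** Report a failed call; the endpoint is skipped until the cool-down expires. */
  public void markFailed(String endpoint) {
    blacklistedUntil.put(endpoint, System.currentTimeMillis() + blacklistMillis);
  }

  /** Return the first endpoint that is not currently blacklisted, or null. */
  public String pick() {
    long now = System.currentTimeMillis();
    for (String endpoint : endpoints) {
      Long until = blacklistedUntil.get(endpoint);
      if (until == null || until <= now) {
        return endpoint;
      }
    }
    return null; // all endpoints blacklisted; caller decides whether to wait or error
  }

  public static void main(String[] args) {
    EndpointBlacklistSketch picker = new EndpointBlacklistSketch(
        List.of("http://host-1:8765/score", "http://host-2:8765/score"), 30_000);
    String first = picker.pick();       // http://host-1:8765/score
    picker.markFailed(first);           // e.g. after a connection failure
    System.out.println(picker.pick());  // http://host-2:8765/score
  }
}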
>>>> On Mon, Jul 4, 2016 at 1:49 PM, James Sirota <[email protected]> wrote:
>>>>
>>>> I left a comment on the JIRA. I think your design is promising. One
>>>> other thing I would suggest is for us to crowd-source requirements around
>>>> model management. Specifically:
>>>>
>>>> Model versioning and retention
>>>> Model access controls management
>>>> Requirements around concept drift
>>>> Requirements around model output
>>>> Any model audit and logging requirements
>>>> What model metrics need to be exposed
>>>> Requirements around failure modes
>>>>
>>>> 03.07.2016, 14:00, "Casey Stella" <[email protected]>:
>>>>> Hi all,
>>>>>
>>>>> I think we are at the point where we should try to tackle Model as a
>>>>> service for Metron. As such, I created a JIRA and proposed an
>>>>> architecture for accomplishing this within Metron.
>>>>>
>>>>> My inclination is to be data science language/library agnostic and to
>>>>> provide a general-purpose REST infrastructure for managing and serving
>>>>> models trained on historical data captured from Metron. The assumption
>>>>> is that we are within the Hadoop ecosystem, so:
>>>>>
>>>>> - Models stored on HDFS
>>>>> - REST Model Services resource-managed via YARN
>>>>> - REST Model Services discovered via ZooKeeper
>>>>>
>>>>> I would really appreciate community comment on the JIRA (
>>>>> https://issues.apache.org/jira/browse/METRON-265). The proposed
>>>>> architecture is attached as a document to that JIRA.
>>>>>
>>>>> I look forward to feedback!
>>>>>
>>>>> Best,
>>>>>
>>>>> Casey
>>>>
>>>> -------------------
>>>> Thank you,
>>>>
>>>> James Sirota
>>>> PPMC - Apache Metron (Incubating)
>>>> jsirota AT apache DOT org
>
>-------------------
>Thank you,
>
>James Sirota
>PPMC - Apache Metron (Incubating)
>jsirota AT apache DOT org
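To make the last bullet of Casey's original proposal concrete, one possible shape for "REST Model Services discovered via ZooKeeper" is sketched below: each service instance registers an ephemeral znode containing its scoring URL, and callers list the registered instances. The znode paths and payload format are invented, and a real implementation might well use a higher-level library such as Curator instead of the raw ZooKeeper client:

import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical discovery sketch: register a model endpoint in ZooKeeper and
// list the live instances from the caller side.
public class ModelDiscoverySketch {

  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

    // Service side: advertise this instance (parent path assumed to already exist).
    zk.create("/metron/maas/dga-model/instance-",
        "http://host-1:8765/score".getBytes(StandardCharsets.UTF_8),
        ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL_SEQUENTIAL);

    // Caller side (e.g. an enrichment bolt): discover the registered endpoints.
    List<String> instances = zk.getChildren("/metron/maas/dga-model", false);
    for (String instance : instances) {
      byte[] url = zk.getData("/metron/maas/dga-model/" + instance, false, null);
      System.out.println("scoring endpoint: " + new String(url, StandardCharsets.UTF_8));
    }

    zk.close();
  }
}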
