Slack is a great place to have an interactive conversation, so when a couple of us are around, I'm up for it. This dev list is probably best for async discussions.
I think the Apache Confluence site is a good place to put some of the design documents that aren't really code to be committed to the repo. Looks like it is here:
https://cwiki.apache.org/confluence/display/SPOT/Apache+Spot+%28Incubating%29+Home

I'm still a neophyte on this, so I'm open to suggestions.

Ethan

On Tue, Feb 25, 2020 at 12:35 PM [email protected] <[email protected]> wrote:

> Hi, can we use Slack and start the discussion there?
> Thoughts?
> Regards.
>
> On Tue, Feb 18, 2020 at 5:37, Tadd Wood (<[email protected]>) wrote:
>
> > I always felt the ODM was rushed in its early design, and to your
> > earlier point, it's not surprising that the ODM's flat structure was
> > driven strongly by the desire to have Impala or other SQL
> > interpreters as a front-end to the data. Although I can appreciate
> > the approachability of SQL and understand the desire for that to
> > drive the data model design, many use-cases that were being executed
> > in reality required complex join-logic or nested subqueries that
> > didn't scale very well.
> >
> > Restructuring the design around nouns would make it much easier to
> > pivot when trying to express complex and evolving queries. It would
> > be crucial to also allow these nouns to be created with more
> > complex/nested structures, providing space for further enrichment of
> > events during ingestion or later on.
> >
> > I'm excited to be moving the ODM conversation forward, and hope
> > others chime in as well as we start to work through the planning
> > process.
> >
> > Thank you,
> > Tadd Wood
> >
> > On Fri, Feb 14, 2020 at 5:35 PM Austin Leahy <[email protected]> wrote:
> >
> > > I want to start a discussion about the current state of the ODM.
> > > I think that, because of changes that were in progress at the time
> > > we started truly working on it and various miscommunications, the
> > > idea kind of drifted off and we ended up with something that works
> > > but isn't fundamentally scalable and workable for operational
> > > security. I'm starting this thread to put forward my best
> > > understanding of my own concerns about this to facilitate a
> > > conversation.
> > >
> > > 1. Columnar implementation lacking true columnar architecture:
> > >
> > > Most of the attempts to operationalize Spot in the early days ended
> > > up leveraging Impala and Parquet. Because of the ease of table
> > > creation and SQL approachability this seemed appealing, but it
> > > injected drift into the ODM. Part of the desire to create the ODM
> > > was a desire to formalize nouns to represent fields in a single
> > > store so that "ip" would mean the same thing wherever you saw it.
> > > Because of the use of SQL this ultimately led to a slow death of
> > > that idea and we ended up with fields like "alert_ip".
> > >
> > > In my head I can hear some of you asking "ok well as long as the
> > > model is formalized why does this matter?"
> > >
> > > The reason is that searches at scale would require scanning
> > > multiple fields to produce complete answers. For example, a desire
> > > to query "What are all of the IP addresses that have communicated
> > > with [email protected]?" would need to stitch together one or more
> > > queries that possibly join multiple tables and need to consider
> > > multiple fields. The benefit of a truly columnar architecture is to
> > > simply request the single field from the primary operational source
> > > and let it loose.
> > >
> > >
> > > 2. Modeling considers sources but not enrichment and objects:
> > >
> > > To me one of the dream benefits of using Apache big data tools to
> > > do security is the ability to constantly crawl data and enrich it
> > > with new data that lands. In 2015 when I started participating in
> > > this project I had a hard time articulating why I felt that
> > > enrichment would be so valuable, but having participated in various
> > > security projects that used Kafka queues to enrich and update other
> > > data in the last 2 years, I have a pretty clear explanation.
> > >
> > > The ability to understand how different sources fit together has
> > > always been a crucial skill for security operators, but the reality
> > > is that this has only been the case because inline enrichment had a
> > > computational and storage expense that made it illogical. Today the
> > > ever dropping cost of storage and the ever improving performance of
> > > tools like Spark make this skill unnecessary because we can
> > > automate tasks like joining the current user of a machine into a
> > > row as that data becomes available.
> > >
> > >
> > > 3. The ODM was supposed to make setting up your operational store
> > > turnkey:
> > >
> > > Documenting the ODM has certainly made using Spot easier, but I
> > > always hoped it would make it idiot-proof or, more precisely,
> > > me-proof. Currently the ODM is a guide more than it is a model.
> > > Originally we hoped that the ODM would turn into code as
> > > configuration: nouns defined in JSON, similar to the way that Solr
> > > approaches field definitions.
> > >
> > > {
> > >   "name": "ip",
> > >   "display": {
> > >     "title": "Host Name",
> > >     "min_len": "8"
> > >   },
> > >   "type": "string"
> > > }
> > >
> > > {
> > >   "name": "src",
> > >   "display": {
> > >     "title": "Source IP",
> > >     "min_len": "8"
> > >   },
> > >   "type": "ip"
> > > }
> > >
> > > {
> > >   "name": "dst",
> > >   "display": {
> > >     "title": "Destination IP",
> > >     "min_len": "8"
> > >   },
> > >   "type": "ip"
> > > }
> > >
> > > These fields could then be built into sources:
> > >
> > > {
> > >   "device": {
> > >     "manufacturer": "cisco",
> > >     "model": 1354684
> > >   },
> > >   "messages": [
> > >     {
> > >       "title": "alert",
> > >       "information": {
> > >         "nouns": {
> > >           "host": {
> > >             "stored": true,
> > >             "required": "yes",
> > >             "extract": "some regex"
> > >           }
> > >         }
> > >       }
> > >     },
> > >     {
> > >       "title": "inform",
> > >       "information": {
> > >         "nouns": {
> > >           "host": {
> > >             "stored": true,
> > >             "required": "yes",
> > >             "extract": "some regex"
> > >           }
> > >         }
> > >       }
> > >     }
> > >   ]
> > > }
> > >
> > > Building these configurations as part of the repositories would
> > > facilitate an institutional memory around source ingest as well as
> > > an ability to clearly articulate what various fields are for some
> > > forward looking UI updates.
> > >
> > > We are going to create an epic and branch for this but I wanted to
> > > open up discussions here.
> > >
> > > Thanks
> > > Austin
> > >
> > > PS it's been great and exciting to see certain people become active
> > > in the project. Keep it up, we still believe in this.
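
To make the "nouns built into sources" idea from the quoted message a bit more concrete, here is a minimal Python sketch of how such definitions might drive ingestion. Nothing here comes from the Spot codebase: the names NOUNS, SOURCE, validate and extract are hypothetical, the regexes are invented stand-ins for the "some regex" placeholders, and the config is simplified (the "information" level is dropped and "required" is a boolean) purely for illustration.

```python
import ipaddress
import re

# Hypothetical noun registry: one shared definition per noun, so "src"
# means the same thing in every source that uses it.
NOUNS = {
    "src": {"title": "Source IP", "type": "ip"},
    "dst": {"title": "Destination IP", "type": "ip"},
    "host": {"title": "Host Name", "type": "string"},
}

# Hypothetical source definition; the regexes are assumptions made up
# for this sketch, not real device log formats.
SOURCE = {
    "device": {"manufacturer": "cisco", "model": 1354684},
    "messages": [
        {
            "title": "alert",
            "nouns": {
                "src": {"required": True, "extract": r"src=(\S+)"},
                "dst": {"required": True, "extract": r"dst=(\S+)"},
                "host": {"required": False, "extract": r"host=(\S+)"},
            },
        }
    ],
}


def validate(noun, value):
    """Check a value against the type declared for its noun."""
    if NOUNS[noun]["type"] == "ip":
        ipaddress.ip_address(value)  # raises ValueError on a bad address
    return value


def extract(message_def, raw):
    """Pull every configured noun out of one raw log line."""
    row = {}
    for noun, rule in message_def["nouns"].items():
        match = re.search(rule["extract"], raw)
        if match:
            row[noun] = validate(noun, match.group(1))
        elif rule["required"]:
            raise ValueError(f"missing required noun: {noun}")
    return row


row = extract(SOURCE["messages"][0], "alert src=10.0.0.5 dst=192.168.1.9")
print(row)  # {'src': '10.0.0.5', 'dst': '192.168.1.9'}
```

Because every source writes into the same noun names, a question like "which IPs talked to X" only has to look at "src" and "dst", rather than at per-source fields such as "alert_ip".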
