Re: Moving towards ODM v2

Ethan Pemberton Tue, 25 Feb 2020 17:20:45 -0800

Slack is a great place to have an interactive conversation, so when a
couple of us are around, I'm up for it. This dev list is probably best for
async discussions.


I think the Apache Confluence site is a good place to put some of the
design documents that aren't really code to be committed to the repo.

Looks like it is here:
https://cwiki.apache.org/confluence/display/SPOT/Apache+Spot+%28Incubating%29+Home

I'm still a neophyte on this so open to suggestions.

Ethan

On Tue, Feb 25, 2020 at 12:35 PM [email protected] <[email protected]>
wrote:

> Hi, can we use Slack and start the discussion there?
> Thoughts?
> Regards.
>
> On Tue, Feb 18, 2020 at 5:37, Tadd Wood (<[email protected]>)
> wrote:
>
> > I always felt the ODM was rushed in its early design, and to your earlier
> > point, it's not surprising that the ODM's flat structure was driven
> > strongly by the desire to have Impala or other SQL interpreters as a
> > front-end to the data. Although I can appreciate the approachability of
> > SQL and understand the desire for that to drive the data model design,
> > many use-cases that were being executed in reality required complex
> > join-logic or nested subqueries that didn't scale very well.
> >
> > Restructuring the design around nouns would make it much easier to pivot
> > when trying to express complex and evolving queries. It would be crucial
> > to also allow these nouns to be created with more complex/nested
> > structures, providing space for further enrichment of events during
> > ingestion or later on.
> >
> > I'm excited to be moving the ODM conversation forward, and hope others
> > chime in as well as we start to work through the planning process.
> >
> > Thank you,
> > Tadd Wood
> >
> > On Fri, Feb 14, 2020 at 5:35 PM Austin Leahy <[email protected]> wrote:
> >
> > > I want to start a discussion about the current state of the ODM. I think
> > > that, because of various changes that were in progress when we started
> > > truly working on it and various miscommunications, the idea drifted and
> > > we ended up with something that works but isn't fundamentally scalable
> > > or workable for operational security. I'm starting this thread to put
> > > forward my best understanding of my own concerns about this, to
> > > facilitate a conversation.
> > >
> > > 1. Columnar implementation lacking true columnar architecture:
> > >
> > > Most of the attempts to operationalize Spot in the early days ended up
> > > leveraging Impala and Parquet. Because of the ease of table creation and
> > > SQL approachability this seemed appealing, but it injected drift into
> > > the ODM. Part of the desire to create the ODM was a desire to formalize
> > > nouns to represent fields in a single store, so that "ip" would mean the
> > > same thing wherever you saw it. The use of SQL ultimately led to a slow
> > > death of that idea, and we ended up with fields like "alert_ip".
> > >
> > > In my head I can hear some of you asking "ok well as long as the model
> > > is formalized why does this matter?"
> > >
> > > The reason is that searches at scale would require scanning multiple
> > > fields to produce complete answers. For example, a query like "What are
> > > all of the IP addresses that have communicated with [email protected]?"
> > > would need to stitch together one or more queries that possibly join
> > > multiple tables and consider multiple fields. The benefit of a truly
> > > columnar architecture is that you simply request the single field from
> > > the primary operational source and let it loose.
> > >
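A rough sketch of that difference, in Python rather than SQL. The field names `alert_ip`, `src_ip`, and `dst_ip`, the sample events, the user `alice`, and both helper functions are made up for illustration; none of them come from the current ODM.

```python
# Hypothetical events under a drifted schema: the same concept (an IP
# address seen with this user) is spread across several field names.
DRIFTED_EVENTS = [
    {"user": "alice", "alert_ip": "10.0.0.5"},
    {"user": "alice", "src_ip": "10.0.0.7", "dst_ip": "192.168.1.9"},
    {"user": "bob", "src_ip": "10.0.0.8"},
]

# The same events with a single formalized noun, "ip", holding every address.
NOUN_EVENTS = [
    {"user": "alice", "ip": ["10.0.0.5"]},
    {"user": "alice", "ip": ["10.0.0.7", "192.168.1.9"]},
    {"user": "bob", "ip": ["10.0.0.8"]},
]

# This list has to grow every time a new source invents a new field name.
IP_FIELDS = ("alert_ip", "src_ip", "dst_ip")


def ips_for_user_drifted(events, user):
    """Scan every known IP-like field; misses any field not in IP_FIELDS."""
    found = set()
    for event in events:
        if event.get("user") != user:
            continue
        for field in IP_FIELDS:
            if field in event:
                found.add(event[field])
    return found


def ips_for_user_noun(events, user):
    """Read the single "ip" noun; no per-source field list to maintain."""
    found = set()
    for event in events:
        if event.get("user") == user:
            found.update(event.get("ip", []))
    return found
```

Both functions return the same set for the sample data, but only the first one silently breaks when a source adds a field that nobody registered.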
> > >
> > > 2. Modeling considers sources but not enrichment and objects:
> > >
> > > To me, one of the dream benefits of using Apache big data tools to do
> > > security is the ability to constantly crawl data and enrich it with new
> > > data that lands. In 2015, when I started participating in this project, I
> > > had a hard time articulating why I felt that enrichment would be so
> > > valuable, but having participated in various security projects in the
> > > last 2 years that used Kafka queues to enrich and update other data, I
> > > have a pretty clear explanation.
> > >
> > > The ability to understand how different sources fit together has always
> > > been a crucial skill for security operators, but the reality is that
> > > this has only been the case because inline enrichment had a
> > > computational and storage expense that made it illogical. Today the
> > > ever-dropping cost of storage and the ever-improving performance of
> > > tools like Spark make this skill unnecessary, because we can automate
> > > tasks like joining the current user of a machine into a row as that data
> > > becomes available.
> > >
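As a minimal sketch of that kind of enrichment, here is plain Python standing in for what a Spark or Kafka job would do at scale. The record shapes (`host`, `ts`, `user`) and the function names are assumptions for illustration, not anything defined by Spot.

```python
from bisect import bisect_right


def build_login_index(logins):
    """Group login records by host, with timestamps sorted ascending."""
    index = {}
    for login in sorted(logins, key=lambda rec: rec["ts"]):
        times, users = index.setdefault(login["host"], ([], []))
        times.append(login["ts"])
        users.append(login["user"])
    return index


def enrich_with_current_user(event, login_index):
    """Return a copy of the event with the latest user logged in on its host.

    "Latest" means the most recent login at or before the event timestamp;
    the value is None when no such login is known yet.
    """
    enriched = dict(event)
    times, users = login_index.get(event["host"], ([], []))
    pos = bisect_right(times, event["ts"])  # count of logins at or before ts
    enriched["current_user"] = users[pos - 1] if pos else None
    return enriched
```

Rows that come back with `current_user` set to `None` could be re-run later, once the login data for that host lands.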
> > >
> > > 3. The ODM was supposed to make setting up your operational store turnkey:
> > >
> > > Documenting the ODM has certainly made using Spot easier, but I always
> > > hoped it would make it idiot-proof or, more precisely, me-proof.
> > > Currently the ODM is a guide more than it is a model. Originally we hoped
> > > that the ODM would turn into code as configuration: nouns defined in
> > > JSON, similar to the way that Solr approaches field definitions.
> > >
> > > {
> > >   "name": "ip",
> > >   "display": {
> > >     "title": "Host Name",
> > >     "min_len": "8"
> > >   },
> > >   "type": "string"
> > > }
> > >
> > > {
> > >   "name": "src",
> > >   "display": {
> > >     "title": "Source IP",
> > >     "min_len": "8"
> > >   },
> > >   "type": "ip"
> > > }
> > >
> > > {
> > >   "name": "dst",
> > >   "display": {
> > >     "title": "Destination IP",
> > >     "min_len": "8"
> > >   },
> > >   "type": "ip"
> > > }
> > >
> > >
> > > These fields could then be built into sources:
> > >
> > > {
> > >   "device": {
> > >     "manufacturer": "cisco",
> > >     "model": 1354684,
> > >     "messages": [
> > >       {
> > >         "title": "alert",
> > >         "information": {
> > >           "nouns": {
> > >             "host": {
> > >               "stored": true,
> > >               "required": "yes",
> > >               "extract": "some regex"
> > >             }
> > >           }
> > >         }
> > >       },
> > >       {
> > >         "title": "inform",
> > >         "information": {
> > >           "nouns": {
> > >             "host": {
> > >               "stored": true,
> > >               "required": "yes",
> > >               "extract": "some regex"
> > >             }
> > >           }
> > >         }
> > >       }
> > >     ]
> > >   }
> > > }
> > >
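One way a source definition like this might be consumed at ingest time, as a sketch only. It assumes `messages` is a list of message definitions, that `extract` holds a regex with one capture group, and that `required` is the string "yes"; the sample regex and the log line format are invented stand-ins for the "some regex" placeholder.

```python
import json
import re

# Hypothetical source definition following the shape sketched in the thread.
SOURCE_JSON = """
{
  "device": {
    "manufacturer": "cisco",
    "model": 1354684,
    "messages": [
      {
        "title": "alert",
        "information": {
          "nouns": {
            "host": {
              "stored": true,
              "required": "yes",
              "extract": "host=([A-Za-z0-9.-]+)"
            }
          }
        }
      }
    ]
  }
}
"""


def extract_nouns(message_def, raw_line):
    """Apply each noun's "extract" regex to a raw log line.

    Returns a dict of the nouns marked as stored. Raises ValueError when a
    noun marked required cannot be found in the line.
    """
    result = {}
    for noun, rule in message_def["information"]["nouns"].items():
        match = re.search(rule["extract"], raw_line)
        if match:
            if rule.get("stored"):
                result[noun] = match.group(1)
        elif rule.get("required") == "yes":
            raise ValueError(f"required noun {noun!r} not found")
    return result


source = json.loads(SOURCE_JSON)
alert_def = source["device"]["messages"][0]
```

Calling `extract_nouns(alert_def, line)` on each incoming line would then yield rows keyed by the shared noun names rather than by source-specific field names.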
> > > Building these configurations as part of the repositories would
> > > facilitate an institutional memory around source ingest, as well as an
> > > ability to clearly articulate what various fields are for some
> > > forward-looking UI updates.
> > >
> > > We are going to create an epic and branch for this but I wanted to open
> > > up discussions here.
> > >
> > > Thanks Austin
> > >
> > > PS: It's been great and exciting to see certain people become active in
> > > the project. Keep it up; we still believe in this.
> > >
> >
>
