I want to start a discussion about the current state of the ODM. I think that, because of various changes that were in progress when we truly started working on it, along with some miscommunication, the idea drifted, and we ended up with something that works but is not fundamentally scalable or workable for operational security. I'm starting this thread to lay out my concerns as best I understand them and get a conversation going.

1. Columnar implementation lacking true columnar architecture: Most of the early attempts to operationalize Spot ended up leveraging Impala and Parquet. The ease of table creation and the approachability of SQL made this appealing, but it injected drift into the ODM. Part of the reason for creating the ODM was to formalize nouns that represent fields in a single store, so that "ip" would mean the same thing wherever you saw it. The use of SQL ultimately led to a slow death of that idea, and we ended up with fields like "alert_ip". In my head I can hear some of you asking, "OK, as long as the model is formalized, why does this matter?" The reason is that searches at scale require scanning multiple fields to produce complete answers. For example, a query like "What are all of the IP addresses that have communicated with le...@apache.org?" would need to stitch together one or more queries that possibly join multiple tables and consider multiple fields. The benefit of a truly columnar architecture is that you simply request the single field from the primary operational source and let it loose.

2. Modeling considers sources but not enrichment and objects: To me, one of the dream benefits of using Apache big data tools for security is the ability to constantly crawl data and enrich it with new data as it lands. In 2015, when I started participating in this project, I had a hard time articulating why I felt enrichment would be so valuable. Having spent the last 2 years on various security projects that used Kafka queues to enrich and update other data, I now have a pretty clear explanation. The ability to understand how different sources fit together has always been a crucial skill for security operators, but that has only been the case because inline enrichment carried a computational and storage expense that made it illogical.
Today the ever-dropping cost of storage and the ever-improving performance of tools like Spark make this skill unnecessary, because we can automate tasks like joining the current user of a machine into a row as that data becomes available.

3. The ODM was supposed to make setting up your operational store turnkey: Documenting the ODM has certainly made using Spot easier, but I always hoped it would make it idiot-proof or, more precisely, me-proof. Currently the ODM is a guide more than it is a model. Originally we hoped that the ODM would turn into code as configuration: nouns defined in JSON, similar to the way Solr approaches field definitions.

    {
      "name": "ip",
      "display": { "title": "Host Name" },
      "min_len": "8",
      "type": "string"
    }
    {
      "name": "src",
      "display": { "title": "Source IP" },
      "min_len": "8",
      "type": "ip"
    }
    {
      "name": "dst",
      "display": { "title": "Destination IP" },
      "min_len": "8",
      "type": "ip"
    }

These fields could then be built into sources:

    {
      "device": {
        "manufacturer": "cisco",
        "model": 1354684
      },
      "messages": [
        {
          "title": "alert",
          "information": {
            "nouns": {
              "host": {
                "stored": true,
                "required": "yes",
                "extract": "some regex"
              }
            }
          }
        },
        {
          "title": "inform",
          "information": {
            "nouns": {
              "host": {
                "stored": true,
                "required": "yes",
                "extract": "some regex"
              }
            }
          }
        }
      ]
    }

Building these configurations as part of the repositories would create an institutional memory around source ingest, and would let us clearly articulate what the various fields are for in some forward-looking UI updates.

We are going to create an epic and a branch for this, but I wanted to open up discussion here.

Thanks,
Austin

PS: It's been great and exciting to see certain people become active in the project. Keep it up; we still believe in this.
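
As a rough illustration of the code-as-configuration idea described above (nouns defined once, sources declaring which nouns they extract), here is a minimal Python sketch. It is not existing Spot code. The noun names, the flattened "nouns" layout under each message, the sample regexes, and the helper names `load_nouns`, `unknown_nouns`, and `extract_row` are all assumptions made up for the example.

```python
import json
import re

# Hypothetical noun registry: one shared definition per field name.
NOUNS_JSON = r"""
[
  {"name": "host", "display": {"title": "Host Name"}, "type": "string"},
  {"name": "src",  "display": {"title": "Source IP"}, "type": "ip"},
  {"name": "dst",  "display": {"title": "Destination IP"}, "type": "ip"}
]
"""

# Hypothetical source definition: each message lists the nouns it extracts.
SOURCE_JSON = r"""
{
  "device": {"manufacturer": "cisco", "model": 1354684},
  "messages": [
    {
      "title": "alert",
      "nouns": {
        "host": {"stored": true, "required": "yes", "extract": "host=(\\S+)"},
        "src":  {"stored": true, "required": "yes", "extract": "src=(\\S+)"}
      }
    }
  ]
}
"""


def load_nouns(text):
    """Parse the noun registry and index it by noun name."""
    return {noun["name"]: noun for noun in json.loads(text)}


def unknown_nouns(source, nouns):
    """Return noun names a source uses that the registry does not define.

    A non-empty result is the kind of drift described above, for example
    a source introducing "alert_ip" instead of reusing a shared noun.
    """
    unknown = []
    for message in source["messages"]:
        for name in message["nouns"]:
            if name not in nouns and name not in unknown:
                unknown.append(name)
    return unknown


def extract_row(message, raw_line):
    """Apply each noun's regex to a raw line and build a row of stored nouns."""
    row = {}
    for name, rule in message["nouns"].items():
        match = re.search(rule["extract"], raw_line)
        if match is None:
            if rule.get("required") == "yes":
                raise ValueError(f"required noun {name!r} not found")
            continue
        if rule.get("stored"):
            row[name] = match.group(1)
    return row


if __name__ == "__main__":
    nouns = load_nouns(NOUNS_JSON)
    source = json.loads(SOURCE_JSON)
    print(unknown_nouns(source, nouns))
    print(extract_row(source["messages"][0], "alert host=web01 src=10.0.0.5"))
```

The point of the check in `unknown_nouns` is that a source definition cannot quietly invent its own field name: anything outside the shared registry is flagged before ingest, which is one way the model could be enforced by code instead of by documentation.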