Hi Matt,

First off, thanks for the great response. I'm going to try to keep this one brief, but there's a lot of nuance here that I want to make sure comes across.
We need to store long not_analyzed fields because (1) wildcards <https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html> are only available on not_analyzed fields in Elasticsearch (because they function at the term level <https://www.elastic.co/guide/en/elasticsearch/guide/current/_wildcard_and_regexp_queries.html>), and (2) proximity matching <https://www.elastic.co/guide/en/elasticsearch/guide/current/proximity-matching.html> and ordering of terms is not straightforward or available for string fields (a way to fix this would be a custom analyzer; the standard analyzer doesn't apply here). For instance, when trying to find a specific web shell that's being used on your network, you may just want to search for a URI that contains "c99shell.php" or "r57shell.php", which is a much faster search with an analyzed field. You may also have an APT that planted a specific shell with a certain name or on a certain path on your system, so you'd want to search for "/this/is/a/bad/shell.php*", which is only possible with a not_analyzed field (it accounts for query strings, etc.); a rough query sketch follows below. Other fields have similar scenarios, such as searching a cookie value for Shellshock.

It's also important to note that I'm not currently considering HDFS data "reasonable" for a SOC analyst to access within Metron at this time, but one of the things that I am proposing is that we make it reasonable, or at least start to lay the foundation so it can be reasonable in the future.

For your follow-up to (a), I think that the timestamp + 32K string prefix wouldn't be a bad indicator of a unique record if it is only being used to connect HDFS to Lucene. However, my understanding is that there is already a unique identifier being used upstream which just isn't persisted (I would defer to others on this one, this is just hearsay...), and if that exists, I'd vote to just persist it and use that. I would also argue that (1) incredibly high data integrity is important for both the security use case and data at this scale, and (2) these fields could be intentionally manipulated by the client to avoid indexing.

For your follow-up to (b), I agree that this is another avenue we could take. Personally, because of the relative infrequency of this event (when not intentionally abused) and the fact that we may have people directly querying Elasticsearch (and not using the UI), I prefer a truncated field which then has an obvious indicator in the JSON like truncated:true, but I guess we could also have a field_split:true which performs similarly. Probably the best reason I can give for preferring the former is that it keeps the amount of data in Elasticsearch smaller and trusts HDFS to be the true, large ("full") datastore at an architectural level. I could see an argument that it makes out-of-band querying/third-party UIs slightly more complicated, because they would need to query both Elasticsearch and HDFS, as opposed to doing a more complicated query against Elasticsearch only.

Regarding multi-field, you're exactly right. In my last-minute editing of the first email I forgot to add that it would need to be truncated as well.

Finally, I'm not sure I follow your last paragraph. What does "other records will simply lack a URI field" mean, exactly? I don't see a scenario where that is necessary, unless you are referring to non-(Bro+HTTP) records.
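To make the analyzed-versus-not_analyzed distinction concrete, here is a minimal sketch (not Metron code) of the two query shapes, assuming a hypothetical "bro_http" index in which "uri" is analyzed and "uri.raw" is a not_analyzed sub-field:

```python
# Illustrative only: index and field names are hypothetical, using the
# elasticsearch-py client against a local cluster.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Analyzed field: fast term lookup, good for "does any URI mention c99shell.php?"
analyzed_hits = es.search(index="bro_http", body={
    "query": {"match": {"uri": "c99shell.php"}}
})

# not_analyzed sub-field: the wildcard operates on the whole value as one term,
# so a known path plus a trailing "*" matches regardless of query strings.
wildcard_hits = es.search(index="bro_http", body={
    "query": {"wildcard": {"uri.raw": "/this/is/a/bad/shell.php*"}}
})

print(analyzed_hits["hits"]["total"], wildcard_hits["hits"]["total"])
```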
I think that this does have to be federated into search, because (1) this field is under the control of the client (potentially an adversary) and thus can be purposefully abused, and (2) if Metron goes 1.0 without fully resolving this issue (including in the UIs), it could significantly impact uptake and reliance.

Jon

On Mon, Oct 31, 2016 at 6:28 PM Matt Foley <[email protected]> wrote:

> Hi Jon, interesting topic. A couple of questions come to mind:
>
> What is the reason we need to store very long not_analyzed values in the Index (which evidently is not its design sweet spot)? Is it because:
> a. It is valuable to be able to Search against the whole string even when that string is > 32K?
> b. If HDFS is not also being used, the Index is the only source of historical truth for these records?
>
> If the answer is (a), the follow-up question is: Isn't it safe to assume that there's a timestamp near the beginning of the raw string? So if we provided a 32K prefix string match, including the timestamp, wouldn't that be pretty much as good? Or are there lots of cases where the first 32K+, including the timestamp, are truly the same, yet they differ after 32K of text? If a 32K prefix string match is sufficient for 99% of cases, then a fixed-length truncation limit slightly less than 32K, on both Index and Search, will suffice. This is essentially your approach #2 – and it's simple.
>
> If the answer is (b), then we can be satisfied with any approach that splits the very long strings up somehow and stores them all in a way that allows reassembling them in the correct sequence. It does require a reformulator for querying in the UI, as you note in your "Concerns - Thoughts #1" below.
>
> I'm probably misunderstanding, but I don't see how multi-field helps. According to the elastic.co doc you reference, multi-field allows storing and searching both the analyzed and not_analyzed sub-fields without doubling the storage size (which is clearly very useful), but the not_analyzed sub-field should still have the 32K limit. Is this not so? Or are you proposing that a multi-field mapping could encapsulate the several sub-strings needed to contain a >32K string? E.g., as "raw", "raw1", "raw2", etc., where each is <32K?
>
> A sub-case of approach #2, relating to your second "Other Thoughts", would be: always truncate the indexed string to slightly less than 32K, but store the full value of any such string in HDFS, and include in the Index a reference (file URI with offset) that allows retrieving it. This solution can be limited to just the >32K strings, so other records will simply lack a URI field. And it doesn't have to be federated into Search as you suggest: the 32K prefix string Search should be quite adequate as suggested above, and then the whole string can be read from HDFS if needed for historical reasons.
>
> Cheers,
> --Matt
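To sketch Matt's fixed-length truncation idea (approach #2) concretely: if the same byte-length cap is applied at index time and at query time, an exact match on the truncated value still lines up even for >32K originals. This is purely illustrative; the constant and helper are made up, and the only assumption carried over from the thread is that Lucene's 32766 limit applies to the UTF-8 encoded bytes of a term.

```python
# Illustrative sketch of "truncate to slightly less than 32K on both Index and Search".
MAX_TERM_BYTES = 32000  # safety margin below Lucene's 32766-byte hard limit

def truncate_term(value: str, limit: int = MAX_TERM_BYTES) -> str:
    """Truncate a string so its UTF-8 encoding fits within `limit` bytes."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # Drop trailing bytes, discarding any partially cut multi-byte character.
    return encoded[:limit].decode("utf-8", errors="ignore")

# Apply the same truncation to the stored value and to whatever the analyst pastes.
raw_uri = "/this/is/a/very/long/uri?" + "A" * 40000   # pretend >32K field value
pasted_query = raw_uri                                 # analyst copies the full string

assert truncate_term(raw_uri) == truncate_term(pasted_query)
```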
> On 10/31/16, 1:38 PM, "[email protected]" <[email protected]> wrote:
>
> Hi All,
>
> I've been doing a bit of bug hunting with Bro logs within the search tier of Metron over the last couple of weeks. During this (in this thread I'm primarily referring to METRON-517 <https://issues.apache.org/jira/browse/METRON-517>) I've found a couple of issues that I'd like to discuss with the wider population. The primary issue here is due to a limitation in Lucene itself, meaning we will encounter the same problem with either Elasticsearch or Solr as far as I can tell.
>
> *Problems*
>
> 1. Lucene, including the latest version (6.2.1), appears to have a hard-coded maximum term length of 32766 (see the IndexWriter.MAX_TERM_LENGTH constant <https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH> and this code search <https://github.com/apache/lucene-solr/search?utf8=%E2%9C%93&q=32766> for details). If the indexingBolt attempts to input a not_analyzed <https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2> tuple (set via these files <https://github.co>) which exceeds that limit, the entire message is rejected.
> - If you simply analyze the field, it reduces the size of any individual term, but it also throws a wrench in your queries when you are searching for a match on that entire field.
>
> 2. From what I can tell, failures are only logged via /var/log/elasticsearch/metron.log and in the Storm UI under the Bolt Id's "Last error" column.
> - It looks like this is already partially documented as METRON-307 <https://issues.apache.org/jira/browse/METRON-307>.
>
> From here on out I'm going to focus on Problem 1.
>
> *Thoughts*
>
> 1. We could use multi-field <https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html> mappings to be able to do both a full and partial search for fields that exceed 32766 in length.
>
> 2. Truncate fields in the indexingBolt to keep not_analyzed values below the 32766 limit.
>
> 3. Ensure that any field with the ability to grow beyond the 32766 limit is analyzed, and that no single term surpasses the max term limit.
>
> There are some other ways to fix the problem, such as not storing the field <https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html>, not indexing the field <https://www.elastic.co/guide/en/elasticsearch/refe>, or ignoring fields larger than a set value <https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html>, but I personally see these as confusing (to the end user) and not very helpful. Others have brought up dynamic templates <https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html> as well, but I haven't looked into them yet.
>
> *Concerns*
>
> Thought #1
>
> - My current favourite (this is also what Logstash does), but it requires that we analyze the whole message and store a truncated version of the whole message as a single large term. If truncation occurs we would need to do the following (a rough mapping sketch follows this list):
> - Add key-value pairs to the tuple that indicate that it was truncated, what field(s) was/were truncated, the pre-truncation size of the field(s), a hash of the pre-truncation field, and a timestamp of truncation (i.e., to help detect data tampering).
> - Provide UI elements that clearly show that a specific message was truncated.
> - May need to abstract querying in the UI. If so, this requires a sub-task to METRON-195 and looking into an interim solution with Kibana.
> - See "Other thoughts".
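As a rough illustration of Thought #1 plus the truncation metadata bullets above (not the actual Metron template — the index name, type name, field names, and metadata keys are all made up, and the mapping uses ES 2.x-era "string"/"not_analyzed" syntax):

```python
# Illustrative only: hypothetical names throughout, using the elasticsearch-py client.
import hashlib
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
MAX_RAW_BYTES = 32000  # stay safely under Lucene's 32766-byte term limit

# Multi-field mapping: "uri" is analyzed for partial search, "uri.raw" is a
# not_analyzed copy of the same (possibly truncated) value for whole-value search.
es.indices.create(index="bro_http_example", body={
    "mappings": {
        "bro_doc": {
            "properties": {
                "uri": {
                    "type": "string",
                    "fields": {
                        "raw": {"type": "string", "index": "not_analyzed"}
                    }
                }
            }
        }
    }
})

def with_truncation_metadata(doc, field):
    """If `field` is too long, truncate it and record the facts of the truncation."""
    encoded = doc[field].encode("utf-8")
    if len(encoded) > MAX_RAW_BYTES:
        doc[field + "_original_length"] = len(encoded)
        doc[field + "_original_hash"] = hashlib.sha256(encoded).hexdigest()
        doc[field] = encoded[:MAX_RAW_BYTES].decode("utf-8", errors="ignore")
        doc["truncated"] = True
        doc["truncated_fields"] = [field]
        doc["truncation_timestamp"] = int(time.time() * 1000)
    return doc

doc = with_truncation_metadata({"uri": "/index.php?" + "A" * 40000}, "uri")
es.index(index="bro_http_example", doc_type="bro_doc", body=doc)
```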
> Thought #2
>
> - If we go this path we'd need to address how to do a full string match (i.e., abstract a copy/paste of a >32766-length URI to use as a query in the UI). This may or may not be possible with Kibana – if not, this needs to be a subtask of METRON-195.
> - Add key-value pairs to the tuple that indicate that it was truncated, what field(s) was/were truncated, the pre-truncation size of the field(s), a hash of the pre-truncation field, and a timestamp of truncation (i.e., to help detect data tampering).
> - Provide UI elements that clearly show that a specific message was truncated.
> - See "Other thoughts".
>
> Thought #3
>
> - Not a huge fan of this solution because of how it affects whole-string matching.
> - May need a custom analyzer to cut up the URI properly. Some relevant materials if we go that path: <https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html>, <https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis->, <http://docs.oracle.com/javase/7/docs/api/java/net/URL.html>, <http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html>, <https://tools.ietf.org/html/rfc398>, <https://tools.ietf.org/html/rfc3986#section-3>, <http://download.java.net/jdk7/archive/b123/docs/api/java/net/URI.html>, <http://www.regexplanet.com/advanced/java/index.html>, <http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html>.
>
> *Other thoughts*
>
> - Maybe we can use the profiler to monitor truncated:true and watch for people messing with this on purpose.
> - We could add a persistent UUID to every tuple and map HDFS against Elasticsearch data. This could be used by a UI/frontend to query across both datastores. It would be very useful in the case of truncation: provide a configurable setting that is false by default but, if set to true, queries HDFS for any record that has truncated:true in the indexed store. (A rough correlation sketch follows this list.)
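The persistent-UUID bullet above could work roughly like the sketch below. Everything here is hypothetical (the "guid" field name, the uuid4 choice, and the stand-in for an HDFS lookup); it only illustrates tagging each tuple once and falling back to the full copy when the indexed copy says truncated:true.

```python
# Hypothetical sketch: one persistent identifier shared by the HDFS and index copies.
import uuid

def tag_with_uuid(message: dict) -> dict:
    """Assign a persistent identifier before the message is written anywhere."""
    message.setdefault("guid", str(uuid.uuid4()))
    return message

def resolve_full_record(indexed_doc: dict, hdfs_lookup) -> dict:
    """Fall back to the full (HDFS) copy only when the indexed copy was truncated."""
    if indexed_doc.get("truncated"):
        return hdfs_lookup(indexed_doc["guid"])
    return indexed_doc

# Toy usage: the "HDFS" side is just a dict keyed by guid for illustration.
full_store = {}
msg = tag_with_uuid({"uri": "/index.php?" + "A" * 40000})
full_store[msg["guid"]] = dict(msg)                            # full copy (HDFS role)
indexed = dict(msg, uri=msg["uri"][:32000], truncated=True)    # truncated copy (index role)

assert resolve_full_record(indexed, full_store.get)["uri"] == msg["uri"]
```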
>
> I have more thoughts but this has gotten more than long enough already and I wanted to send it off today. Thoughts?
> Jon
> --
> Jon

--
Jon