I’ll ping a couple people I know, too. Can’t guarantee a prompt response :-)
You said the current solution actually rejects indexing any message > 32K. That seems unfortunate. But I would consider a completely “dumb” truncation solution (simplest form of your Solution #2, even without any warnings or indicators) to be very acceptable. After all, even the current situation deals with the vast majority of messages, and truncating would deal with 99% of use cases for the remainder. Speaking loosely of course! --Matt On 11/3/16, 11:25 AM, "[email protected]" <[email protected]> wrote: I totally agree - unfortunately I wasn't able to find anybody who publicly posted a similar enough scenario to make use of. I did reach out to some people on the elastic and lucene teams for validation of what I put forward in METRON-542 and got a thumbs up, but I'm always ok with more review if you happen to have some contacts. I will say that I'm comfortable with the current solution, especially if I'm able to realize some benefit to monitoring abuse with the profiler. What do you think? Jon On Thu, Nov 3, 2016 at 2:15 PM Matt Foley <[email protected]> wrote: > Right, for both (1) and (2) there is the problem that if you happen to > break the long string in the middle of a substring that you want to use for > a search term, you don’t see anything with the non-analyzed string search, > and may not see anything with the analyzed string search if you broke or > truncated in the middle of a token instead of between tokens. And agree > entirely with you about the needs for full data integrity as an option for > those willing to spend the storage. > > > > Agree this is a hard problem. The best response I can think of is that > this has to be a previously solved problem (or perhaps a > previously-worked-around problem, since this looks like it may not have a > complete solution). We really should consult with ES/Solr experts about > best practice in these cases. I’m not one, altho I’ve done a couple > projects. 
Of course if you are one, please don’t take this wrong :-) But > I would definitely welcome input from folks with a lot more experience than > I in full text indexing of large documents specifically. > > > > I guess my take on it is to stop thinking of such large bodies of text as > a “field” and start thinking of it as a “document”, where things like > ordered proximity matching have known (if imperfect) solutions. > > > > Thanks, > > --Matt > > > > > > On 11/1/16, 6:36 AM, "[email protected]" <[email protected]> wrote: > > > > Hi Matt, > > > > First off, thanks for the great response. I'm going to try and keep > this > > one brief but there's a lot of nuance here that I want to make sure is > > obvious. > > > > We need to store long not_analyzed fields because (1) wildcards > > <https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html> > > are only available on not_analyzed fields in Elasticsearch (because > > they function > > at the term level > > <https://www.elastic.co/guide/en/elasticsearch/guide/current/_wildcard_and_regexp_queries.html>), > > and (2) proximity matching > > <https://www.elastic.co/guide/en/elasticsearch/guide/current/proximity-matching.html> > > and ordering terms are not straightforward/available for string fields > (a way > > to fix this would be a custom analyzer (the standard analyzer doesn't > apply > > here)). For instance, when trying to find a specific web shell that's > > being used on your network, you may just care to search for a URI that > > contains "c99shell.php" or "r57shell.php", which is a much faster search > > with an analyzed field. You may also have an APT that planted a > specific > > shell with a certain name on your system, or a certain path, so you'd > want > > to search "/this/is/a/bad/shell.php*", as is only allowed with a not_analyzed > > field (accounts for query strings, etc.). 
Other fields have similar > > scenarios, such as searching a cookie value for shellshock, etc. > > > > It's also important to note that I'm not currently considering HDFS > data > > "reasonable" for a SOC analyst to access within Metron at this time, > but > > one of the things that I am proposing is that we make it reasonable, > or at > > least start to lay the foundation so it can be reasonable in the > future. > > > > For your follow-up to (a), I think that the timestamp + 32K string > prefix > > wouldn't be a bad indicator of a unique record, if it is only being > used to > > connect HDFS to Lucene. However, my understanding is that there is > already > > a unique indicator being used upstream which just isn't persisted (I > would > > defer to others on this one, this is just hearsay...) and if that > exists, > > I'd vote to just persist it and use that. I would also argue both > that (1) > > incredibly high integrity of data is important for both the security > use > > case and data at this scale, and (2) these fields could intentionally be > > manipulated by the client to avoid indexing. > > > > For your follow-up to (b), I agree that this is another avenue we could > > take. Personally, because of the relative infrequency of this event > (when > > not intentionally abused) and the fact that we may have people directly > > querying elasticsearch (and not using the UI) I prefer a truncated > field > > which then has an obvious indicator in the JSON like truncated:true, but I > > guess we could also have a field_split:true which performs similarly. > > Probably the best reason why I can say that I prefer the former is to > keep > > the amount of data in Elasticsearch smaller and trust HDFS to be the > true, > > large ("full") datastore at an architectural level. 
I could see an > > argument that it makes out-of-band querying/third-party UIs slightly > more > > complicated because they would need to query both Elasticsearch and > HDFS, > > as opposed to doing a more complicated query but only to Elasticsearch. > > > > Regarding multi-field, you're exactly right. In my last minute > editing of > > the first email I forgot to add that it would need to be truncated as > well. > > > > > > Finally, I'm not sure I follow your last paragraph. What does "other > > records will simply lack a URI field" mean, exactly? I don't see a > > scenario where that is necessary, unless you are referring to > > non-(Bro+HTTP) records. I think that this does have to be federated > into > > search, because (1) this field is under the control of the client > > (potentially an adversary) and thus can be purposefully abused, and > (2) if > > Metron goes 1.0 without fully resolving this issue (including in the > UIs), > > it could significantly impact uptake and reliance. > > > > Jon > > > > On Mon, Oct 31, 2016 at 6:28 PM Matt Foley <[email protected]> wrote: > > > > > Hi Jon, interesting topic. A couple questions come to mind: > > > > > > What is the reason we need to store very long non_analyzed values in > the > > > Index (which evidently is not its design sweet spot)? Is it because: > > > a. It is valuable to be able to Search against the whole string even > when > > > that string is > 32K? > > > b. If HDFS is not also being used, the Index is the only source of > > > historical truth for these records? > > > > > > If the answer is (a), the follow-up question is: Isn’t it safe to > assume > > > that there’s a timestamp near the beginning of the raw string? So > if we > > > provided a 32K prefix string match, including the timestamp, > wouldn’t that > > > be pretty much as good? Or are there lots of cases where the first > 32K+, > > > including the timestamp, are truly the same, yet they differ after > 32K of > > > text? 
If 32K prefix string match is sufficient for 99% of cases, > then a > > > fixed-length truncation limit slightly less than 32K, on both Index > and > > > Search, will suffice. This is essentially your approach #2 – and > it’s > > > simple. > > > > > > If the answer is (b), then we can be satisfied with any approach that > > > splits the very long strings up somehow and stores them all in a way > that > > > allows reassembling them in the correct sequence. It does require a > > > reformulator for querying in the UI, as you note in your “Concerns - > > > Thoughts #1” below. > > > > > > I’m probably misunderstanding, but I don’t see how multi-field helps. > > > According to the elastic.co doc you reference, multi-field allows > storing > > > and searching both the analyzed and not_analyzed sub-fields without > > > doubling the storage size (which is clearly very useful), but the > > > non_analyzed sub-field should still have the 32K limit. Is this not > so? > > > Or are you proposing that a multi-field mapping could encapsulate the > > > several sub-strings needed to contain a >32K string? Eg, as “raw”, > “raw1”, > > > “raw2”, etc., where each is <32K ? > > > > > > A sub-case of approach #2, relating to your second “Other Thoughts”, > would > > > be: > > > Always truncate the indexed string to slightly less than 32K, but > store > > > the full value of any such string in HDFS, and include in the Index a > > > reference (file URI with offset) that allows retrieving it. This > solution > > > can be limited to just the >32K strings, so other records will > simply lack > > > a URI field. And it doesn’t have to be federated into Search as you > > > suggest: The 32K prefix string Search should be quite adequate as > > > suggested above, and then the whole string can be read from HDFS if > needed > > > for historical reasons. 
> > > > > > Cheers, > > > --Matt > > > > > > On 10/31/16, 1:38 PM, "[email protected]" <[email protected]> wrote: > > > > > > Hi All, > > > > > > I've been doing a bit of bug hunting with bro logs within the > search > > > tier > > > of Metron over the last couple of weeks. During this (in this > thread > > > I'm > > > primarily referring to METRON-517 > > > <https://issues.apache.org/jira/browse/METRON-517>) I've found a > > > couple of > > > issues that I'd like to discuss with the wider population. The > primary > > > issue here is due to a limitation in Lucene itself, meaning we > will > > > encounter the same problem with either Elasticsearch or Solr as > far as > > > I > > > can tell. > > > > > > *Problems* > > > > > > 1. Lucene, including the latest version (6.2.1), appears to have > a hard-coded > > > maximum term length of 32766 bytes (reference > > > <https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH>, > > > details <https://github.com/apache/lucene-solr/search?utf8=%E2%9C%93&q=32766>). If the > > > indexingBolt attempts to input a not_analyzed > > > <https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2> > > > tuple (set via these files <https://github.co>) which exceeds > that > > > limit, > > > the entire message is rejected. > > > - If you simply analyze the field it reduces the size of any > > > individual > > > term, but it also throws a wrench in your queries when you are > > > searching > > > for a match of that entire field. > > > > > > 2. From what I can tell, failures are only logged via > > > /var/log/elasticsearch/metron.log and in the Storm UI under the > Bolt > > > Id's > > > "Last error" column. 
- It looks like this is already partially documented as > METRON-307 > > > <https://issues.apache.org/jira/browse/METRON-307>. > > > > > > > > > From here on out I'm going to focus on Problem 1. > > > > > > > > > *Thoughts* > > > > > > 1. We could use multi-field > > > <https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html> > > > mappings to be able to do both a full and partial search for > fields > > > that > > > exceed the 32766-byte limit. > > > > > > 2. Truncate fields in the indexingBolt to keep not_analyzed > values > > > below > > > the 32766 limit. > > > > > > 3. Ensure that any field with the ability to grow beyond the > 32766 > > > limit is > > > analyzed, and that no single term surpasses the max term limit. > > > > > > There are some other ways to fix the problem, such as to not > store the > > > field > > > <https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html>, > > > not index the field > > > <https://www.elastic.co/guide/en/elasticsearch/refe>, ignore > > > fields larger than a set value > > > <https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html>, > > > etc., but I personally see these as confusing (to the end user) > and not > > > very > > > helpful. Others have brought up dynamic templates > > > <https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html> > > > as well, but I haven't looked into them yet. > > > > > > > > > *Concerns* > > > > > > Thought #1 > > > > > > - My current favourite (this is also what logstash does), but it > > > requires > > > that we analyze the whole message and store a truncated version > of the > > > whole message as a single large term. 
If truncation occurs we > would > > > need > > > to: > > > > > > - Add key-value pairs to the tuple that indicate that it was > > > truncated, what field(s) was/were truncated, the pre-truncated > size of > > > the > > > field(s), hash of pre-truncated field, and a timestamp of > truncation > > > (i.e. > > > to detect data tampering). > > > > > > - Provide UI elements that clearly show that a specific > message was > > > truncated. > > > > > > - May need to abstract querying in the UI. If so, this requires > a > > > sub-task > > > to METRON-195 and looking into an interim solution with Kibana. > > > > > > - See "Other thoughts". > > > > > > > > > Thought #2 > > > > > > - If we go this path we’d need to address how to do a full > string match > > > (i.e. abstract a copy/paste of a > 32766 length URI to use as a > query > > > in > > > the UI). This may or may not be possible with Kibana – if not, > this > > > needs > > > to be a subtask in METRON-195. > > > > > > - Add key-value pairs to the tuple that indicate that it was > > > truncated, > > > what field(s) was/were truncated, the pre-truncated size of the > > > field(s), > > > hash of pre-truncated field, and a timestamp of truncation (i.e. > to detect data > > > tampering). > > > > > > - Provide UI elements that clearly show that a specific message > was > > > truncated. > > > > > > - See "Other thoughts". > > > > > > > > > Thought #3 > > > > > > - Not a huge fan of this solution because of how it affects whole > > > string > > > matching. > > > > > > - May need a custom analyzer to cut up the URI properly. 
Here are some relevant materials if we go that path: <https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html>, <http://docs.oracle.com/javase/7/docs/api/java/net/URL.html>, <http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html>, <https://tools.ietf.org/html/rfc3986#section-3>, <http://download.java.net/jdk7/archive/b123/docs/api/java/net/URI.html>, <http://www.regexplanet.com/advanced/java/index.html>, <http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html>. > > > > > > > > > *Other thoughts* > > > > > > - Maybe we can use the profiler to monitor truncated=True and > watch for > > > people messing with this on purpose. > > > > > > - We could add a persistent UUID to every tuple and map HDFS > against > > > Elasticsearch data. This could be used by a UI/frontend to query > > > across > > > both datastores. Very useful in the case of truncation - > provide a > > > configurable setting that is false by default, but if set to > true it > > > will > > > query HDFS for data it got which has truncated:true in the > indexed > > > store. > > > > > > > > > I have more thoughts but this has gotten more than long enough > already > > > and > > > I wanted to send it off today. Thoughts? > > > Jon > > > -- > > > > > > Jon > > > > > > > > > > > > > > > > > > -- > > > > Jon > > > > > > -- Jon
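The truncation-plus-metadata approach discussed in the thread (truncate the field to below Lucene's 32766-byte term limit, and record that truncation happened, the pre-truncated size, a hash of the pre-truncated value, and a truncation timestamp) can be sketched roughly as follows. This is a hypothetical illustration in Python, not Metron's actual Java/Storm indexing bolt; the function name, the metadata key suffixes, and the 32000-byte cutoff are all assumptions:

```python
import hashlib
import time

# Lucene's hard-coded maximum indexed term length is 32766 bytes (UTF-8);
# truncate "slightly less than 32K", per the suggestion in the thread.
TRUNCATE_AT = 32000  # hypothetical cutoff, just below 32766

def truncate_field(message, field, limit=TRUNCATE_AT):
    """Truncate message[field] if its UTF-8 encoding exceeds `limit` bytes,
    recording tamper-evident metadata so truncation is obvious in the JSON."""
    value = message.get(field)
    if not isinstance(value, str):
        return message
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return message
    # Metadata proposed in the thread: truncated indicator, pre-truncated
    # size, hash of the pre-truncated field, and a timestamp of truncation.
    message[field + ".was_truncated"] = True
    message[field + ".original_byte_length"] = len(encoded)
    message[field + ".original_sha256"] = hashlib.sha256(encoded).hexdigest()
    message[field + ".truncated_at"] = int(time.time() * 1000)
    # Cut on a byte boundary; drop any multi-byte character split by the cut.
    message[field] = encoded[:limit].decode("utf-8", errors="ignore")
    return message
```

A profiler (or any downstream consumer) could then watch the truncated-indicator fields to spot clients deliberately abusing oversized values, as suggested under "Other thoughts", and the hash lets a truncated record be matched against the full copy retained in HDFS.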

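The multi-field idea (Thought #1) pairs an analyzed field with a not_analyzed "raw" sub-field without doubling storage. A rough sketch of what such a mapping might look like, written here as a Python dict in the Elasticsearch 2.x "string" syntax; the field names, the document type, and the 32000 ignore_above value are illustrative assumptions:

```python
import json

# Hypothetical mapping fragment for e.g. a Bro HTTP "uri" field: the parent
# field is analyzed (tokenized search), while the "raw" sub-field is
# not_analyzed (wildcard / whole-value match). ignore_above makes ES skip
# indexing the raw sub-field for oversized values instead of letting the
# document be rejected at Lucene's 32766-byte term limit.
uri_mapping = {
    "uri": {
        "type": "string",  # analyzed by default
        "fields": {
            "raw": {
                "type": "string",
                "index": "not_analyzed",
                "ignore_above": 32000,  # stay under the 32766-byte limit
            }
        },
    }
}

# Spliced into an index template / mapping, e.g.:
template_body = json.dumps({"mappings": {"bro_doc": {"properties": uri_mapping}}})
```

Note that `ignore_above` silently skips the oversized raw value rather than surfacing it, which is one reason the thread leans toward explicit truncation with a visible truncated:true indicator instead.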