Right, that is the current state, and the short-term fix here is to use ignore_above in the template, which will let us drop only the oversized field rather than reject the entire message. I'm not a huge fan of that solution, but it's better than the alternative, and I've tried to make sure this issue is well documented (METRON-542 and all sub-tasks) so it will see eventual resolution in a way that can be easily mirrored (or even abstracted) to other fields in a similar situation.
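For concreteness, a minimal sketch of what that template change might look like (the index pattern, type, field, and the 10000 cutoff are illustrative, not our actual template). With ignore_above, a string longer than the limit simply isn't indexed for that field, but the document itself still indexes and its _source keeps the full value, so only the oversized field drops out of search rather than the whole message:

```python
import json

LUCENE_MAX_TERM_BYTES = 32766  # Lucene's hard-coded maximum term length

# Hypothetical template fragment showing ignore_above on a not_analyzed
# string field; names here are placeholders, not Metron's real template.
template = {
    "template": "metron_*",
    "mappings": {
        "bro_doc": {
            "properties": {
                "uri": {
                    "type": "string",
                    "index": "not_analyzed",
                    # stay safely under the 32766-byte term limit
                    "ignore_above": 10000
                }
            }
        }
    }
}

print(json.dumps(template, indent=2))
```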
Jon

On Thu, Nov 3, 2016 at 3:50 PM Matt Foley <[email protected]> wrote:

> I'll ping a couple people I know, too. Can't guarantee a prompt response :-)
>
> You said the current solution actually rejects indexing any message > 32K. That seems unfortunate. But I would consider a completely "dumb" truncation solution (the simplest form of your Solution #2, even without any warnings or indicators) to be very acceptable. After all, the current situation already deals with the vast majority of messages, and truncating would deal with 99% of use cases for the remainder. Speaking loosely, of course!
>
> --Matt
>
> On 11/3/16, 11:25 AM, "[email protected]" <[email protected]> wrote:
>
> > I totally agree - unfortunately I wasn't able to find anybody who publicly posted a similar enough scenario to make use of. I did reach out to some people on the Elastic and Lucene teams for validation of what I put forward in METRON-542 and got a thumbs up, but I'm always OK with more review if you happen to have some contacts. I will say that I'm comfortable with the current solution, especially if I'm able to realize some benefit to monitoring abuse with the profiler. What do you think?
> >
> > Jon
> >
> > On Thu, Nov 3, 2016 at 2:15 PM Matt Foley <[email protected]> wrote:
> >
> > > Right, for both (1) and (2) there is the problem that if you happen to break the long string in the middle of a substring that you want to use as a search term, you don't see anything with the non-analyzed string search, and may not see anything with the analyzed string search if you broke or truncated in the middle of a token instead of between tokens. And I agree entirely with you about the need for full data integrity as an option for those willing to spend the storage.
> > >
> > > Agree this is a hard problem. The best response I can think of is that this has to be a previously solved problem (or perhaps a previously-worked-around problem, since this looks like it may not have a complete solution). We really should consult with ES/Solr experts about best practice in these cases. I'm not one, although I've done a couple of projects - of course, if you are one, please don't take this wrong :-) But I would definitely welcome input from folks with a lot more experience than I have in full-text indexing of large documents specifically.
> > >
> > > I guess my take on it is to stop thinking of such large bodies of text as a "field" and start thinking of them as a "document", where things like ordered proximity matching have known (if imperfect) solutions.
> > >
> > > Thanks,
> > > --Matt
> > >
> > > On 11/1/16, 6:36 AM, "[email protected]" <[email protected]> wrote:
> > >
> > > > Hi Matt,
> > > >
> > > > First off, thanks for the great response. I'm going to try to keep this one brief, but there's a lot of nuance here that I want to make sure is obvious.
> > > > We need to store long not_analyzed fields because (1) wildcards <https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html> are only available on not_analyzed fields in Elasticsearch (because they function at the term level <https://www.elastic.co/guide/en/elasticsearch/guide/current/_wildcard_and_regexp_queries.html>), and (2) proximity matching <https://www.elastic.co/guide/en/elasticsearch/guide/current/proximity-matching.html> and ordering of terms is not straightforward/available for string fields (a way to fix this would be a custom analyzer - the standard analyzer doesn't apply here). For instance, when trying to find a specific web shell that's being used on your network, you may just care to search for a URI that contains "c99shell.php" or "r57shell.php", which is a much faster search with an analyzed field. You may also have an APT that planted a specific shell with a certain name or a certain path on your system, so you'd want to search "/this/is/a/bad/shell.php*", which is only allowed with a not_analyzed field (it accounts for query strings, etc.). Other fields have similar scenarios, such as searching a cookie value for Shellshock, etc.
> > > >
> > > > It's also important to note that I'm not currently considering HDFS data "reasonable" for a SOC analyst to access within Metron at this time, but one of the things that I am proposing is that we make it reasonable, or at least start to lay the foundation so it can be reasonable in the future.
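> > > > To make the wildcard case concrete, a search for that shell path might be issued as the query body below (a sketch; the "uri" field name and index layout are illustrative). It only behaves as a whole-value match because the field is not_analyzed, i.e. the entire URI is stored as one term - which is exactly why the Lucene term-length limit bites:

```python
import json

# Hypothetical wildcard query against a not_analyzed "uri" field.
# Wildcards operate at the term level, so on a not_analyzed field the
# pattern is matched against the full original value as a single term.
query = {
    "query": {
        "wildcard": {
            "uri": "/this/is/a/bad/shell.php*"
        }
    }
}

print(json.dumps(query))
```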
> > > > For your follow-up to (a), I think that the timestamp + 32K string prefix wouldn't be a bad indicator of a unique record, if it is only being used to connect HDFS to Lucene. However, my understanding is that there is already a unique indicator being used upstream which just isn't persisted (I would defer to others on this one; this is just hearsay...), and if that exists, I'd vote to just persist it and use that. I would also argue both that (1) incredibly high integrity of data is important for both the security use case and data at this scale, and (2) these fields could be intentionally manipulated by the client to avoid indexing.
> > > >
> > > > For your follow-up to (b), I agree that this is another avenue we could take. Personally, because of the relative infrequency of this event (when not intentionally abused) and the fact that we may have people directly querying Elasticsearch (and not using the UI), I prefer a truncated field which then has an obvious indicator in the JSON like truncated:true, but I guess we could also have a field_split:true which performs similarly. Probably the best reason I can give for preferring the former is to keep the amount of data in Elasticsearch smaller and trust HDFS to be the true, large ("full") datastore at an architectural level. I could see an argument that it makes out-of-band querying/third-party UIs slightly more complicated, because they would need to query both Elasticsearch and HDFS, as opposed to doing a more complicated query against Elasticsearch only.
> > > >
> > > > Regarding multi-field, you're exactly right. In my last-minute editing of the first email I forgot to add that it would need to be truncated as well.
> > > >
> > > > Finally, I'm not sure I follow your last paragraph. What does "other records will simply lack a URI field" mean, exactly? I don't see a scenario where that is necessary, unless you are referring to non-(Bro+HTTP) records. I think that this does have to be federated into search, because (1) this field is under the control of the client (potentially an adversary) and thus can be purposefully abused, and (2) if Metron goes 1.0 without fully resolving this issue (including in the UIs), it could significantly impact uptake and reliance.
> > > >
> > > > Jon
> > > >
> > > > On Mon, Oct 31, 2016 at 6:28 PM Matt Foley <[email protected]> wrote:
> > > >
> > > > > Hi Jon, interesting topic. A couple of questions come to mind:
> > > > >
> > > > > What is the reason we need to store very long not_analyzed values in the Index (which evidently is not its design sweet spot)? Is it because:
> > > > > a. It is valuable to be able to Search against the whole string even when that string is > 32K?
> > > > > b. If HDFS is not also being used, the Index is the only source of historical truth for these records?
> > > > >
> > > > > If the answer is (a), the follow-up question is: Isn't it safe to assume that there's a timestamp near the beginning of the raw string? So if we provided a 32K prefix string match, including the timestamp, wouldn't that be pretty much as good? Or are there lots of cases where the first 32K+, including the timestamp, are truly the same, yet they differ after 32K of text? If a 32K prefix string match is sufficient for 99% of cases, then a fixed-length truncation limit slightly less than 32K, on both Index and Search, will suffice. This is essentially your approach #2 - and it's simple.
> > > > >
> > > > > If the answer is (b), then we can be satisfied with any approach that splits the very long strings up somehow and stores them all in a way that allows reassembling them in the correct sequence. It does require a reformulator for querying in the UI, as you note in your "Concerns - Thoughts #1" below.
> > > > >
> > > > > I'm probably misunderstanding, but I don't see how multi-field helps. According to the elastic.co doc you reference, multi-field allows storing and searching both the analyzed and not_analyzed sub-fields without doubling the storage size (which is clearly very useful), but the not_analyzed sub-field should still have the 32K limit. Is this not so? Or are you proposing that a multi-field mapping could encapsulate the several sub-strings needed to contain a >32K string? E.g., as "raw", "raw1", "raw2", etc., where each is <32K?
> > > > >
> > > > > A sub-case of approach #2, relating to your second "Other Thoughts", would be: always truncate the indexed string to slightly less than 32K, but store the full value of any such string in HDFS, and include in the Index a reference (file URI with offset) that allows retrieving it. This solution can be limited to just the >32K strings, so other records will simply lack a URI field.
> > > > > And it doesn't have to be federated into Search as you suggest: the 32K prefix string Search should be quite adequate, as suggested above, and then the whole string can be read from HDFS if needed for historical reasons.
> > > > >
> > > > > Cheers,
> > > > > --Matt
> > > > >
> > > > > On 10/31/16, 1:38 PM, "[email protected]" <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I've been doing a bit of bug hunting with Bro logs within the search tier of Metron over the last couple of weeks. During this (in this thread I'm primarily referring to METRON-517 <https://issues.apache.org/jira/browse/METRON-517>) I've found a couple of issues that I'd like to discuss with the wider population. The primary issue here is due to a limitation in Lucene itself, meaning we will encounter the same problem with either Elasticsearch or Solr, as far as I can tell.
> > > > > >
> > > > > > *Problems*
> > > > > >
> > > > > > 1. Lucene, including the latest version (6.2.1), appears to have a hard-coded maximum term length of 32766 (reference <https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH>; see <https://github.com/apache/lucene-solr/search?utf8=%E2%9C%93&q=32766> for details). If the indexingBolt attempts to input a not_analyzed <https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2> tuple (set via these files <https://github.co>) which exceeds that limit, the entire message is rejected.
> > > > > > - If you simply analyze the field, it reduces the size of any individual term, but it also throws a wrench in your queries when you are searching for a match on that entire field.
> > > > > >
> > > > > > 2. From what I can tell, failures are only logged via /var/log/elasticsearch/metron.log and in the Storm UI under the Bolt Id's "Last error" column.
> > > > > > - It looks like this is already partially documented as METRON-307 <https://issues.apache.org/jira/browse/METRON-307>.
> > > > > >
> > > > > > From here on out I'm going to focus on Problem 1.
> > > > > >
> > > > > > *Thoughts*
> > > > > >
> > > > > > 1. We could use multi-field <https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html> mappings to be able to do both a full and partial search on fields that exceed the 32766 length.
> > > > > >
> > > > > > 2. Truncate fields in the indexingBolt to keep not_analyzed values below the 32766 limit.
> > > > > >
> > > > > > 3. Ensure that any field with the ability to grow beyond the 32766 limit is analyzed, and that no single term surpasses the max term limit.
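> > > > > > As a sketch of what #2 plus the tamper-evidence metadata could look like in the indexingBolt (the metadata field names - truncated, truncated_fields, *_original_length, *_original_hash, truncation_timestamp - are illustrative, not an agreed-upon schema; note the limit is in UTF-8 bytes, since that is how Lucene measures terms):

```python
import hashlib
import time

LUCENE_MAX_TERM_BYTES = 32766  # Lucene's hard-coded maximum term length

def truncate_field(tuple_doc, field="uri", limit=LUCENE_MAX_TERM_BYTES - 1000):
    """Truncate an oversized field and stamp tamper-evident metadata.

    A sketch only: field names are hypothetical. Truncation operates on
    UTF-8 bytes because Lucene's limit is measured in bytes, not chars.
    """
    value = tuple_doc.get(field)
    if value is None:
        return tuple_doc
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return tuple_doc
    # Record provenance before cutting, so intentional abuse can be
    # monitored later (e.g. the profiler watching truncated:true).
    tuple_doc["truncated"] = True
    tuple_doc.setdefault("truncated_fields", []).append(field)
    tuple_doc[field + "_original_length"] = len(encoded)
    tuple_doc[field + "_original_hash"] = hashlib.sha256(encoded).hexdigest()
    tuple_doc["truncation_timestamp"] = int(time.time() * 1000)
    # Cut on a byte boundary, dropping any split multibyte character.
    tuple_doc[field] = encoded[:limit].decode("utf-8", errors="ignore")
    return tuple_doc
```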
> > > > > > There are some other ways to fix the problem, such as to not store the field <https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html>, not index the field <https://www.elastic.co/guide/en/elasticsearch/refe>, ignore fields larger than a set value <https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html>, etc., but I personally see these as confusing (to the end user) and not very helpful. Others have brought up dynamic templates <https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html> as well, but I haven't looked into them yet.
> > > > > >
> > > > > > *Concerns*
> > > > > >
> > > > > > Thought #1
> > > > > >
> > > > > > - My current favourite (this is also what Logstash does), but it requires that we analyze the whole message and store a truncated version of the whole message as a single large term. If truncation occurs we would need to:
> > > > > > - Add key-value pairs to the tuple that indicate that it was truncated, which field(s) was/were truncated, the pre-truncation size of the field(s), a hash of the pre-truncation field, and a timestamp of truncation (i.e. data tampering).
> > > > > > - Provide UI elements that clearly show that a specific message was truncated.
> > > > > > - May need to abstract querying in the UI. If so, this requires a sub-task to METRON-195 and looking into an interim solution with Kibana.
> > > > > > - See "Other thoughts".
> > > > > >
> > > > > > Thought #2
> > > > > >
> > > > > > - If we go this path we'd need to address how to do a full string match (i.e. abstract a copy/paste of a > 32766 length URI to use as a query in the UI). This may or may not be possible with Kibana - if not, this needs to be a sub-task in METRON-195.
> > > > > > - Add key-value pairs to the tuple that indicate that it was truncated, which field(s) was/were truncated, the pre-truncation size of the field(s), a hash of the pre-truncation field, and a timestamp of truncation (i.e. data tampering).
> > > > > > - Provide UI elements that clearly show that a specific message was truncated.
> > > > > > - See "Other thoughts".
> > > > > >
> > > > > > Thought #3
> > > > > >
> > > > > > - Not a huge fan of this solution because of how it affects whole-string matching.
> > > > > > - May need a custom analyzer to cut up the URI properly. Here are some relevant materials if we go that path: <https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html>, <http://docs.oracle.com/javase/7/docs/api/java/net/URL.html>, <http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html>, <https://tools.ietf.org/html/rfc3986#section-3>, <http://download.java.net/jdk7/archive/b123/docs/api/java/net/URI.html>, <http://www.regexplanet.com/advanced/java/index.html>, <http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html>.
> > > > > >
> > > > > > *Other thoughts*
> > > > > >
> > > > > > - Maybe we can use the profiler to monitor truncated=True and watch for people messing with this on purpose.
> > > > > > - We could add a persistent UUID to every tuple and map HDFS against Elasticsearch data. This could be used by a UI/frontend to query across both datastores. Very useful in the case of truncation - provide a configurable setting that is false by default, but if set to true it will query HDFS for any data which has truncated:true in the indexed store.
> > > > > >
> > > > > > I have more thoughts, but this has gotten more than long enough already and I wanted to send it off today. Thoughts?
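> > > > > > The persistent-UUID idea could be as simple as the sketch below (names like metron_uuid are hypothetical; the hdfs_lookup callable stands in for whatever HDFS access layer a UI would use). The id is attached once, before the indexing and HDFS writers diverge, so a frontend that sees truncated:true in the index can fetch the full record from HDFS by the same id:

```python
import uuid

def assign_message_id(tuple_doc):
    """Attach a persistent UUID to a tuple if it doesn't already have one.

    Sketch only: "metron_uuid" is an illustrative field name, not an
    existing Metron field. setdefault keeps an already-assigned id stable.
    """
    tuple_doc.setdefault("metron_uuid", str(uuid.uuid4()))
    return tuple_doc

def fetch_full_record(indexed_doc, hdfs_lookup):
    """If the indexed copy was truncated, fall back to HDFS by UUID.

    hdfs_lookup is a hypothetical callable mapping a UUID to the full
    record stored in HDFS.
    """
    if indexed_doc.get("truncated"):
        return hdfs_lookup(indexed_doc["metron_uuid"])
    return indexed_doc
```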
> > > > > > Jon

-- 
Jon
