I agree that we can split METRON-517 into a short term and long term fix.
I have attempted to organize my thoughts regarding the long term fix into
METRON-542 and can get a PR out for METRON-517 soon to close that out.

This leaves cluster tuning and a valid upgrade path for users, the latter of
which is my predominant concern.  If the team is willing to say that
starting with 0.2.2 there will be a valid upgrade path to future releases I
think that removing the BETA tag at 0.2.2 is reasonable.  That said, this
is just following my perception of what the BETA tag represents.

Jon

On Thu, Nov 3, 2016 at 11:50 AM Casey Stella <[email protected]> wrote:

> Ok, regarding METRON-517, I've thought about this a bit having read your
> really great and detailed JIRA as well as the discussion around this on the
> dev list between you and Matt Foley.  I want to separate the discussion
> between what is the correct long-term solution for this issue versus what
> is an acceptable solution.
>
> In terms of an acceptable work-around, my opinion is that because we allow
> the user to modify the ES template they can
>
>    - Adjust the template to specify ignore_above
>    <https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html>
>    on fields which they feel are likely to be large (maybe every string
>    field)
>    - The combination of timestamp and ip_src_addr should be sufficient for
>    picking out the raw data in question from the HDFS store
>    - A Stellar enrichment can be used to tag the messages with large URIs,
>    and that tag can factor into the threat triage or be used to filter in
>    Kibana
>    - As you say, you can use the profiler to track counts of such messages
>    if you so desire and factor that into threat alerting or filtering in
>    kibana.
>
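A minimal sketch of the first bullet, written as a Python dict that
serializes to an index template fragment.  The index pattern, document type,
and field name passed in below are illustrative assumptions, as is the ES
2.x-style "string" / "not_analyzed" mapping; none of them are taken from the
templates Metron ships.  Note that ignore_above counts characters while the
Lucene limit of 32766 counts bytes, so 32766 / 4 = 8191 is the conservative
value for UTF-8 data.

```python
import json

# Lucene rejects a single indexed term longer than this many bytes.
LUCENE_MAX_TERM_BYTES = 32766
# A UTF-8 character occupies at most 4 bytes.
MAX_UTF8_BYTES_PER_CHAR = 4
# ignore_above is a character count, so divide to stay under the byte limit.
IGNORE_ABOVE = LUCENE_MAX_TERM_BYTES // MAX_UTF8_BYTES_PER_CHAR


def template_fragment(index_pattern, doc_type, field):
    """Build a template fragment that skips indexing of oversized values.

    All three arguments are hypothetical names chosen by the caller.
    """
    return {
        "template": index_pattern,
        "mappings": {
            doc_type: {
                "properties": {
                    field: {
                        "type": "string",
                        "index": "not_analyzed",
                        "ignore_above": IGNORE_ABOVE,
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    fragment = template_fragment("bro_index*", "bro_doc", "uri")
    print(json.dumps(fragment, indent=2))
```

Values longer than the limit are left out of the index for that field rather
than rejected, so exact-match searches will not find them; the full value is
still in the raw data on HDFS.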
> Ultimately, I believe we have exposed the appropriate set of tooling to
> provide an acceptable solution for the moment.  Now, as for the best
> long-term solution, I will let the good discussion on the mailing list and
> JIRA continue and contribute my thoughts on the JIRA
> <https://issues.apache.org/jira/browse/METRON-517>.
>
> Of course, this is just $0.02 :)
>
> Apologies to Dave, I wanted to mark this aspect of the discussion on this
> thread as it is relevant to sufficient criteria to remove the BETA tag.
>
> Best,
>
> Casey
>
> On Thu, Nov 3, 2016 at 7:26 AM, [email protected] <[email protected]> wrote:
>
> > To clarify, it only needs to truncate fields > 32766 which need a
> > full/exact string match search to be run on them (analyzed fields
> > generally would not hit this limitation but I guess in theory they
> > could).  However, that's probably every field which can get > 32766
> > because I'm assuming those will all be strings.
> >
> > I also think using the profiler to monitor the truncation action could
> > be a useful default.
> >
> > Jon
> >
> > On Wed, Nov 2, 2016, 21:08 [email protected] <[email protected]> wrote:
> >
> > > That would break searching on uri entirely unless you queried and knew
> > > to truncate at 32766 because it's not analyzed.  I don't like pushing
> > > that complication to the end user.
> > >
> > > I would suggest truncation in the indexingBolt (not using stellar
> > > because you'd want this across the board) for all fields > 32766 (how
> > > do we make sure this gets updated if the limitation changes in
> > > Lucene?) and adding metadata key-value pairs (pre-trunc length, hash,
> > > truncated bool, etc.).  In the URI scenario I would also suggest doing
> > > a multifield mapping by default because of the way that data is useful
> > > (not sure which analyser to use though - maybe write or find a good
> > > URI analyzer?).  Since timestamp is a required field for all messages
> > > (I'm pretty sure?) I'm ok with timestamp and field value used as the
> > > UID, but would prefer something better.
> > >
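A rough sketch of the truncate-plus-metadata idea above, in Python rather
than inside the actual indexing bolt.  The function name and the metadata key
names (".truncated", ".original_length", ".sha256") are hypothetical and only
illustrate the "pre-trunc length, hash, truncated bool" pairs; the limit is
applied to UTF-8 bytes because that is what Lucene counts.

```python
import hashlib

# Lucene rejects a single indexed term longer than this many bytes.
LUCENE_MAX_TERM_BYTES = 32766


def truncate_field(message, field, limit=LUCENE_MAX_TERM_BYTES):
    """Return a copy of message with one field cut to `limit` UTF-8 bytes.

    When truncation happens, hypothetical metadata keys are added so the
    original value can still be identified in the raw store.
    """
    out = dict(message)
    value = message.get(field)
    if not isinstance(value, str):
        return out

    raw = value.encode("utf-8")
    if len(raw) <= limit:
        return out

    # errors="ignore" drops a multi-byte character cut in half at the limit.
    out[field] = raw[:limit].decode("utf-8", errors="ignore")
    out[field + ".truncated"] = True
    out[field + ".original_length"] = len(raw)
    out[field + ".sha256"] = hashlib.sha256(raw).hexdigest()
    return out


if __name__ == "__main__":
    indexed = truncate_field({"uri": "a" * 40000}, "uri")
    print(len(indexed["uri"]), indexed["uri.original_length"])
```

The hash of the untruncated value gives an exact-match handle for the full
string, which is one possible answer to wanting something better than
timestamp plus field value as the UID.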
> > > Jon
> > >
> > > On Wed, Nov 2, 2016, 20:33 James Sirota <[email protected]> wrote:
> > >
> > > Jon,
> > >
> > > For METRON-517 would it suffice to have a stellar statement to take a
> > > URI string and truncate it to length of 32766 in the ES writer?  But
> > > still write the actual string to HDFS? You can then search against ES
> > > on the truncated portion, but retrieve the actual timestamp from HDFS.
> > > It's easy to do because you know the timestamp from the original
> > > message.  So you know which logs in HDFS to search through to find the
> > > data.
> > >
> > > 02.11.2016, 14:12, "[email protected]" <[email protected]>:
> > > > I personally would like to see the following things done before
> > > > things leave BETA:
> > > > (1) Address data integrity concerns (Specifically thinking of
> > > > METRON-370, METRON-517)
> > > > (2) Make cluster tuning easier and more consistent (METRON-485,
> > > > METRON-470, and the "[DISCUSS] moving parsers back to flux" which I
> > > > can't find a JIRA for).
> > > >
> > > > I would also want to see the upgrade path (as opposed to rebuild) be
> > > > more thoroughly and regularly tested once things leave BETA. From my
> > > > perspective I think the project is very close but not yet ready.
> > > >
> > > > Jon
> > > >
> > > > On Wed, Nov 2, 2016 at 4:44 PM Casey Stella <[email protected]> wrote:
> > > >
> > > > Hello Everyone,
> > > >
> > > > Now that the discussion around the next release has started, it has
> > > > been proposed and I think it's a good time to discuss what to name
> > > > this next release. Before, we have adopted the BETA suffix. I think
> > > > it might be time to drop it and call the next release 0.2.2
> > > >
> > > > Thoughts?
> > > >
> > > > Best,
> > > >
> > > > Casey
> > > >
> > > > --
> > > >
> > > > Jon
> > >
> > > -------------------
> > > Thank you,
> > >
> > > James Sirota
> > > PPMC- Apache Metron (Incubating)
> > > jsirota AT apache DOT org
> > >
> > > --
> > >
> > > Jon
> > >
> > --
> >
> > Jon
> >
>
-- 

Jon
