RE- METRON-485

I believe that there are a couple of issues here.

1.  We don’t pass the -w timeout parameter when killing the topologies,
which means that technically they may not shut down cleanly.  We should change this.
2. Beyond the Storm timeouts, monit has its own timeouts and will ‘kill’ the
scripts itself if they don’t complete.
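
As a rough sketch of point 1 (not the actual Metron stop script), the kill
could pass an explicit wait time and refuse any value that the outer
supervisor would cut short. The function names and the
supervisor_timeout_secs parameter are hypothetical; the latter stands in for
whatever timeout monit applies to the script. Only the `storm kill <name> -w
<secs>` command line comes from Storm itself.

```python
import subprocess


def build_kill_command(topology, wait_secs):
    """Return the argv for `storm kill <topology> -w <wait_secs>`."""
    if wait_secs < 0:
        raise ValueError("wait_secs must be non-negative")
    return ["storm", "kill", topology, "-w", str(wait_secs)]


def kill_topology(topology, wait_secs=30, supervisor_timeout_secs=120):
    """Request a topology kill with an explicit wait time.

    supervisor_timeout_secs is a hypothetical stand-in for the timeout
    that monit applies to the stop script.
    """
    if wait_secs >= supervisor_timeout_secs:
        raise ValueError(
            "storm wait time must be shorter than the supervisor timeout"
        )
    cmd = build_kill_command(topology, wait_secs)
    # Bound the CLI call so it cannot outlive the supervisor's own timeout.
    return subprocess.run(cmd, check=True, timeout=supervisor_timeout_secs)
```

This only issues the kill request with the two timeouts kept consistent; it
does not confirm that the workers have actually exited.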

I believe that I have seen this happening in resource constrained testing
done with the Storm 1.0 work.

Two competing timeouts/settings here are a real yellow flag.

If the long term fix is the move from monit to ambari, that is fine.  But
in the meantime, getting something in to make this issue better (along
with other work done for the quick and full recently) is worth doing in my
opinion.

On November 4, 2016 at 06:31:04, [email protected] ([email protected]) wrote:

Please understand that my points mostly relate to perception and ease of
use, not what's technically possible or available. I'm coming at this from
the view that Metron should be a data analysis platform for the masses.

METRON-517/542 - While I'm willing to let this one go it depends on your
definition of non-issue. I personally believe that data (in every location
that it exists) needs to be obvious and have ultra high integrity. I'm not
concerned that the correct data won't exist somewhere in the cluster, I'm
focusing on it being easily accessible by an operations team that may
consist of entry level analysts. Once 517 is done and merged I would
consider that a short term mitigation is in place.

I feel like the project should stick to certain principles and a suggestion
is that data access is easy, accurate, and obvious. Do we have anything
like this that was agreed upon, discussed, or documented? Probably a
discussion for a different thread.

METRON-485/470/etc. were mostly to illustrate a consistency issue, and
resolving them would give a better first impression (assuming that people
monitoring the project will start using it more once it's non-BETA
software). First impressions are big in my book and could affect initial
adoption.

Regarding 485 - Otto may be able to clarify but I thought somebody else saw
this issue as well. I think the finger is currently being pointed at monit
timeouts and not storm. It also doesn't happen every single time, I only
run into it while the cluster is under load and after dozens of topology
restarts that I do when tuning parallelism in storm. I'm going to be
updating to storm 1.0.x in order to see if this still exists. Again, this
relates to ease of use/load testing/tuning.

Agree with the upgrade comments - as long as it's supported at some defined
point (IMHO this is when a project leaves BETA but others are welcome to
disagree).

Finally, I know this doesn't come across well in email but I'm just
mentioning items which I think are important, not attempting to demand that
they be fixed or that this doesn't leave beta. Thanks,

Jon

On Thu, Nov 3, 2016, 16:44 James Sirota <[email protected]> wrote:


Hi Jon,

Here are my thoughts around your objections.

METRON-517/METRON-542

I think the mechanism currently exists within Metron to make this a
non-issue. I believe you can solve it with a combination of a Stellar
statement and ES templates. As you mentioned, we can truncate the string
and then include the relevant metadata in the message (original length,
hash, etc). Cramming really long strings into ES is generally a bad thing,
which is why this limitation exists. The metadata in the indexed message
along with the timestamp allows you to pull data from HDFS should you need
to recover the full string.
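
A minimal sketch of the truncate-and-tag idea, written in Python for
illustration rather than as a Stellar statement or Metron code. The function
name and the metadata key names (_truncated, _original_length, _sha256) are
hypothetical, and the sketch assumes the 32766 limit discussed in this thread
is measured in UTF-8 bytes.

```python
import hashlib

MAX_TERM_BYTES = 32766  # limit discussed in this thread, assumed to be bytes


def truncate_field(message, field, max_bytes=MAX_TERM_BYTES):
    """Return a copy of message with `field` cut to max_bytes of UTF-8.

    When truncation happens, metadata about the original value is added
    under hypothetical key names so the full string can be located later.
    """
    out = dict(message)
    value = message.get(field)
    if not isinstance(value, str):
        return out
    raw = value.encode("utf-8")
    if len(raw) <= max_bytes:
        return out
    # errors="ignore" drops a multi-byte character split at the boundary.
    out[field] = raw[:max_bytes].decode("utf-8", errors="ignore")
    out[field + "_truncated"] = True
    out[field + "_original_length"] = len(raw)  # length in bytes
    out[field + "_sha256"] = hashlib.sha256(raw).hexdigest()
    return out
```

For example, truncate_field({"uri": long_uri, "timestamp": ts}, "uri") leaves
the timestamp untouched, so it can still be used together with the hash to
find the full string in HDFS.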

METRON-485

We cannot replicate this issue in our environment, but if it is indeed an
issue, it is an issue with Storm. A Jira should be filed against Storm
and not against Metron. My hunch, though, is that it's probably something
in your environment. I just tried stopping all topologies on my AWS
cluster and then went to all Storm nodes and didn't see any workers left
behind.

METRON-470

I think this is mainly a consistency issue. I don't think this impacts the
stability or function of the software. I think this is a nice to have,
maybe in the next few releases, but I don't think we absolutely have to
have this to drop BETA.

With respect to upgrades, here are my thoughts. There is really no way to
upgrade Metron 0.2.1 to Metron 0.2.2 in place because it requires a change
of HDP. The new build will only be compatible with HDP 2.5 and not 2.4.
So you have to lay down a new cluster regardless. We can document how to
get the configs off of your old Metron and plug them into your new Metron
so that it works the same. That shouldn't be a problem.

Our upgrade path for future releases will revolve around the Ambari Metron
management pack that is available with the upcoming build. Right now the
install capability is available and the upgrade capability will come in
incrementally within the next few releases. We will additionally deprecate
Monit and switch that functionality to Ambari as well. Finally, we will
also use Ambari for metrics monitoring. There is lots to do so we will
triage and prioritize Jiras as a community to see which parts we want to
tackle first. This is why your participation in the community is so
valuable.

Thanks,
James



03.11.2016, 11:07, "[email protected]" <[email protected]>:
> I agree that we can split METRON-517 into a short term and long term fix.
> I have attempted to organize my thoughts regarding the long term fix into
> METRON-542 and can get a PR out for METRON-517 soon to close that out.
>
> This leaves cluster tuning and a valid upgrade path for users, the latter of
> which is my predominant concern. If the team is willing to say that
> starting with 0.2.2 there will be a valid upgrade path to future releases I
> think that removing the BETA tag at 0.2.2 is reasonable. That said, this
> is just following my perception of what the BETA tag represents.
>
> Jon
>
> On Thu, Nov 3, 2016 at 11:50 AM Casey Stella <[email protected]> wrote:
>
>> Ok, regarding METRON-517, I've thought about this a bit having read your
>> really great and detailed JIRA as well as the discussion around this on the
>> dev list between you and Matt Foley. I want to separate the discussion
>> between what is the correct long-term solution for this issue versus what
>> is an acceptable solution.
>>
>> In terms of an acceptable work-around, my opinion is that because we allow
>> the user to modify the ES template they can
>>
>> - Adjust the template to specify ignore_above
>> <https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html>
>> on fields which they feel are likely to be large (maybe every string field)
>> - The combination of timestamp and ip_src_addr should be sufficient for
>> picking out the raw data in question from the HDFS store
>> - A stellar enrichment can be used to tag the messages with large URIs
>> and that can factor into the threat triage even or be used to filter in
>> kibana
>> - As you say, you can use the profiler to track counts of such messages
>> if you so desire and factor that into threat alerting or filtering in
>> kibana.
>>
>> Ultimately, I believe we have exposed the appropriate set of tooling to
>> provide an acceptable solution for the moment. Now, as for the best
>> long-term solution, I will let the good discussion on the mailing list and
>> JIRA continue and contribute my thoughts on the JIRA
>> <https://issues.apache.org/jira/browse/METRON-517>.
>>
>> Of course, this is just $0.02 :)
>>
>> Apologies to Dave, I wanted to mark this aspect of the discussion on this
>> thread as it is relevant to sufficient criteria to remove the BETA tag.
>>
>> Best,
>>
>> Casey
>>
>> On Thu, Nov 3, 2016 at 7:26 AM, [email protected] <[email protected]> wrote:
>>
>> > To clarify, it only needs to truncate fields > 32766 which need a
>> > full/exact string match search to be run on them (analyzed fields generally
>> > would not hit this limitation but I guess in theory they could). However,
>> > that's probably every field which can get > 32766 because I'm assuming
>> > those will all be strings.
>> >
>> > I also think using the profiler to monitor the truncation action could be a
>> > useful default.
>> >
>> > Jon
>> >
>> > On Wed, Nov 2, 2016, 21:08 [email protected] <[email protected]> wrote:
>> >
>> > > That would break searching on uri entirely unless you queried and knew to
>> > > truncate at 32766 because it's not analyzed. I don't like pushing that
>> > > complication to the end user.
>> > >
>> > > I would suggest truncation in the indexingBolt (not using stellar because
>> > > you'd want this across the board) for all fields > 32766 (how do we make
>> > > sure this gets updated if the limitation changes in Lucene?) and adding
>> > > metadata key-value pairs (pre-trunc length, hash, truncated bool, etc.).
>> > > In the URI scenario I would also suggest doing a multifield mapping by
>> > > default because of the way that data is useful (not sure which analyser to
>> > > use though - maybe write or find a good URI analyzer?). Since timestamp is
>> > > a required field for all messages (I'm pretty sure?) I'm ok with timestamp
>> > > and field value used as the UID, but would prefer something better.
>> > >
>> > > Jon
>> > >
>> > > On Wed, Nov 2, 2016, 20:33 James Sirota <[email protected]> wrote:
>> > >
>> > > Jon,
>> > >
>> > > For METRON-517 would it suffice to have a stellar statement to take a URI
>> > > string and truncate it to length of 32766 in the ES writer? But still
>> > > write the actual string to HDFS? You can then search against ES on the
>> > > truncated portion, but retrieve the actual timestamp from HDFS. It's easy
>> > > to do because you know the timestamp from the original message. So you
>> > > know which logs in HDFS to search through to find the data.
>> > >
>> > > 02.11.2016, 14:12, "[email protected]" <[email protected]>:
>> > > > I personally would like to see the following things done before things
>> > > > leave BETA:
>> > > > (1) Address data integrity concerns (Specifically thinking of METRON-370,
>> > > > METRON-517)
>> > > > (2) Make cluster tuning easier and more consistent (METRON-485, METRON-470,
>> > > > and the "[DISCUSS] moving parsers back to flux" which I can't find a JIRA
>> > > > for).
>> > > >
>> > > > I would also want to see the upgrade path (as opposed to rebuild) be more
>> > > > thoroughly and regularly tested once things leave BETA. From my
>> > > > perspective I think the project is very close but not yet ready.
>> > > >
>> > > > Jon
>> > > >
>> > > > On Wed, Nov 2, 2016 at 4:44 PM Casey Stella <[email protected]> wrote:
>> > > >
>> > > > Hello Everyone,
>> > > >
>> > > > Now that the discussion around the next release has started, it has been
>> > > > proposed and I think it's a good time to discuss what to name this next
>> > > > release. Before, we have adopted the BETA suffix. I think it might be
>> > > > time to drop it and call the next release 0.2.2
>> > > >
>> > > > Thoughts?
>> > > >
>> > > > Best,
>> > > >
>> > > > Casey
>> > > >
>> > > > --
>> > > >
>> > > > Jon
>> > >
>> > > -------------------
>> > > Thank you,
>> > >
>> > > James Sirota
>> > > PPMC- Apache Metron (Incubating)
>> > > jsirota AT apache DOT org
>> > >
>> > > --
>> > >
>> > > Jon
>> > >
>> > --
>> >
>> > Jon
>> >
> --
>
> Jon

-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

-- 

Jon
