Re: Architectural reason to split in 4 topologies / impact on the kafka resources

Carolyn Duby Wed, 27 Jun 2018 13:42:19 -0700

Another reason to keep the original string is that you may not want to extract all 
components of the original event into JSON.  Windows events are a good example: you 
will want to keep the original event, but you will not want to extract 
everything because the events are very verbose.


You should have a choice, per sensor type, of whether to include the 
original string in the index or not.
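
A rough sketch of what such a per-sensor choice could look like is below. The 
config keys, the sensor names, and the function name are hypothetical and are 
not Metron's actual indexing config; the sketch only assumes that the raw 
event travels in a field called original_string.

```python
# Hypothetical per-sensor indexing settings; these keys are illustrative
# assumptions and do not correspond to Metron's real indexing configuration.
INDEXING_CONFIG = {
    "windows_events": {"index_original_string": False},
    "bro": {"index_original_string": True},
}

# Assumed name of the field that carries the raw event text.
ORIGINAL_FIELD = "original_string"


def prepare_for_index(message: dict, sensor_type: str,
                      config: dict = INDEXING_CONFIG) -> dict:
    """Return a copy of the message, dropping the raw string if the sensor opts out."""
    settings = config.get(sensor_type, {})
    # Default to keeping the original string so nothing is lost unless asked for.
    if settings.get("index_original_string", True):
        return dict(message)
    return {k: v for k, v in message.items() if k != ORIGINAL_FIELD}
```

The input message is never modified, so an unfiltered copy remains available 
to whatever else needs the full event.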

Thanks  

Carolyn Duby
Solutions Engineer, Northeast
[email protected]
+1.508.965.0584

On 6/25/18, 8:02 PM, "Simon Elliston Ball" <[email protected]> wrote:

>The original string serves purposes well beyond debugging. Many users will
>need to be able to prove provenance to the raw logs in order to prove or
>prosecute an attack from an internal threat, or provide evidence to law
>enforcement of an external threat. As such, the original string is
>important.
>
>It also provides a valuable source for free-text search where parsing
>has not extracted all the necessary tokens for a hunt use case, so it can
>be a valuable field to have in Elastic or Solr for text rather than keyword
>indexing.
>
>That said, it may make sense to remove a heavyweight processing and
>storage field like this from the Lucene store. We have been talking for a
>while about filtering some of the data out of the realtime index, and
>preserving full copies in the batch index, which could meet the forensic
>use cases above, and would make it a matter of user choice. That would
>probably be configured through indexing config to filter fields.
>
>Simon
>
>On 25 June 2018 at 23:43, Michel Sumbul <[email protected]> wrote:
>
>> Depending on the source of data, it might be interesting to bypass a step
>> that the user considers useless.
>> For example, you may have a source of data that doesn't need profiling but
>> that you want ingested like the other sources, so that the SOC analysts can
>> use it in their analysis and have everything in the same place.
>>
>> How can we bypass it for a specific sensor?
>>
>> 2018-06-25 23:38 GMT+01:00 James Sirota <[email protected]>:
>>
>> > There is a way to wire the system to bypass enrichment and profiling, but
>> > you would then bypass a lot of key features of the system.  It would be
>> > unwise to do that.
>> >
>> > 25.06.2018, 15:13, "Michel Sumbul" <[email protected]>:
>> > > Hi Casey,
>> > >
>> > > That makes complete sense.
>> > > Short question: if there is no enrichment or no profiling, does the
>> > > message still pass through the enrichment/profiling topic?
>> > >
>> > > If yes, do you think it's possible to imagine a way for messages that
>> > > don't need enrichment or profiling to skip the topic and go directly
>> > > to the next one? This is again to avoid in/out in Kafka.
>> > >
>> > > Thanks for the explanation,
>> > > Michel
>> > >
>> > > 2018-06-23 3:58 GMT+01:00 Casey Stella <[email protected]>:
>> > >
>> > >>  Hey Michel,
>> > >>
>> > >>  Those are good questions, and there were some reasons surrounding that.
>> > >>  In fact, historically, we had fewer topologies (e.g. indexing and
>> > >>  enrichment were merged). Even earlier on, we had just one giant topology
>> > >>  per parser that enriched and indexed. Long story short, we moved this
>> > >>  way because we saw how people were using Metron and we gained more
>> > >>  insight tuning Metron. That led us down this architectural path.
>> > >>
>> > >>  Some of the reasons that we went this way:
>> > >>
>> > >>     - Fewer large topologies were a nightmare to tune
>> > >>        - Enrichment would have different memory requirements than,
>> > >>        say, parsers or indexing
>> > >>        - You can adjust the Kafka topic params per topology to adjust
>> > >>        the number of partitions, etc.
>> > >>     - Having the separate topologies gives a natural set of extension
>> > >>     points for customization and enhancement (e.g. you want a phase
>> > >>     between parsing and enrichment).
>> > >>     - Decoupling the topologies lets us spin up and down parts of
>> > >>     Metron without affecting others (e.g. you don't have to take down
>> > >>     enrichments to add a parser, even for a moment)
>> > >>     - The movement to Flux meant we were limited in how much we could
>> > >>     adjust the topology at runtime (e.g. colocating parsers and
>> > >>     enrichment would mean essentially moving away from Flux, as the
>> > >>     topology changes its structure)
>> > >>
>> > >>  Best,
>> > >>
>> > >>  Casey
>> > >>
>> > >>  On Fri, Jun 22, 2018 at 5:25 PM Michel Sumbul <[email protected]> wrote:
>> > >>
>> > >>  > Hi Everyone,
>> > >>  >
>> > >>  > I was asking myself what the architectural reason was for splitting
>> > >>  > the ingestion in Metron into 4 different topologies that all
>> > >>  > read/write to Kafka.
>> > >>  >
>> > >>  > For example, why have the parsing and enrichment topologies not been
>> > >>  > merged? Would it not be possible to enrich the message directly when
>> > >>  > you parse it?
>> > >>  >
>> > >>  > I'm asking because splitting into several topologies means that all of
>> > >>  > the topologies read/write to Kafka, which produces a bigger load on the
>> > >>  > Kafka cluster and then a need for far more infrastructure/servers. The
>> > >>  > cost is especially significant when we speak about TBs of data ingested
>> > >>  > every day.
>> > >>  >
>> > >>  > I'm sure there was a very good reason; I was just curious.
>> > >>  >
>> > >>  > Thanks,
>> > >>  > Michel
>> > >>  >
>> >
>> > -------------------
>> > Thank you,
>> >
>> > James Sirota
>> > PMC- Apache Metron
>> > jsirota AT apache DOT org
>> >
>> >
>>
>
>
>
>-- 
>--
>simon elliston ball
>@sireb
