Based on this design, I submitted PR 208 [1]. Sometimes it is easier to judge something when you have real code to look at (rather than my mangled grammar in a design doc). If anyone has feedback, please let me know.
There was one change to the original design. The original topology design looks something like the following:

  [Kafka Topic] -> KafkaSpout -> ProfileSplitter -> ProfileBuilder -> [HBase Table]

There is a separate instance of a ProfileBuilder for each Profile-Entity pair. For example, if I have a single profile called Profile1 and the data coming in contains 2 entities/hosts associated with that profile, then there would be two different ProfileBuilder instances: the first for Profile1-Host1 and the second for Profile1-Host2. Based on the flush interval (roughly every 15 minutes), each of these instances would flush its data to HBase. This provides little to no opportunity to optimize the writes to HBase, as each instance is writing on its own schedule. This can become really problematic when the number of profiles increases or the number of entities for any single profile increases. (The number of entities increasing is the scarier problem, because it can happen just based on changes in the data being received, not necessarily any action that you knowingly took to add profiles.)

The design change involved adding a separate bolt responsible for writing to HBase. Having a separate bolt in the topology allows the writes to be aggregated and optimized. For example, I can batch the writes from multiple ProfileBuilder bolts and write them to HBase in a single batch. (I have appended a couple of rough sketches of what I mean at the bottom of this mail, below the quoted thread.)

  [Kafka Topic] -> KafkaSpout -> ProfileSplitter -> ProfileBuilder -> HBaseBolt -> [HBase Table]

I think this is a common pattern that occurs in many use cases. At least it occurs frequently enough for Storm to provide some dedicated code to handle it: storm-hbase [2]. Unfortunately, I was not able to use the storm-hbase code because of the versions of Storm and HBase that we use with Metron. The version of Storm we use only supports HBase 0.98.x, and only in the very latest versions of Storm did they bump support up to newer versions of HBase. Trust me, I tried to make it work. It felt icky, but I had to roll my own. I kept that code isolated enough that we can swap in storm-hbase should that ever become a possibility.

[1] https://github.com/apache/incubator-metron/pull/208
[2] https://github.com/apache/storm/tree/master/external/storm-hbase

On Fri, Aug 5, 2016 at 10:54 AM, Nick Allen <[email protected]> wrote:

> https://issues.apache.org/jira/browse/METRON-309
>
> On Fri, Aug 5, 2016 at 8:58 AM, Casey Stella <[email protected]> wrote:
>
>> I don't think the attachment came through, Nick. Can you post the PDF on
>> the JIRA?
>>
>> On Wed, Aug 3, 2016 at 4:22 PM, Nick Allen <[email protected]> wrote:
>>
>> > I have been thinking through the implementation of something that I am
>> > calling the "Entity Profiler." The idea/concept was passed on to me by
>> > James Sirota and I think it would be very useful as a part of Metron.
>> >
>> > I have a draft design that I would love to get feedback on. Please see
>> > the attached PDF. If anything is not clear, please let me know.
>> >
>> > *The Entity Profiler is a feature extraction mechanism that can capture a
>> > Profile that describes any Entity on a network. The Entity might be a
>> > server, user, subnet or application. The Profile itself is simply a time
>> > series of numeric values.*
>> >
>> > *The Entity Profiler will enable feature extraction using sliding windows
>> > over streaming telemetry data. The Entity Profiler will enable a summary
>> > statistic to be applied to raw data over a given time horizon. Collecting
>> > these values across many time horizons results in a time series that is
>> > useful for analysis.*
>> >
>> > Hopefully that is enough of a tease to gain your interest.
>> >
>> > Thanks
>> >
>> > --
>> > Nick Allen <[email protected]>
>>
>
> --
> Nick Allen <[email protected]>

--
Nick Allen <[email protected]>
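
P.S. To make the wiring change a bit more concrete, below is a rough sketch of how the revised topology hangs together. To be clear, this is not the code from the PR: the spout and bolts are passed in as plain Storm interfaces so the sketch stands on its own, the parallelism hints are arbitrary, and the "profile"/"entity" field names are just assumptions for illustration. The two things to notice are the fields grouping, which keeps all of the tuples for a given Profile-Entity pair on the same ProfileBuilder task, and the fact that everything funnels into a single HBaseBolt where the writes can be batched.

// Storm 1.x packages; older releases use backtype.storm.* instead.
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class ProfilerTopologySketch {

  // Wires up: KafkaSpout -> ProfileSplitter -> ProfileBuilder -> HBaseBolt.
  public static StormTopology build(IRichSpout kafkaSpout,
                                    IRichBolt profileSplitter,
                                    IRichBolt profileBuilder,
                                    IRichBolt hbaseBolt) {
    TopologyBuilder builder = new TopologyBuilder();

    // Telemetry enters the topology from a Kafka topic.
    builder.setSpout("kafkaSpout", kafkaSpout, 1);

    // Splits each message into one tuple per (profile, entity) pair.
    builder.setBolt("profileSplitter", profileSplitter, 2)
        .shuffleGrouping("kafkaSpout");

    // The fields grouping routes all tuples for a given (profile, entity)
    // pair to the same ProfileBuilder task.
    builder.setBolt("profileBuilder", profileBuilder, 4)
        .fieldsGrouping("profileSplitter", new Fields("profile", "entity"));

    // Everything funnels into one bolt so the HBase writes can be
    // aggregated and batched.
    builder.setBolt("hbaseBolt", hbaseBolt, 1)
        .shuffleGrouping("profileBuilder");

    return builder.createTopology();
  }
}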

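Here is an equally rough sketch of the batching idea inside the HBase-writing bolt. Again, this is not the code from the PR; it is just the general shape of a hand-rolled replacement for storm-hbase. The batch size, the flush-on-tick approach, the "put" tuple field, and the HBase 0.98 HTableInterface client are all assumptions to keep the example short (in practice you would likely emit the profile measurement itself and build the Put inside the bolt). The point is simply that, with everything funnelled through one bolt, the Puts from many ProfileBuilder instances can be written with a single table.put(List<Put>) call rather than each instance writing on its own schedule.

// Sketch only: assumes upstream ProfileBuilder bolts emit a ready-made
// HBase Put in a tuple field named "put" (that field name is an assumption).
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class BatchingHBaseBoltSketch extends BaseRichBolt {

  private final String tableName;
  private final int batchSize;
  private final int flushIntervalSecs;
  private transient OutputCollector collector;
  private transient HTableInterface table;
  private transient List<Put> puts;
  private transient List<Tuple> pending;

  public BatchingHBaseBoltSketch(String tableName, int batchSize, int flushIntervalSecs) {
    this.tableName = tableName;
    this.batchSize = batchSize;
    this.flushIntervalSecs = flushIntervalSecs;
  }

  @Override
  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    this.puts = new ArrayList<>();
    this.pending = new ArrayList<>();
    try {
      // HBase 0.98-style client; real code would manage/share the connection.
      this.table = HConnectionManager
          .createConnection(HBaseConfiguration.create())
          .getTable(tableName);
    } catch (IOException e) {
      throw new RuntimeException("Unable to connect to HBase", e);
    }
  }

  @Override
  public void execute(Tuple tuple) {
    if (isTick(tuple)) {
      flush();  // flush whatever has accumulated since the last tick
      return;
    }
    puts.add((Put) tuple.getValueByField("put"));
    pending.add(tuple);
    if (puts.size() >= batchSize) {
      flush();  // or flush early once the batch fills up
    }
  }

  private void flush() {
    try {
      table.put(puts);       // one round trip for the whole batch
      table.flushCommits();
      for (Tuple t : pending) {
        collector.ack(t);
      }
    } catch (IOException e) {
      for (Tuple t : pending) {
        collector.fail(t);   // let Storm replay the batch
      }
    } finally {
      puts.clear();
      pending.clear();
    }
  }

  private boolean isTick(Tuple tuple) {
    return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
        && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
  }

  @Override
  public Map<String, Object> getComponentConfiguration() {
    // ask Storm to send this bolt a tick tuple on the flush interval
    Map<String, Object> conf = new HashMap<>();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, flushIntervalSecs);
    return conf;
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // terminal bolt: nothing to emit downstream
  }
}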