Yep. Exactly. Ok, cool. I'll file a couple of JIRAs to get the ball rolling.
-Kyle

On Jan 6, 2017, at 5:21 PM, [email protected] <[email protected]> wrote:

I think we can use the ES templates to start off, as Avro (HDFS) and ES should be in sync. There already is some normalization in place for field names (ip_src_addr, etc.) so that enrichment can work across any log type. This is currently handled in the parsers.

I think what you're talking about here is simply expanding that to handle more fields, right? I'm all for that - the more normalized the data is, the more fun I can have with it :)

Jon

On Fri, Jan 6, 2017, 4:26 PM Kyle Richardson <[email protected]> wrote:

You're right. I don't think it needs to account for everything. As I understand it, one of the big selling features of Avro is schema evolution.

Do you think we can take the ES templates as a starting point for developing an Avro schema? I do still think we need some type of normalization across the sensors for fields like URL, user name, and disposition. This wouldn't be specific to Avro but would allow us to better search across multiple sensor types in the UI too. Say, for example, if I have two different proxy solutions.

-Kyle

On Fri, Jan 6, 2017 at 2:28 PM, [email protected] <[email protected]> wrote:

Does it really need to account for all enrichments off the bat? I'm not familiar with these options in practice, but my research led me to believe that adding fields to an Avro schema is not a huge issue; changing or removing them is the true problem. I have no proof to substantiate that claim, however; I've just heard that question asked and read people familiar with Avro reply uniformly in that way.

My thought, based on that assumption, is that we simply need to handle the out-of-the-box enrichments and document the required schema change in our guides to creating custom enrichments.

In ES we are currently doing one template per sensor, which gives us that overlapping field name (per sensor) flexibility.

Jon

On Fri, Jan 6, 2017, 12:33 PM Kyle Richardson <[email protected]> wrote:

Thanks, Jon. Really interesting talk.

For the GitHub data set discussed (which probably most closely mimics Metron data due to the number of fields and overall diversity), Avro with Snappy compression seemed like the best balance of storage size and retrieval time. I did find it interesting that he said Parquet was originally developed for log data sets but didn't perform as well on the GitHub data.

I think our challenge is going to be the schema. Would we create a schema per sensor type and try to account for all of the possible enrichments? The problem there is that similar data may not be mapped to the same field names across sensors. We may need to think about expanding our base JSON schema beyond these 7 fields (https://cwiki.apache.org/confluence/display/METRON/Metron+JSON+Object) to account for normalizing things like URL, user name, and disposition (e.g. whether an action was allowed or denied).

Thoughts?

-Kyle

On Tue, Jan 3, 2017 at 11:30 AM, [email protected] <[email protected]> wrote:

For those interested, I ended up finding a recording of the talk itself when doing some Avro research - https://www.youtube.com/watch?v=tB28rPTvRiI

Jon

On Sun, Jan 1, 2017 at 8:41 PM Matt Foley <[email protected]> wrote:

I’m not an expert on these things, but my understanding is that Avro and ORC serve many of the same needs. The biggest difference is that ORC is columnar, and Avro isn’t. Avro, ORC, and Parquet were compared in detail at last year’s Hadoop Summit; the slideshare prezo is here: http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet

Its conclusion: “For complex tables with common strings, Avro with Snappy is a good fit. For other tables [or when applications “just need a few columns” of the tables], ORC with Zlib is a good fit.” (The addition in square brackets incorporates a quote from another part of the prezo.) But do look at the prezo, please; it gives detailed benchmarks showing when each one is better.

--Matt

On 1/1/17, 5:18 AM, "[email protected]" <[email protected]> wrote:

I don't recall a conversation on that product specifically, but I've definitely brought up the need to search HDFS from time to time. Things like Spark SQL, Hive, and Oozie have been discussed, but Avro is new to me; I'll have to look into it. Are you able to summarize its benefits?

Jon

On Wed, Dec 28, 2016, 14:45 Kyle Richardson <[email protected]> wrote:

This thread got me thinking... there are likely a fair number of use cases for searching and analyzing the output stored in HDFS. Dima's use case is certainly one. Has there been any discussion on the use of Avro to store the output in HDFS? This would likely require an expansion of the current JSON schema.

-Kyle

On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <[email protected]> wrote:

Oozie (or something like it) would appear to me to be the correct tool here. You are likely moving files around and pinning up Hive tables:

- Moving the data written in HDFS from /apps/metron/enrichment/${sensor} to another directory in HDFS
- Running a job in Hive or Pig or Spark to take the JSON blobs, map them to rows, and pin it up as an ORC table for downstream analytics

NiFi is mostly about getting data into the cluster, not really for scheduling large-scale batch ETL, I think.

Casey

On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <[email protected]> wrote:

Thank you for the reply, Carolyn.

Currently, for test purposes, we enrich flow data with Geo and ThreatIntel malware IP, but plan to expand this further.

Our dev team is working on an Oozie job to process this. So in the meantime I wonder if I could use NiFi for this purpose (because we are already using it for data ingest and streaming).

Could you elaborate on why it may be overkill? The idea is to have everything in one place instead of hacking into Metron libraries and code.

- Dima

On 12/22/2016 02:26 AM, Carolyn Duby wrote:

Hi Dima -

What type of analytics are you looking to do? Is the normalized format not working? You could use an Oozie or Spark job to create derivative tables.

NiFi may be overkill for breaking up the Kafka stream. Spark Streaming may be easier.

Thanks
Carolyn

-------- Original message --------
From: Dima Kovalyov <[email protected]>
Date: 12/21/16 6:28 PM (GMT-05:00)
To: [email protected]
Subject: Long-term storage for enriched data

Hello,

We are currently researching a fast and resource-efficient way to save enriched data in Hive for further analytics.

There are two scenarios that we are considering:

a) Use an Oozie Java job that uses the Metron enrichment classes to "manually" enrich each line of the source data picked up from the source dir (the one that we have already developed and are using). That is something we developed on our own. Downside: custom code built on top of the Metron source code.

b) Use NiFi to listen to the indexing Kafka topic -> split the stream by source type -> put every source type into a corresponding Hive table.

I wonder if anyone has gone in either of these directions and whether there are best practices for this? Please advise.
Thank you.

- Dima
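
For anyone picking this up alongside the JIRAs, below is a rough sketch of the batch step Casey describes (take the JSON blobs, map them to rows, pin them up as ORC for downstream analytics). It assumes Spark 2.x with Hive support, newline-delimited JSON under the /apps/metron/enrichment/${sensor} path from Casey's example, and a hard-coded sensor list; the table names are hypothetical, not Metron defaults.

```scala
// Rough sketch only: reads the JSON blobs Metron writes to HDFS and pins them up
// as per-sensor ORC tables in Hive. The input path follows Casey's example; the
// sensor list and table naming are placeholders, not project defaults.
import org.apache.spark.sql.SparkSession

object EnrichedJsonToOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("metron-enriched-json-to-orc")
      .enableHiveSupport()        // needed so saveAsTable lands in the Hive metastore
      .getOrCreate()

    // In practice this would be configured or discovered from the directory listing.
    val sensors = Seq("bro", "snort", "yaf")

    sensors.foreach { sensor =>
      // One JSON record per line; Spark infers the columns from the data.
      val events = spark.read.json(s"/apps/metron/enrichment/$sensor")

      // One table per sensor sidesteps the field-name normalization problem for now.
      events.write
        .mode("append")
        .format("orc")
        .saveAsTable(s"metron_enriched_$sensor")   // hypothetical table name
    }

    spark.stop()
  }
}
```

This is the kind of job an Oozie coordinator could kick off after the files are moved out of the live directory; per-sensor schema inference only works here because the tables are kept separate, which is exactly where the normalization discussion above comes in.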

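Since the benchmark's takeaway for data like the GitHub set was Avro with Snappy, here is an equally rough sketch of what the normalized-schema idea could look like on the Avro side: a record with a few of the existing normalized field names plus the proposed url, username, and disposition fields, written with the Snappy codec. The record name, field list, and nullable defaults are illustrative assumptions, not a proposed Metron schema, and Snappy support needs snappy-java on the classpath.

```scala
// Illustrative only: an Avro schema with a handful of normalized fields and the
// additions discussed in the thread (url, username, disposition), written with
// Snappy compression. Field names and defaults are assumptions, not agreed schema.
import org.apache.avro.Schema
import org.apache.avro.file.{CodecFactory, DataFileWriter}
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, GenericRecordBuilder}
import java.io.File

object AvroSnappySketch {
  // Optional fields default to null so the schema can evolve by adding more later.
  val schemaJson =
    """{
      |  "type": "record",
      |  "name": "EnrichedEvent",
      |  "namespace": "org.apache.metron.example",
      |  "fields": [
      |    {"name": "timestamp",       "type": "long"},
      |    {"name": "original_string", "type": "string"},
      |    {"name": "ip_src_addr",     "type": ["null", "string"], "default": null},
      |    {"name": "ip_dst_addr",     "type": ["null", "string"], "default": null},
      |    {"name": "url",             "type": ["null", "string"], "default": null},
      |    {"name": "username",        "type": ["null", "string"], "default": null},
      |    {"name": "disposition",     "type": ["null", "string"], "default": null}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(schemaJson)

    val record: GenericRecord = new GenericRecordBuilder(schema)
      .set("timestamp", System.currentTimeMillis())
      .set("original_string", "GET http://example.com/ 200")
      .set("ip_src_addr", "10.0.0.1")
      .set("url", "http://example.com/")
      .set("disposition", "allowed")
      .build()

    // Write one container file with the Snappy codec, as the benchmark recommends.
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.setCodec(CodecFactory.snappyCodec())
    writer.create(schema, new File("enriched-sample.avro"))
    writer.append(record)
    writer.close()
  }
}
```

The evolution property discussed above falls out of the defaults: appending another optional field later is a compatible change, while renaming or removing fields is what breaks existing readers.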