Yep. Exactly. Ok, cool. I'll file a couple of JIRAs to get the ball rolling.
-Kyle

On Jan 6, 2017, at 5:21 PM, [email protected] <[email protected]> wrote:

I think we can use the ES templates to start off, as Avro (HDFS) and ES should be in sync. There already is some normalization in place for field names (ip_src_addr, etc.) so that enrichment can work across any log type. This is currently handled in the parsers.

I think what you're talking about here is simply expanding that to handle more fields, right? I'm all for that - the more normalized the data is, the more fun I can have with it :)

Jon

On Fri, Jan 6, 2017, 4:26 PM Kyle Richardson <[email protected]> wrote:

You're right. I don't think it needs to account for everything. As I understand it, one of the big selling features of Avro is schema evolution.

Do you think we can take the ES templates as a starting point for developing an Avro schema? I do still think we need some type of normalization across the sensors for fields like URL, user name, and disposition. This wouldn't be specific to Avro but would allow us to better search across multiple sensor types in the UI too. Say, for example, if I have two different proxy solutions.

-Kyle

On Fri, Jan 6, 2017 at 2:28 PM, [email protected] <[email protected]> wrote:

Does it really need to account for all enrichments off the bat? I'm not familiar with these options in practice, but my research led me to believe that adding fields to an Avro schema is not a huge issue; changing or removing them is the true problem. I have no proof to substantiate that claim, however; I've just heard that question asked and read people familiar with Avro reply uniformly in that way.

My thought, based on that assumption, is that we simply need to handle the out-of-the-box enrichments and document the required schema change in our guides to creating custom enrichments.

In ES we are currently doing one template per sensor, which gives us that overlapping field name (per sensor) flexibility.

Jon

On Fri, Jan 6, 2017, 12:33 PM Kyle Richardson <[email protected]> wrote:

Thanks, Jon. Really interesting talk.

For the GitHub data set discussed (which probably most closely mimics Metron data due to the number of fields and overall diversity), Avro with Snappy compression seemed like the best balance of storage size and retrieval time. I did find it interesting that he said Parquet was originally developed for log data sets but didn't perform as well on the GitHub data.

I think our challenge is going to be the schema. Would we create a schema per sensor type and try to account for all of the possible enrichments? The problem there is that similar data may not be mapped to the same field names across sensors. We may need to think about expanding our base JSON schema beyond these 7 fields (https://cwiki.apache.org/confluence/display/METRON/Metron+JSON+Object) to account for normalizing things like URL, user name, and disposition (e.g. whether an action was allowed or denied).

Thoughts?

-Kyle

On Tue, Jan 3, 2017 at 11:30 AM, [email protected] <[email protected]> wrote:

For those interested, I ended up finding a recording of the talk itself when doing some Avro research - https://www.youtube.com/watch?v=tB28rPTvRiI

Jon

On Sun, Jan 1, 2017 at 8:41 PM Matt Foley <[email protected]> wrote:

I’m not an expert on these things, but my understanding is that Avro and ORC serve many of the same needs. The biggest difference is that ORC is columnar, and Avro isn’t. Avro, ORC, and Parquet were compared in detail at last year’s Hadoop Summit; the slideshare prezo is here: http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet

Its conclusion: “For complex tables with common strings, Avro with Snappy is a good fit. For other tables [or when applications “just need a few columns” of the tables], ORC with Zlib is a good fit.” (The addition in square brackets incorporates a quote from another part of the prezo.) But do look at the prezo, please; it gives detailed benchmarks showing when each one is better.

--Matt

On 1/1/17, 5:18 AM, "[email protected]" <[email protected]> wrote:

I don't recall a conversation on that product specifically, but I've definitely brought up the need to search HDFS from time to time. Things like Spark SQL, Hive, and Oozie have been discussed, but Avro is new to me; I'll have to look into it. Are you able to summarize its benefits?

Jon

On Wed, Dec 28, 2016, 14:45 Kyle Richardson <[email protected]> wrote:

This thread got me thinking... there are likely a fair number of use cases for searching and analyzing the output stored in HDFS. Dima's use case is certainly one. Has there been any discussion on the use of Avro to store the output in HDFS? This would likely require an expansion of the current JSON schema.

-Kyle

On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <[email protected]> wrote:

Oozie (or something like it) would appear to me to be the correct tool here. You are likely moving files around and pinning up Hive tables:

- Moving the data written in HDFS from /apps/metron/enrichment/${sensor} to another directory in HDFS
- Running a job in Hive or Pig or Spark to take the JSON blobs, map them to rows, and pin it up as an ORC table for downstream analytics

NiFi is mostly about getting data into the cluster, not really for scheduling large-scale batch ETL, I think.

Casey

On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <[email protected]> wrote:

Thank you for the reply, Carolyn.

Currently, for test purposes, we enrich flow data with Geo and ThreatIntel malware IP, but plan to expand this further.

Our dev team is working on an Oozie job to process this. So in the meantime I wonder if I could use NiFi for this purpose (because we are already using it for data ingest and streaming).

Could you elaborate on why it may be overkill? The idea is to have everything in one place instead of hacking into Metron libraries and code.

- Dima

On 12/22/2016 02:26 AM, Carolyn Duby wrote:

Hi Dima -

What type of analytics are you looking to do? Is the normalized format not working? You could use an Oozie or Spark job to create derivative tables.

NiFi may be overkill for breaking up the Kafka stream. Spark Streaming may be easier.

Thanks
Carolyn

-------- Original message --------
From: Dima Kovalyov <[email protected]>
Date: 12/21/16 6:28 PM (GMT-05:00)
To: [email protected]
Subject: Long-term storage for enriched data

Hello,

We are currently researching a fast and resource-efficient way to save enriched data in Hive for further analytics.

There are two scenarios that we are considering:

a) Use an Oozie Java job that uses the Metron enrichment classes to "manually" enrich each line of the source data picked up from the source dir (the one that we have already developed and are using). That is something we developed on our own. Downside: custom code built on top of the Metron source code.

b) Use NiFi to listen to the indexing Kafka topic -> split the stream by source type -> put every source type into a corresponding Hive table.

I wonder if anyone has gone in either of these directions and whether there are best practices for this? Please advise.
Thank you.

- Dima
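
For anyone picking this up alongside the JIRAs, below is a rough sketch of the batch step Casey describes (take the JSON blobs, map them to rows, pin them up as ORC for downstream analytics). It assumes Spark 2.x with Hive support, newline-delimited JSON under the /apps/metron/enrichment/${sensor} path from Casey's example, and a hard-coded sensor list; the table names are hypothetical, not Metron defaults.

```scala
// Rough sketch only: reads the JSON blobs Metron writes to HDFS and pins them up
// as per-sensor ORC tables in Hive. The input path follows Casey's example; the
// sensor list and table naming are placeholders, not project defaults.
import org.apache.spark.sql.SparkSession

object EnrichedJsonToOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("metron-enriched-json-to-orc")
      .enableHiveSupport()        // needed so saveAsTable lands in the Hive metastore
      .getOrCreate()

    // In practice this would be configured or discovered from the directory listing.
    val sensors = Seq("bro", "snort", "yaf")

    sensors.foreach { sensor =>
      // One JSON record per line; Spark infers the columns from the data.
      val events = spark.read.json(s"/apps/metron/enrichment/$sensor")

      // One table per sensor sidesteps the field-name normalization problem for now.
      events.write
        .mode("append")
        .format("orc")
        .saveAsTable(s"metron_enriched_$sensor")   // hypothetical table name
    }

    spark.stop()
  }
}
```

This is the kind of job an Oozie coordinator could kick off after the files are moved out of the live directory; per-sensor schema inference only works here because the tables are kept separate, which is exactly where the normalization discussion above comes in.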

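Since the benchmark's takeaway for data like the GitHub set was Avro with Snappy, here is an equally rough sketch of what the normalized-schema idea could look like on the Avro side: a record with a few of the existing normalized field names plus the proposed url, username, and disposition fields, written with the Snappy codec. The record name, field list, and nullable defaults are illustrative assumptions, not a proposed Metron schema, and Snappy support needs snappy-java on the classpath.

```scala
// Illustrative only: an Avro schema with a handful of normalized fields and the
// additions discussed in the thread (url, username, disposition), written with
// Snappy compression. Field names and defaults are assumptions, not agreed schema.
import org.apache.avro.Schema
import org.apache.avro.file.{CodecFactory, DataFileWriter}
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, GenericRecordBuilder}
import java.io.File

object AvroSnappySketch {
  // Optional fields default to null so the schema can evolve by adding more later.
  val schemaJson =
    """{
      |  "type": "record",
      |  "name": "EnrichedEvent",
      |  "namespace": "org.apache.metron.example",
      |  "fields": [
      |    {"name": "timestamp",       "type": "long"},
      |    {"name": "original_string", "type": "string"},
      |    {"name": "ip_src_addr",     "type": ["null", "string"], "default": null},
      |    {"name": "ip_dst_addr",     "type": ["null", "string"], "default": null},
      |    {"name": "url",             "type": ["null", "string"], "default": null},
      |    {"name": "username",        "type": ["null", "string"], "default": null},
      |    {"name": "disposition",     "type": ["null", "string"], "default": null}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(schemaJson)

    val record: GenericRecord = new GenericRecordBuilder(schema)
      .set("timestamp", System.currentTimeMillis())
      .set("original_string", "GET http://example.com/ 200")
      .set("ip_src_addr", "10.0.0.1")
      .set("url", "http://example.com/")
      .set("disposition", "allowed")
      .build()

    // Write one container file with the Snappy codec, as the benchmark recommends.
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.setCodec(CodecFactory.snappyCodec())
    writer.create(schema, new File("enriched-sample.avro"))
    writer.append(record)
    writer.close()
  }
}
```

The evolution property discussed above falls out of the defaults: appending another optional field later is a compatible change, while renaming or removing fields is what breaks existing readers.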