Right, I second that. That was kind of my intent with my initial question (although I did a bad job of making it clear): Metron-specific benefits/details of Avro use.
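For concreteness, here is a minimal sketch of what "the data and the schema travel together" looks like in practice. It assumes the Python fastavro library, and the record layout is a hypothetical Metron-style enriched event, not an actual Metron schema:

    from fastavro import writer, reader, parse_schema

    # Hypothetical Metron-style enriched event; optional fields with
    # defaults are what let the schema evolve as new enrichments appear.
    schema = parse_schema({
        "type": "record",
        "name": "EnrichedEvent",
        "fields": [
            {"name": "source_type", "type": "string"},
            {"name": "ip_src_addr", "type": "string"},
            {"name": "geo_country", "type": ["null", "string"], "default": None},
        ],
    })

    records = [{"source_type": "bro", "ip_src_addr": "10.0.0.1",
                "geo_country": "US"}]

    # The schema is written into the file header alongside the data...
    with open("enriched.avro", "wb") as out:
        writer(out, schema, records)

    # ...so any reader can decode the file with no external schema.
    with open("enriched.avro", "rb") as fin:
        for rec in reader(fin):
            print(rec)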
Sounds like it may make sense for someone to throw together a proposal doc? Not volunteering, though =)

Jon

On Tue, Jan 3, 2017 at 10:07 AM Otto Fowler <[email protected]> wrote:

> What I would like to see is something on using Avro with a non-static
> model, such as Metron would require should new enrichments,
> threat-intelligence feeds, Stellar capabilities, or source changes
> arrive.
>
> On January 2, 2017 at 11:41:32, Carolyn Duby ([email protected])
> wrote:
>
> Avro is a format that contains both the data and the schema. Here is a
> quick summary:
>
> https://avro.apache.org/docs/current/
>
> Thanks
> Carolyn
>
> On 1/1/17, 8:41 PM, "Matt Foley" <[email protected]> wrote:
>
> >I’m not an expert on these things, but my understanding is that Avro
> >and ORC serve many of the same needs. The biggest difference is that
> >ORC is columnar and Avro isn’t. Avro, ORC, and Parquet were compared in
> >detail at last year’s Hadoop Summit; the SlideShare prezo is here:
> >http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
> >
> >Its conclusion: “For complex tables with common strings, Avro with
> >Snappy is a good fit. For other tables [or when applications “just need
> >a few columns” of the tables], ORC with Zlib is a good fit.” (The
> >addition in square brackets incorporates a quote from another part of
> >the prezo.) But please do look at the prezo; it gives detailed
> >benchmarks showing when each one is better.
> >
> >--Matt
> >
> >On 1/1/17, 5:18 AM, "[email protected]" <[email protected]> wrote:
> >
> > I don't recall a conversation on that product specifically, but I've
> > definitely brought up the need to search HDFS from time to time.
> > Things like Spark SQL, Hive, and Oozie have been discussed, but Avro
> > is new to me; I'll have to look into it. Are you able to summarize its
> > benefits?
> >
> > Jon
> >
> > On Wed, Dec 28, 2016, 14:45 Kyle Richardson <[email protected]>
> > wrote:
> >
> > > This thread got me thinking... there are likely a fair number of use
> > > cases for searching and analyzing the output stored in HDFS. Dima's
> > > use case is certainly one. Has there been any discussion on the use
> > > of Avro to store the output in HDFS? This would likely require an
> > > expansion of the current JSON schema.
> > >
> > > -Kyle
> > >
> > > On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <[email protected]>
> > > wrote:
> > >
> > > > Oozie (or something like it) would appear to me to be the correct
> > > > tool here. You are likely moving files around and spinning up Hive
> > > > tables:
> > > >
> > > > - Moving the data written in HDFS from
> > > > /apps/metron/enrichment/${sensor} to another directory in HDFS
> > > > - Running a job in Hive, Pig, or Spark to take the JSON blobs, map
> > > > them to rows, and stand the result up as an ORC table for
> > > > downstream analytics
> > > >
> > > > NiFi is mostly about getting data into the cluster, not really for
> > > > scheduling large-scale batch ETL, I think.
> > > >
> > > > Casey
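A minimal sketch of the second step Casey describes above (JSON blobs to an ORC-backed Hive table), assuming Spark 2.x with Hive support; the input path and table name here are hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("metron-json-to-orc")
             .enableHiveSupport()
             .getOrCreate())

    # Each file under the sensor directory holds enriched JSON records;
    # spark.read.json infers the columns from the blobs themselves.
    events = spark.read.json("hdfs:///apps/metron/enrichment/indexed/bro")

    # Persist as ORC and register a Hive table for downstream analytics
    # (e.g. Hive or Spark SQL queries).
    (events.write
           .format("orc")
           .mode("overwrite")
           .saveAsTable("metron.bro_enriched"))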
> > > > On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <[email protected]>
> > > > wrote:
> > > >
> > > > > Thank you for the reply, Carolyn.
> > > > >
> > > > > Currently, for test purposes, we enrich flow data with GeoIP and
> > > > > ThreatIntel malware-IP data, but we plan to expand this further.
> > > > >
> > > > > Our dev team is working on an Oozie job to process this.
> > > > > Meanwhile, I wonder if I could use NiFi for this purpose (because
> > > > > we are already using it for data ingest and streaming).
> > > > >
> > > > > Could you elaborate on why it may be overkill? The idea is to
> > > > > have everything in one place instead of hacking into Metron
> > > > > libraries and code.
> > > > >
> > > > > - Dima
> > > > >
> > > > > On 12/22/2016 02:26 AM, Carolyn Duby wrote:
> > > > > > Hi Dima -
> > > > > >
> > > > > > What type of analytics are you looking to do? Is the normalized
> > > > > > format not working? You could use an Oozie or Spark job to
> > > > > > create derivative tables.
> > > > > >
> > > > > > NiFi may be overkill for breaking up the Kafka stream. Spark
> > > > > > Streaming may be easier.
> > > > > >
> > > > > > Thanks
> > > > > > Carolyn
> > > > > >
> > > > > > Sent from my Verizon, Samsung Galaxy smartphone
> > > > > >
> > > > > > -------- Original message --------
> > > > > > From: Dima Kovalyov <[email protected]>
> > > > > > Date: 12/21/16 6:28 PM (GMT-05:00)
> > > > > > To: [email protected]
> > > > > > Subject: Long-term storage for enriched data
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Currently we are researching a fast and resource-efficient way
> > > > > > to save enriched data in Hive for further analytics.
> > > > > >
> > > > > > There are two scenarios that we are considering:
> > > > > >
> > > > > > a) Use an Oozie Java job that calls Metron enrichment classes
> > > > > > to "manually" enrich each line of the source data picked up
> > > > > > from the source dir (the one we have already developed and are
> > > > > > using). Downside: custom code built on top of the Metron source
> > > > > > code.
> > > > > >
> > > > > > b) Use NiFi to listen to the indexing Kafka topic -> split the
> > > > > > stream by source type -> put every source type into a
> > > > > > corresponding Hive table.
> > > > > >
> > > > > > I wonder if anyone has gone in either of these directions and
> > > > > > whether there are best practices for this? Please advise.
> > > > > > Thank you.
> > > > > >
> > > > > > - Dima
> >
> > --
> > Jon
> > Sent from my mobile device

--
Jon
Sent from my mobile device
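For anyone weighing scenario (b) above, a minimal sketch of the Spark Streaming alternative Carolyn mentions: read the indexing Kafka topic, split by source type, and land the records where Hive tables can pick them up. This assumes Spark 2.1+ structured streaming with the spark-sql-kafka package; the broker address, topic name, JSON field, and output paths are all assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, get_json_object

    spark = (SparkSession.builder
             .appName("metron-indexing-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "kafka1:6667")
           .option("subscribe", "indexing")
           .load())

    # Assumes the enriched JSON carries a flat "source.type" key
    # (Metron's convention); pull it out so the sink can split on it.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .withColumn("source_type",
                          get_json_object(col("json"), "$['source.type']")))

    # One directory per source type; an external Hive table per sensor
    # (or one partitioned table) can then read each directory.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///apps/metron/streamed")
             .option("checkpointLocation", "hdfs:///apps/metron/checkpoints")
             .partitionBy("source_type")
             .start())

    query.awaitTermination()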
