For those interested, I ended up finding a recording of the talk itself when doing some Avro research - https://www.youtube.com/watch?v=tB28rPTvRiI
Jon

On Sun, Jan 1, 2017 at 8:41 PM Matt Foley <[email protected]> wrote:

> I’m not an expert on these things, but my understanding is that Avro and
> ORC serve many of the same needs. The biggest difference is that ORC is
> columnar, and Avro isn’t. Avro, ORC, and Parquet were compared in detail
> at last year’s Hadoop Summit; the slideshare prezo is here:
> http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
>
> Its conclusion: “For complex tables with common strings, Avro with Snappy
> is a good fit. For other tables [or when applications “just need a few
> columns” of the tables], ORC with Zlib is a good fit.” (The addition in
> square brackets incorporates a quote from another part of the prezo.) But
> please do look at the prezo; it gives detailed benchmarks showing when
> each one is better.
>
> --Matt
>
> On 1/1/17, 5:18 AM, "[email protected]" <[email protected]> wrote:
>
> I don't recall a conversation on that product specifically, but I've
> definitely brought up the need to search HDFS from time to time. Things
> like Spark SQL, Hive, and Oozie have been discussed, but Avro is new to
> me; I'll have to look into it. Are you able to summarize its benefits?
>
> Jon
>
> On Wed, Dec 28, 2016, 14:45 Kyle Richardson <[email protected]>
> wrote:
>
> > This thread got me thinking... there are likely a fair number of use
> > cases for searching and analyzing the output stored in HDFS. Dima's
> > use case is certainly one. Has there been any discussion on the use
> > of Avro to store the output in HDFS? This would likely require an
> > expansion of the current json schema.
> >
> > -Kyle
> >
> > On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <[email protected]> wrote:
> >
> > > Oozie (or something like it) would appear to me to be the correct
> > > tool here. You are likely moving files around and pinning up Hive
> > > tables:
> > >
> > > - Moving the data written in HDFS from /apps/metron/enrichment/${sensor}
> > >   to another directory in HDFS
> > > - Running a job in Hive or Pig or Spark to take the JSON blobs, map
> > >   them to rows, and pin the result up as an ORC table for downstream
> > >   analytics
> > >
> > > NiFi is mostly about getting data into the cluster, not really for
> > > scheduling large-scale batch ETL, I think.
> > >
> > > Casey
> > >
> > > On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <[email protected]>
> > > wrote:
> > >
> > > > Thank you for the reply, Carolyn.
> > > >
> > > > Currently, for test purposes, we enrich flow data with Geo and
> > > > ThreatIntel malware IPs, but we plan to expand this further.
> > > >
> > > > Our dev team is working on an Oozie job to process this, so in the
> > > > meantime I wonder if I could use NiFi for this purpose (because we
> > > > are already using it for data ingest and streaming).
> > > >
> > > > Could you elaborate on why it may be overkill? The idea is to have
> > > > everything in one place instead of hacking into Metron libraries
> > > > and code.
> > > >
> > > > - Dima
> > > >
> > > > On 12/22/2016 02:26 AM, Carolyn Duby wrote:
> > > > > Hi Dima -
> > > > >
> > > > > What type of analytics are you looking to do? Is the normalized
> > > > > format not working? You could use an Oozie or Spark job to
> > > > > create derivative tables.
> > > > >
> > > > > NiFi may be overkill for breaking up the Kafka stream. Spark
> > > > > streaming may be easier.
> > > > >
> > > > > Thanks
> > > > > Carolyn
> > > > >
> > > > > Sent from my Verizon, Samsung Galaxy smartphone
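(For what it's worth, a minimal PySpark sketch of the Spark streaming route
suggested above could look like the following. The broker address, topic
name, output paths, and the exact name of the sensor-type field are all
assumptions, not Metron defaults, and it needs the spark-sql-kafka package
on the classpath.)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("indexing-to-hdfs").getOrCreate()

# Read the enriched JSON blobs off the indexing topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:6667")  # assumed broker
       .option("subscribe", "indexing")                    # assumed topic name
       .load())

# Parse just enough of each blob to route on the sensor type.
routing_schema = StructType([StructField("source.type", StringType())])
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .withColumn("source_type",
                      from_json(col("json"), routing_schema)
                      .getField("source.type")))  # field name assumed

# Land one directory per source type; Hive external tables can sit on top.
query = (parsed.writeStream
         .format("orc")
         .option("path", "/apps/metron/enriched_orc")               # assumed
         .option("checkpointLocation", "/apps/metron/checkpoints")  # assumed
         .partitionBy("source_type")
         .start())
query.awaitTermination()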
> > > > >
> > > > > -------- Original message --------
> > > > > From: Dima Kovalyov <[email protected]>
> > > > > Date: 12/21/16 6:28 PM (GMT-05:00)
> > > > > To: [email protected]
> > > > > Subject: Long-term storage for enriched data
> > > > >
> > > > > Hello,
> > > > >
> > > > > We are currently researching a fast and resource-efficient way
> > > > > to save enriched data in Hive for further analytics.
> > > > >
> > > > > There are two scenarios that we are considering:
> > > > > a) Use an Oozie Java job that uses Metron enrichment classes to
> > > > > "manually" enrich each line of the source data picked up from
> > > > > the source dir (the one that we have already developed and are
> > > > > using). That is something we developed on our own. Downside:
> > > > > custom code built on top of Metron source code.
> > > > >
> > > > > b) Use NiFi to listen to the indexing Kafka topic -> split the
> > > > > stream by source type -> put every source type in a
> > > > > corresponding Hive table.
> > > > >
> > > > > I wonder if someone has gone in either of these directions and
> > > > > whether there are best practices for this? Please advise.
> > > > > Thank you.
> > > > >
> > > > > - Dima
> > > > >
> >
> > --
> > Jon
> > Sent from my mobile device

--
Jon
Sent from my mobile device
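For the batch route Casey outlines above, a rough PySpark equivalent of
the JSON-to-ORC step (with the sensor name, paths, and table name as
assumptions) is:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metron-json-to-orc")
         .enableHiveSupport()
         .getOrCreate())

sensor = "bro"  # hypothetical sensor
df = spark.read.json("/apps/metron/enrichment/" + sensor)  # columns inferred

(df.write
   .mode("overwrite")
   .format("orc")
   .option("compression", "zlib")  # ORC with Zlib, per the benchmark Matt cites
   .saveAsTable("metron_" + sensor + "_enriched"))

Scheduled from an Oozie Spark action after the HDFS move, something like
this keeps the pipeline in stock Spark rather than custom code on top of
Metron internals.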

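And if Kyle's Avro idea wins out, the same data can be landed as Avro with
Snappy instead. This sketch assumes the spark-avro package
(com.databricks:spark-avro in this era of Spark) is on the classpath, and
the paths are again made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metron-json-to-avro").getOrCreate()

df = spark.read.json("/apps/metron/enrichment/bro")  # hypothetical sensor dir

# Avro with Snappy: the combination the benchmark favors for complex
# tables with common strings.
spark.conf.set("spark.sql.avro.compression.codec", "snappy")
(df.write
   .mode("overwrite")
   .format("com.databricks.spark.avro")
   .save("/apps/metron/enriched_avro/bro"))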