For those interested, I ended up finding a recording of the talk itself when doing some Avro research - https://www.youtube.com/watch?v=tB28rPTvRiI
Jon

On Sun, Jan 1, 2017 at 8:41 PM Matt Foley <[email protected]> wrote:

> I’m not an expert on these things, but my understanding is that Avro and
> ORC serve many of the same needs. The biggest difference is that ORC is
> columnar, and Avro isn’t. Avro, ORC, and Parquet were compared in detail
> at last year’s Hadoop Summit; the slideshare prezo is here:
> http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
>
> Its conclusion: “For complex tables with common strings, Avro with Snappy
> is a good fit. For other tables [or when applications “just need a few
> columns” of the tables], ORC with Zlib is a good fit.” (The addition in
> square brackets incorporates a quote from another part of the prezo.) But
> please do look at the prezo; it gives detailed benchmarks showing when
> each one is better.
>
> --Matt
>
> On 1/1/17, 5:18 AM, "[email protected]" <[email protected]> wrote:
>
> I don't recall a conversation on that product specifically, but I've
> definitely brought up the need to search HDFS from time to time. Things
> like Spark SQL, Hive, and Oozie have been discussed, but Avro is new to
> me; I'll have to look into it. Are you able to summarize its benefits?
>
> Jon
>
> On Wed, Dec 28, 2016, 14:45 Kyle Richardson <[email protected]>
> wrote:
>
> > This thread got me thinking... there are likely a fair number of use
> > cases for searching and analyzing the output stored in HDFS. Dima's
> > use case is certainly one. Has there been any discussion on the use
> > of Avro to store the output in HDFS? This would likely require an
> > expansion of the current json schema.
> >
> > -Kyle
> >
> > On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <[email protected]> wrote:
> >
> > > Oozie (or something like it) would appear to me to be the correct
> > > tool here. You are likely moving files around and pinning up Hive
> > > tables:
> > >
> > > - Moving the data written in HDFS from /apps/metron/enrichment/${sensor}
> > >   to another directory in HDFS
> > > - Running a job in Hive or Pig or Spark to take the JSON blobs, map
> > >   them to rows, and pin the result up as an ORC table for downstream
> > >   analytics
> > >
> > > NiFi is mostly about getting data into the cluster, not really for
> > > scheduling large-scale batch ETL, I think.
> > >
> > > Casey
> > >
> > > On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <[email protected]>
> > > wrote:
> > >
> > > > Thank you for the reply, Carolyn.
> > > >
> > > > Currently, for test purposes, we enrich flow data with Geo and
> > > > ThreatIntel malware IPs, but we plan to expand this further.
> > > >
> > > > Our dev team is working on an Oozie job to process this, so in the
> > > > meantime I wonder if I could use NiFi for this purpose (because we
> > > > are already using it for data ingest and streaming).
> > > >
> > > > Could you elaborate on why it may be overkill? The idea is to have
> > > > everything in one place instead of hacking into Metron libraries
> > > > and code.
> > > >
> > > > - Dima
> > > >
> > > > On 12/22/2016 02:26 AM, Carolyn Duby wrote:
> > > > > Hi Dima -
> > > > >
> > > > > What type of analytics are you looking to do? Is the normalized
> > > > > format not working? You could use an Oozie or Spark job to
> > > > > create derivative tables.
> > > > >
> > > > > NiFi may be overkill for breaking up the Kafka stream. Spark
> > > > > streaming may be easier.
> > > > >
> > > > > Thanks
> > > > > Carolyn
> > > > >
> > > > > Sent from my Verizon, Samsung Galaxy smartphone
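(For what it's worth, a minimal PySpark sketch of the Spark streaming route
suggested above could look like the following. The broker address, topic
name, output paths, and the exact name of the sensor-type field are all
assumptions, not Metron defaults, and it needs the spark-sql-kafka package
on the classpath.)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("indexing-to-hdfs").getOrCreate()

# Read the enriched JSON blobs off the indexing topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:6667")  # assumed broker
       .option("subscribe", "indexing")                    # assumed topic name
       .load())

# Parse just enough of each blob to route on the sensor type.
routing_schema = StructType([StructField("source.type", StringType())])
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .withColumn("source_type",
                      from_json(col("json"), routing_schema)
                      .getField("source.type")))  # field name assumed

# Land one directory per source type; Hive external tables can sit on top.
query = (parsed.writeStream
         .format("orc")
         .option("path", "/apps/metron/enriched_orc")               # assumed
         .option("checkpointLocation", "/apps/metron/checkpoints")  # assumed
         .partitionBy("source_type")
         .start())
query.awaitTermination()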
> > > > >
> > > > > -------- Original message --------
> > > > > From: Dima Kovalyov <[email protected]>
> > > > > Date: 12/21/16 6:28 PM (GMT-05:00)
> > > > > To: [email protected]
> > > > > Subject: Long-term storage for enriched data
> > > > >
> > > > > Hello,
> > > > >
> > > > > We are currently researching a fast and resource-efficient way
> > > > > to save enriched data in Hive for further analytics.
> > > > >
> > > > > There are two scenarios that we are considering:
> > > > > a) Use an Oozie Java job that uses Metron enrichment classes to
> > > > > "manually" enrich each line of the source data picked up from
> > > > > the source dir (the one that we have already developed and are
> > > > > using). That is something we developed on our own. Downside:
> > > > > custom code built on top of Metron source code.
> > > > >
> > > > > b) Use NiFi to listen to the indexing Kafka topic -> split the
> > > > > stream by source type -> put every source type in a
> > > > > corresponding Hive table.
> > > > >
> > > > > I wonder if someone has gone in either of these directions and
> > > > > whether there are best practices for this? Please advise.
> > > > > Thank you.
> > > > >
> > > > > - Dima
> > > > >
> >
> > --
> > Jon
> > Sent from my mobile device

--
Jon
Sent from my mobile device
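For the batch route Casey outlines above, a rough PySpark equivalent of
the JSON-to-ORC step (with the sensor name, paths, and table name as
assumptions) is:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metron-json-to-orc")
         .enableHiveSupport()
         .getOrCreate())

sensor = "bro"  # hypothetical sensor
df = spark.read.json("/apps/metron/enrichment/" + sensor)  # columns inferred

(df.write
   .mode("overwrite")
   .format("orc")
   .option("compression", "zlib")  # ORC with Zlib, per the benchmark Matt cites
   .saveAsTable("metron_" + sensor + "_enriched"))

Scheduled from an Oozie Spark action after the HDFS move, something like
this keeps the pipeline in stock Spark rather than custom code on top of
Metron internals.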

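And if Kyle's Avro idea wins out, the same data can be landed as Avro with
Snappy instead. This sketch assumes the spark-avro package
(com.databricks:spark-avro in this era of Spark) is on the classpath, and
the paths are again made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metron-json-to-avro").getOrCreate()

df = spark.read.json("/apps/metron/enrichment/bro")  # hypothetical sensor dir

# Avro with Snappy: the combination the benchmark favors for complex
# tables with common strings.
spark.conf.set("spark.sql.avro.compression.codec", "snappy")
(df.write
   .mode("overwrite")
   .format("com.databricks.spark.avro")
   .save("/apps/metron/enriched_avro/bro"))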