Right, I second that. That was kind of my intent with my initial question (although I did a bad job of making it clear): Metron-specific benefits/details of Avro use.
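For concreteness, here is a minimal sketch of what "the data and the schema travel together" looks like in practice. It assumes the Python fastavro library, and the record layout is a hypothetical Metron-style enriched event, not an actual Metron schema:

    from fastavro import writer, reader, parse_schema

    # Hypothetical Metron-style enriched event; optional fields with
    # defaults are what let the schema evolve as new enrichments appear.
    schema = parse_schema({
        "type": "record",
        "name": "EnrichedEvent",
        "fields": [
            {"name": "source_type", "type": "string"},
            {"name": "ip_src_addr", "type": "string"},
            {"name": "geo_country", "type": ["null", "string"], "default": None},
        ],
    })

    records = [{"source_type": "bro", "ip_src_addr": "10.0.0.1",
                "geo_country": "US"}]

    # The schema is written into the file header alongside the data...
    with open("enriched.avro", "wb") as out:
        writer(out, schema, records)

    # ...so any reader can decode the file with no external schema.
    with open("enriched.avro", "rb") as fin:
        for rec in reader(fin):
            print(rec)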
Sounds like it may make sense for someone to throw together a proposal doc? Not volunteering, though =)

Jon

On Tue, Jan 3, 2017 at 10:07 AM Otto Fowler <[email protected]> wrote:

> What I would like to see is something on using Avro with a non-static
> model, such as Metron would require should new enrichments,
> threat-intelligence feeds, Stellar capabilities, or source changes
> arrive.
>
> On January 2, 2017 at 11:41:32, Carolyn Duby ([email protected])
> wrote:
>
> Avro is a format that contains both the data and the schema. Here is a
> quick summary:
>
> https://avro.apache.org/docs/current/
>
> Thanks
> Carolyn
>
> On 1/1/17, 8:41 PM, "Matt Foley" <[email protected]> wrote:
>
> >I’m not an expert on these things, but my understanding is that Avro
> >and ORC serve many of the same needs. The biggest difference is that
> >ORC is columnar and Avro isn’t. Avro, ORC, and Parquet were compared in
> >detail at last year’s Hadoop Summit; the SlideShare prezo is here:
> >http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
> >
> >Its conclusion: “For complex tables with common strings, Avro with
> >Snappy is a good fit. For other tables [or when applications “just need
> >a few columns” of the tables], ORC with Zlib is a good fit.” (The
> >addition in square brackets incorporates a quote from another part of
> >the prezo.) But please do look at the prezo; it gives detailed
> >benchmarks showing when each one is better.
> >
> >--Matt
> >
> >On 1/1/17, 5:18 AM, "[email protected]" <[email protected]> wrote:
> >
> > I don't recall a conversation on that product specifically, but I've
> > definitely brought up the need to search HDFS from time to time.
> > Things like Spark SQL, Hive, and Oozie have been discussed, but Avro
> > is new to me; I'll have to look into it. Are you able to summarize its
> > benefits?
> >
> > Jon
> >
> > On Wed, Dec 28, 2016, 14:45 Kyle Richardson <[email protected]>
> > wrote:
> >
> > > This thread got me thinking... there are likely a fair number of use
> > > cases for searching and analyzing the output stored in HDFS. Dima's
> > > use case is certainly one. Has there been any discussion on the use
> > > of Avro to store the output in HDFS? This would likely require an
> > > expansion of the current JSON schema.
> > >
> > > -Kyle
> > >
> > > On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <[email protected]>
> > > wrote:
> > >
> > > > Oozie (or something like it) would appear to me to be the correct
> > > > tool here. You are likely moving files around and spinning up Hive
> > > > tables:
> > > >
> > > > - Moving the data written in HDFS from
> > > > /apps/metron/enrichment/${sensor} to another directory in HDFS
> > > > - Running a job in Hive, Pig, or Spark to take the JSON blobs, map
> > > > them to rows, and stand the result up as an ORC table for
> > > > downstream analytics
> > > >
> > > > NiFi is mostly about getting data into the cluster, not really for
> > > > scheduling large-scale batch ETL, I think.
> > > >
> > > > Casey
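A minimal sketch of the second step Casey describes above (JSON blobs to an ORC-backed Hive table), assuming Spark 2.x with Hive support; the input path and table name here are hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("metron-json-to-orc")
             .enableHiveSupport()
             .getOrCreate())

    # Each file under the sensor directory holds enriched JSON records;
    # spark.read.json infers the columns from the blobs themselves.
    events = spark.read.json("hdfs:///apps/metron/enrichment/indexed/bro")

    # Persist as ORC and register a Hive table for downstream analytics
    # (e.g. Hive or Spark SQL queries).
    (events.write
           .format("orc")
           .mode("overwrite")
           .saveAsTable("metron.bro_enriched"))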
> > > > On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <[email protected]>
> > > > wrote:
> > > >
> > > > > Thank you for the reply, Carolyn.
> > > > >
> > > > > Currently, for test purposes, we enrich flow data with GeoIP and
> > > > > ThreatIntel malware-IP data, but we plan to expand this further.
> > > > >
> > > > > Our dev team is working on an Oozie job to process this.
> > > > > Meanwhile, I wonder if I could use NiFi for this purpose (because
> > > > > we are already using it for data ingest and streaming).
> > > > >
> > > > > Could you elaborate on why it may be overkill? The idea is to
> > > > > have everything in one place instead of hacking into Metron
> > > > > libraries and code.
> > > > >
> > > > > - Dima
> > > > >
> > > > > On 12/22/2016 02:26 AM, Carolyn Duby wrote:
> > > > > > Hi Dima -
> > > > > >
> > > > > > What type of analytics are you looking to do? Is the normalized
> > > > > > format not working? You could use an Oozie or Spark job to
> > > > > > create derivative tables.
> > > > > >
> > > > > > NiFi may be overkill for breaking up the Kafka stream. Spark
> > > > > > Streaming may be easier.
> > > > > >
> > > > > > Thanks
> > > > > > Carolyn
> > > > > >
> > > > > > Sent from my Verizon, Samsung Galaxy smartphone
> > > > > >
> > > > > > -------- Original message --------
> > > > > > From: Dima Kovalyov <[email protected]>
> > > > > > Date: 12/21/16 6:28 PM (GMT-05:00)
> > > > > > To: [email protected]
> > > > > > Subject: Long-term storage for enriched data
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Currently we are researching a fast and resource-efficient way
> > > > > > to save enriched data in Hive for further analytics.
> > > > > >
> > > > > > There are two scenarios that we are considering:
> > > > > >
> > > > > > a) Use an Oozie Java job that calls Metron enrichment classes
> > > > > > to "manually" enrich each line of the source data picked up
> > > > > > from the source dir (the one we have already developed and are
> > > > > > using). Downside: custom code built on top of the Metron source
> > > > > > code.
> > > > > >
> > > > > > b) Use NiFi to listen to the indexing Kafka topic -> split the
> > > > > > stream by source type -> put every source type into a
> > > > > > corresponding Hive table.
> > > > > >
> > > > > > I wonder if anyone has gone in either of these directions and
> > > > > > whether there are best practices for this? Please advise.
> > > > > > Thank you.
> > > > > >
> > > > > > - Dima
> >
> > --
> > Jon
> > Sent from my mobile device

--
Jon
Sent from my mobile device
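For anyone weighing scenario (b) above, a minimal sketch of the Spark Streaming alternative Carolyn mentions: read the indexing Kafka topic, split by source type, and land the records where Hive tables can pick them up. This assumes Spark 2.1+ structured streaming with the spark-sql-kafka package; the broker address, topic name, JSON field, and output paths are all assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, get_json_object

    spark = (SparkSession.builder
             .appName("metron-indexing-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "kafka1:6667")
           .option("subscribe", "indexing")
           .load())

    # Assumes the enriched JSON carries a flat "source.type" key
    # (Metron's convention); pull it out so the sink can split on it.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .withColumn("source_type",
                          get_json_object(col("json"), "$['source.type']")))

    # One directory per source type; an external Hive table per sensor
    # (or one partitioned table) can then read each directory.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///apps/metron/streamed")
             .option("checkpointLocation", "hdfs:///apps/metron/checkpoints")
             .partitionBy("source_type")
             .start())

    query.awaitTermination()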
