Re: Long-term storage for enriched data

Otto Fowler Tue, 03 Jan 2017 07:07:53 -0800

What I would like to see is something on using Avro with a non-static
model, such as Metron would require whenever enrichments, threat
intelligence, Stellar capabilities, or sources change.
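
To make the question concrete, here is a minimal sketch of the schema
resolution behaviour that Avro specifies, written in plain Python rather than
with an Avro library. The record name, the field names, and the resolve helper
are hypothetical and made up for illustration; they are not Metron's schema.
It only covers top-level fields: a field the reader adds is filled from its
default, and a field the reader does not know about is dropped.

```python
# Sketch of Avro-style schema resolution in plain Python (no Avro library).
# Schemas and field names are hypothetical, not Metron's actual schema.

WRITER_SCHEMA_V1 = {
    "type": "record",
    "name": "EnrichedEvent",
    "fields": [
        {"name": "ip_src_addr", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}

# A later reader schema adds an optional enrichment field with a default.
READER_SCHEMA_V2 = {
    "type": "record",
    "name": "EnrichedEvent",
    "fields": [
        {"name": "ip_src_addr", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "geo_country", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    written = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            # Field exists on both sides: keep the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field is new to the reader: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields known only to the writer are never copied, so they are dropped.
    return resolved


old_record = {"ip_src_addr": "10.0.0.1", "timestamp": 1483455600000}
new_view = resolve(old_record, WRITER_SCHEMA_V1, READER_SCHEMA_V2)
```

With something like this, records written before a new enrichment was added
would still be readable under the newer schema, as long as every added field
carries a default.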



On January 2, 2017 at 11:41:32, Carolyn Duby ([email protected]) wrote:

Avro is a format that contains both the data and the schema. Here is a
quick summary:

https://avro.apache.org/docs/current/


Thanks
Carolyn



On 1/1/17, 8:41 PM, "Matt Foley" <[email protected]> wrote:

>I’m not an expert on these things, but my understanding is that Avro and
>ORC serve many of the same needs. The biggest difference is that ORC is
>columnar, and Avro isn’t. Avro, ORC, and Parquet were compared in detail at
>last year’s Hadoop Summit; the slideshare prezo is here:
>http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
>
>Its conclusion: “For complex tables with common strings, Avro with Snappy
>is a good fit. For other tables [or when applications “just need a few
>columns” of the tables], ORC with Zlib is a good fit.” (The addition in
>square brackets incorporates a quote from another part of the prezo.) But
>do look at the prezo please; it gives detailed benchmarks showing when each
>one is better.
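
A toy illustration of the row versus columnar distinction, in plain Python
with made-up records. It says nothing about the actual on-disk encodings of
Avro or ORC; it only shows why a query that needs a few columns touches less
data in a columnar layout. The to_columns helper and the field values are
assumptions for illustration, and the sketch assumes every record has the
same fields.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# Records are invented; real Avro/ORC files are binary and far more involved.

rows = [
    {"ip_src_addr": "10.0.0.1", "protocol": "tcp", "bytes": 120},
    {"ip_src_addr": "10.0.0.2", "protocol": "udp", "bytes": 64},
    {"ip_src_addr": "10.0.0.1", "protocol": "tcp", "bytes": 300},
]


def to_columns(records):
    """Pivot a list of same-shaped records into one list per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited to sum a single field.
total_from_rows = sum(record["bytes"] for record in rows)

# Column layout: only the "bytes" column is visited.
total_from_columns = sum(columns["bytes"])
```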
>
>--Matt
>
>On 1/1/17, 5:18 AM, "[email protected]" <[email protected]> wrote:
>
> I don't recall a conversation on that product specifically, but I've
> definitely brought up the need to search HDFS from time to time. Things
> like Spark SQL, Hive, Oozie have been discussed, but Avro is new to me;
> I'll have to look into it. Are you able to summarize its benefits?
>
> Jon
>
> On Wed, Dec 28, 2016, 14:45 Kyle Richardson <[email protected]>
> wrote:
>
> > This thread got me thinking... there are likely a fair number of use
> > cases for searching and analyzing the output stored in HDFS. Dima's use
> > case is certainly one. Has there been any discussion on the use of Avro
> > to store the output in HDFS? This would likely require an expansion of
> > the current json schema.
> >
> > -Kyle
> >
> > On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <[email protected]> wrote:
> >
> > > Oozie (or something like it) would appear to me to be the correct tool
> > > here. You are likely moving files around and pinning up Hive tables:
> > >
> > > - Moving the data written in HDFS from /apps/metron/enrichment/${sensor}
> > > to another directory in HDFS
> > > - Running a job in Hive or Pig or Spark to take the JSON blobs, map them
> > > to rows and pin it up as an ORC table for downstream analytics
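
The second step could be sketched roughly as below: parse each JSON blob and
project it onto a fixed column list, leaving missing fields as None. This is
plain Python rather than Hive, Pig, or Spark, and the column list, the sample
records, and the json_lines_to_rows helper are hypothetical; loading the rows
into an ORC table is left out.

```python
import json

# Hypothetical column list; a real table would follow the sensor's fields.
COLUMNS = ["timestamp", "source.type", "ip_src_addr", "ip_dst_addr"]


def json_lines_to_rows(lines, columns=COLUMNS):
    """Turn one-JSON-object-per-line text into fixed-width row tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        blob = json.loads(line)
        # Fields absent from a blob become None so every row has the same width.
        rows.append(tuple(blob.get(column) for column in columns))
    return rows


sample = [
    '{"timestamp": 1482447180000, "source.type": "bro", '
    '"ip_src_addr": "10.0.0.1", "ip_dst_addr": "10.0.0.9"}',
    "",
    '{"timestamp": 1482447181000, "source.type": "bro", '
    '"ip_src_addr": "10.0.0.2"}',
]
rows = json_lines_to_rows(sample)
```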
> > >
> > > NiFi is mostly about getting data in the cluster, not really for
> > > scheduling large-scale batch ETL, I think.
> > >
> > > Casey
> > >
> > > On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <[email protected]> wrote:
> > >
> > > > Thank you for the reply, Carolyn.
> > > >
> > > > Currently, for test purposes, we enrich the flow with Geo and
> > > > ThreatIntel malware IP, but plan to expand this further.
> > > >
> > > > Our dev team is working on an Oozie job to process this. Meanwhile, I
> > > > wonder if I could use NiFi for this purpose (because we are already
> > > > using it for data ingest and streaming).
> > > >
> > > > Could you elaborate on why it may be overkill? The idea is to have
> > > > everything in one place instead of hacking into Metron libraries and
> > > > code.
> > > >
> > > > - Dima
> > > >
> > > > On 12/22/2016 02:26 AM, Carolyn Duby wrote:
> > > > > Hi Dima -
> > > > >
> > > > > > What type of analytics are you looking to do? Is the normalized
> > > > > > format not working? You could use an Oozie or Spark job to create
> > > > > > derivative tables.
> > > > > >
> > > > > > NiFi may be overkill for breaking up the Kafka stream. Spark
> > > > > > Streaming may be easier.
> > > > >
> > > > > Thanks
> > > > > Carolyn
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -------- Original message --------
> > > > > From: Dima Kovalyov <[email protected]>
> > > > > Date: 12/21/16 6:28 PM (GMT-05:00)
> > > > > To: [email protected]
> > > > > Subject: Long-term storage for enriched data
> > > > >
> > > > > Hello,
> > > > >
> > > > > > Currently we are researching a fast and resource-efficient way to
> > > > > > save enriched data in Hive for further analytics.
> > > > > >
> > > > > > There are two scenarios that we consider:
> > > > > > a) Use an Oozie Java job that uses Metron enrichment classes to
> > > > > > "manually" enrich each line of the source data that is picked up
> > > > > > from the source dir. That is something that we have already
> > > > > > developed on our own and are using. Downside: custom code that is
> > > > > > built on top of Metron source code.
> > > > > >
> > > > > > b) Use NiFi to listen to the indexing Kafka topic -> split the
> > > > > > stream by source type -> put every source type in the
> > > > > > corresponding Hive table.
> > > > > >
> > > > > > I wonder if anyone has gone in either of these directions and
> > > > > > whether there are best practices for this. Please advise.
> > > > > Thank you.
> > > > >
> > > > > - Dima
> > > > >
> > > > >
> > > >
> > > >
> > >
> >
> --
>
> Jon
