Avro is a serialization format that stores both the data and its schema in the
same file.  Here is a quick summary:

https://avro.apache.org/docs/current/
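
For a concrete feel, here is a minimal sketch using the Python avro package
(untested; the call is spelled avro.schema.Parse in some releases, and the
schema and field names are just an example):

    import json
    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    # the schema is declared up front...
    schema = avro.schema.parse(json.dumps({
        "type": "record", "name": "Enriched",
        "fields": [{"name": "ip_src_addr", "type": "string"},
                   {"name": "is_malicious", "type": "boolean"}]
    }))

    # ...and is embedded in the file alongside the records,
    # so every .avro file is self-describing
    writer = DataFileWriter(open("enriched.avro", "wb"), DatumWriter(), schema)
    writer.append({"ip_src_addr": "10.0.0.1", "is_malicious": True})
    writer.close()

    # readers recover the schema from the file itself
    reader = DataFileReader(open("enriched.avro", "rb"), DatumReader())
    for record in reader:
        print(record)
    reader.close()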


Thanks
Carolyn



On 1/1/17, 8:41 PM, "Matt Foley" <[email protected]> wrote:

>I’m not an expert on these things, but my understanding is that Avro and ORC 
>serve many of the same needs.  The biggest difference is that ORC is columnar, 
>and Avro isn’t.  Avro, ORC, and Parquet were compared in detail at last year’s 
>Hadoop Summit; the slideshare prezo is here: 
>http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
>
>Its conclusion: “For complex tables with common strings, Avro with Snappy is 
>a good fit.  For other tables [or when applications “just need a few columns” 
>of the tables], ORC with Zlib is a good fit.”  (The addition in square 
>brackets incorporates a quote from another part of the prezo.)  But please do 
>look at the prezo; it gives detailed benchmarks showing when each one is better.
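>
>For illustration only (untested; the tiny DataFrame, the paths, and the
>spark-avro package wiring are assumptions on my part), writing the same data
>both ways might look roughly like this in PySpark:
>
>    from pyspark.sql import SparkSession
>
>    spark = SparkSession.builder.getOrCreate()
>    df = spark.createDataFrame([("10.0.0.1", True)], ["ip_src_addr", "is_malicious"])
>
>    # ORC with zlib compression
>    df.write.option("compression", "zlib").orc("/tmp/enriched_orc")
>
>    # Avro with snappy compression (needs the spark-avro package on the
>    # classpath; the 2016-era Databricks package takes the codec from this conf)
>    spark.conf.set("spark.sql.avro.compression.codec", "snappy")
>    df.write.format("com.databricks.spark.avro").save("/tmp/enriched_avro")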
>
>--Matt
>
>On 1/1/17, 5:18 AM, "[email protected]" <[email protected]> wrote:
>
>    I don't recall a conversation on that product specifically, but I've
>    definitely brought up the need to search HDFS from time to time.  Things
>    like Spark SQL, Hive, and Oozie have been discussed, but Avro is new to me;
>    I'll have to look into it.  Are you able to summarize its benefits?
>    
>    Jon
>    
>    On Wed, Dec 28, 2016, 14:45 Kyle Richardson <[email protected]>
>    wrote:
>    
>    > This thread got me thinking... there are likely a fair number of use cases
>    > for searching and analyzing the output stored in HDFS. Dima's use case is
>    > certainly one. Has there been any discussion on the use of Avro to store
>    > the output in HDFS? This would likely require an expansion of the current
>    > json schema.
>    >
>    > -Kyle
>    >
>    > On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <[email protected]> wrote:
>    >
>    > > Oozie (or something like it) would appear to me to be the correct tool
>    > > here.  You are likely moving files around and spinning up Hive tables:
>    > >
>    > >    - Moving the data written in HDFS from /apps/metron/enrichment/${sensor}
>    > >      to another directory in HDFS
>    > >    - Running a job in Hive, Pig, or Spark to take the JSON blobs, map them
>    > >      to rows, and register them as an ORC table for downstream analytics
>    > >      (rough sketch below)
>    > >
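>    > > A rough, untested sketch of that second step in PySpark (the paths and
>    > > the table name here are made up for illustration):
>    > >
>    > >     from pyspark.sql import SparkSession
>    > >
>    > >     spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>    > >
>    > >     # read the enriched JSON blobs after they've been moved out of the landing dir
>    > >     df = spark.read.json("/apps/metron/archive/bro")
>    > >
>    > >     # register the result as an ORC-backed Hive table for downstream analytics
>    > >     df.write.format("orc").mode("overwrite").saveAsTable("metron.bro_enriched")
>    > >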
>    > > NiFi is mostly about getting data into the cluster, not really for
>    > > scheduling large-scale batch ETL, I think.
>    > >
>    > > Casey
>    > >
>    > > On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <[email protected]>
>    > > wrote:
>    > >
>    > > > Thank you for the reply, Carolyn,
>    > > >
>    > > > Currently, for test purposes, we enrich flows with GeoIP and ThreatIntel
>    > > > malware IPs, but we plan to expand this further.
>    > > >
>    > > > Our dev team is working on an Oozie job to process this. Meanwhile I
>    > > > wonder if I could use NiFi for this purpose (because we are already
>    > > > using it for data ingest and streaming).
>    > > >
>    > > > Could you elaborate on why it may be overkill? The idea is to have
>    > > > everything in one place instead of hacking into the Metron libraries
>    > > > and code.
>    > > >
>    > > > - Dima
>    > > >
>    > > > On 12/22/2016 02:26 AM, Carolyn Duby wrote:
>    > > > > Hi Dima -
>    > > > >
>    > > > > What type of analytics are you looking to do?  Is the normalized
>    > > > > format not working?  You could use an Oozie or Spark job to create
>    > > > > derivative tables.
>    > > > >
>    > > > > NiFi may be overkill for breaking up the Kafka stream.  Spark
>    > > > > Streaming may be easier.
>    > > > >
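>    > > > > For example, something along these lines with the Spark Streaming
>    > > > > Kafka direct stream (untested; the broker address, topic name, and
>    > > > > the save_by_source_type helper are all made up for illustration):
>    > > > >
>    > > > >     import json
>    > > > >     from pyspark import SparkContext
>    > > > >     from pyspark.streaming import StreamingContext
>    > > > >     from pyspark.streaming.kafka import KafkaUtils
>    > > > >
>    > > > >     def save_by_source_type(rdd):
>    > > > >         # hypothetical helper: one Hive table per source.type
>    > > > >         for src in rdd.map(lambda m: m.get("source.type")).distinct().collect():
>    > > > >             pass  # filter the batch by src and append to the matching table
>    > > > >
>    > > > >     sc = SparkContext(appName="split-indexing-topic")
>    > > > >     ssc = StreamingContext(sc, batchDuration=10)
>    > > > >
>    > > > >     stream = KafkaUtils.createDirectStream(
>    > > > >         ssc, ["indexing"], {"metadata.broker.list": "kafka-broker:6667"})
>    > > > >
>    > > > >     # each record is (key, json string); parse, then persist per source type
>    > > > >     stream.map(lambda kv: json.loads(kv[1])).foreachRDD(save_by_source_type)
>    > > > >
>    > > > >     ssc.start()
>    > > > >     ssc.awaitTermination()
>    > > > >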
>    > > > > Thanks
>    > > > > Carolyn
>    > > > >
>    > > > >
>    > > > >
>    > > > > Sent from my Verizon, Samsung Galaxy smartphone
>    > > > >
>    > > > >
>    > > > > -------- Original message --------
>    > > > > From: Dima Kovalyov <[email protected]>
>    > > > > Date: 12/21/16 6:28 PM (GMT-05:00)
>    > > > > To: [email protected]
>    > > > > Subject: Long-term storage for enriched data
>    > > > >
>    > > > > Hello,
>    > > > >
>    > > > > Currently we are researching a fast and resource-efficient way to save
>    > > > > enriched data in Hive for further analytics.
>    > > > >
>    > > > > There are two scenarios that we consider:
>    > > > > a) Use an Oozie Java job that uses the Metron enrichment classes to
>    > > > > "manually" enrich each line of the source data that is picked up from
>    > > > > the source dir (the one that we have developed already and are using).
>    > > > > That is something we developed on our own. Downside: custom code built
>    > > > > on top of the Metron source code.
>    > > > >
>    > > > > b) Use NiFi to listen to the indexing Kafka topic -> split the stream
>    > > > > by source type -> put every source type in its corresponding Hive
>    > > > > table.
>    > > > >
>    > > > > I wonder if anyone has gone in either of these directions and whether
>    > > > > there are best practices for this? Please advise.
>    > > > > Thank you.
>    > > > >
>    > > > > - Dima
>    > > > >
>    > > > >
>    > > >
>    > > >
>    > >
>    >
>    -- 
>    
>    Jon
>    
>    Sent from my mobile device
>    
>
>
>
>
