Thanks Chris, this is exactly what I was looking for. Nice idea about Druid, might be worth a trip down the rabbit hole, yup.
/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/


On Mon, Jan 13, 2014 at 4:07 PM, Chris Riccomini <[email protected]> wrote:

> Hey Joe,
>
> I'm going to do answers in-line.
>
> On 1/13/14 11:06 AM, "Joe Stein" <[email protected]> wrote:
>
> >Hello, I was wondering what different system(s) folks were using as a
> >final resting place for data streamed and processed through Samza, and
> >how they were getting it there?
> >
> >So are folks having Samza send the final streamed state of a job to a
> >Kafka topic, and then having a Kafka consumer connected to that topic
> >fetch those results and push them to another system, or are they plugging
> >those systems into the end of the Samza job directly?
>
> We tend to have a mix of styles. For some Samza jobs, the final output is
> a Kafka topic, which then gets consumed by some downstream system (search,
> realtime OLAP, etc.). We also have some Samza jobs that write directly to
> either a database or a web service. For the database-output jobs, we
> currently use the database's client directly from the StreamTask, but
> we're considering adopting a model where we'd write SystemProducers that
> actually write to the database under the hood. The primary advantage of
> this approach is that it would be an easy way to use threads and get async
> writes to the database--we'd only have to make sure everything is flushed
> when Samza is committing its offsets.
>
> >Also, what system(s) are folks using to store their aggregate counts
> >(assuming counting calculation streams) or the results of non-counting
> >calculations, in either case for querying by other systems afterwards?
>
> The kinds of systems that we materialize to tend to look like these:
>
> * A realtime OLAP system.
> * A metrics/monitoring system (uses RRDs under the hood).
> * A social graph system.
> * A search system.
> * A distributed database.
>
> For realtime OLAP/counting, you might have a look at Druid, which already
> has a Kafka ingestion point built in.
>
> If the system you're looking at doesn't have a Kafka ingestion point built
> in, the decision has to be made about whether to write a SystemProducer
> and have your StreamTask write to the system directly, or to write a shim
> outside of Samza that reads from the Kafka topic and writes to the
> destination system. I think the proper solution depends on your use case.
> One argument for putting the writer outside of Samza would be if you
> wanted to get write-locality with the destination system (co-locate the
> writer with the DB it's writing to). You'll have to think through your use
> case. Both styles work.
>
> >Thanks in advance.
> >
> >/*******************************************
> > Joe Stein
> > Founder, Principal Consultant
> > Big Data Open Source Security LLC
> > http://www.stealth.ly
> > Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> >********************************************/
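
For the SystemProducer approach Chris describes above, here is a minimal
sketch (in Java) of what a database-backed producer could look like. It
assumes Samza's SystemProducer interface (register/start/stop/send/flush)
and a hypothetical, thread-safe DbClient with an asynchronous insert and a
blocking waitForPendingWrites(); the point is that blocking in flush() keeps
the async writes in line with Samza's offset commits.

import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemProducer;

// Sketch only. DbClient is a hypothetical database client with an
// asynchronous insert and a blocking waitForPendingWrites().
public class DatabaseSystemProducer implements SystemProducer {
  private final DbClient db;

  public DatabaseSystemProducer(DbClient db) {
    this.db = db;
  }

  @Override
  public void register(String source) {
    // No per-source state needed in this sketch.
  }

  @Override
  public void start() {
    db.connect();
  }

  @Override
  public void stop() {
    db.close();
  }

  @Override
  public void send(String source, OutgoingMessageEnvelope envelope) {
    // Kick off an async write; the StreamTask is not blocked here.
    db.insertAsync(envelope.getKey(), envelope.getMessage());
  }

  @Override
  public void flush(String source) {
    // Samza flushes producers before committing offsets, so block until
    // every outstanding async write has been acknowledged.
    db.waitForPendingWrites();
  }
}

Wiring it up would also need a matching SystemFactory and the usual
systems.<name>.samza.factory config, which is omitted here.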
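
And a rough sketch of the "shim outside of Samza" option: a standalone
process that reads the job's output topic with Kafka's 0.8 high-level
consumer and writes each message into the destination system, co-located
with that system for write-locality. The topic name, group id, ZooKeeper
address, and DbClient below are placeholders, not anything from the thread.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

// Standalone shim: consume a Samza job's output topic and push each
// message into the destination system. DbClient is hypothetical.
public class KafkaToDbShim {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("zookeeper.connect", "localhost:2181"); // placeholder
    props.put("group.id", "samza-output-shim");       // placeholder
    props.put("auto.commit.enable", "true");

    ConsumerConnector connector =
        Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

    String topic = "samza-job-output"; // placeholder topic name
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put(topic, 1); // a single stream/thread for this sketch

    Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        connector.createMessageStreams(topicCountMap);
    ConsumerIterator<byte[], byte[]> it = streams.get(topic).get(0).iterator();

    DbClient db = new DbClient(); // hypothetical destination-system client
    while (it.hasNext()) {
      // Each message from the output topic becomes a write against the
      // destination system.
      db.insert(it.next().message());
    }
  }
}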
