Thanks Chris, this is exactly what I was looking for. Nice idea about Druid, might be worth a trip down the rabbit hole, yup.
/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/


On Mon, Jan 13, 2014 at 4:07 PM, Chris Riccomini <[email protected]> wrote:

> Hey Joe,
>
> I'm going to do answers in-line.
>
> On 1/13/14 11:06 AM, "Joe Stein" <[email protected]> wrote:
>
> >Hello, I was wondering what different system(s) folks were using as a
> >final resting place for data streamed and processed through Samza, and
> >how they were getting it there?
> >
> >So are folks having Samza send the final streamed state of a job to a
> >Kafka topic, and then having a Kafka consumer connected to that topic
> >fetch those results and push them to another system, or are they plugging
> >those systems into the end of the Samza job directly?
>
> We tend to have a mix of styles. For some Samza jobs, the final output is
> a Kafka topic, which then gets consumed by some downstream system (search,
> realtime OLAP, etc.). We also have some Samza jobs that write directly to
> either a database or a web service. For the database-output jobs, we
> currently use the database's client directly from the StreamTask, but
> we're considering adopting a model where we'd write SystemProducers that
> actually write to the database under the hood. The primary advantage of
> this approach is that it would be an easy way to use threads and get async
> writes to the database--we'd only have to make sure everything is flushed
> when Samza is committing its offsets.
>
> >Also, what system(s) are folks using to store their aggregate counts
> >(assuming counting calculation streams) or the results of non-counting
> >calculations, in either case for querying by other systems afterwards?
>
> The kinds of systems that we materialize to tend to look like these:
>
> * A realtime OLAP system.
> * A metrics/monitoring system (uses RRDs under the hood).
> * A social graph system.
> * A search system.
> * A distributed database.
>
> For realtime OLAP/counting, you might have a look at Druid, which already
> has a Kafka ingestion point built in.
>
> If the system you're looking at doesn't have a Kafka ingestion point built
> in, the decision has to be made about whether to write a SystemProducer
> and have your StreamTask write to the system directly, or to write a shim
> outside of Samza that reads from the Kafka topic and writes to the
> destination system. I think the proper solution depends on your use case.
> One argument for putting the writer outside of Samza would be if you
> wanted to get write-locality with the destination system (co-locate the
> writer with the DB it's writing to). You'll have to think through your use
> case. Both styles work.
>
> >Thanks in advance.
> >
> >/*******************************************
> > Joe Stein
> > Founder, Principal Consultant
> > Big Data Open Source Security LLC
> > http://www.stealth.ly
> > Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> >********************************************/
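
For the SystemProducer approach Chris describes above, here is a minimal
sketch (in Java) of what a database-backed producer could look like. It
assumes Samza's SystemProducer interface (register/start/stop/send/flush)
and a hypothetical, thread-safe DbClient with an asynchronous insert and a
blocking waitForPendingWrites(); the point is that blocking in flush() keeps
the async writes in line with Samza's offset commits.

import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemProducer;

// Sketch only. DbClient is a hypothetical database client with an
// asynchronous insert and a blocking waitForPendingWrites().
public class DatabaseSystemProducer implements SystemProducer {
  private final DbClient db;

  public DatabaseSystemProducer(DbClient db) {
    this.db = db;
  }

  @Override
  public void register(String source) {
    // No per-source state needed in this sketch.
  }

  @Override
  public void start() {
    db.connect();
  }

  @Override
  public void stop() {
    db.close();
  }

  @Override
  public void send(String source, OutgoingMessageEnvelope envelope) {
    // Kick off an async write; the StreamTask is not blocked here.
    db.insertAsync(envelope.getKey(), envelope.getMessage());
  }

  @Override
  public void flush(String source) {
    // Samza flushes producers before committing offsets, so block until
    // every outstanding async write has been acknowledged.
    db.waitForPendingWrites();
  }
}

Wiring it up would also need a matching SystemFactory and the usual
systems.<name>.samza.factory config, which is omitted here.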
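
And a rough sketch of the "shim outside of Samza" option: a standalone
process that reads the job's output topic with Kafka's 0.8 high-level
consumer and writes each message into the destination system, co-located
with that system for write-locality. The topic name, group id, ZooKeeper
address, and DbClient below are placeholders, not anything from the thread.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

// Standalone shim: consume a Samza job's output topic and push each
// message into the destination system. DbClient is hypothetical.
public class KafkaToDbShim {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("zookeeper.connect", "localhost:2181"); // placeholder
    props.put("group.id", "samza-output-shim");       // placeholder
    props.put("auto.commit.enable", "true");

    ConsumerConnector connector =
        Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

    String topic = "samza-job-output"; // placeholder topic name
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put(topic, 1); // a single stream/thread for this sketch

    Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        connector.createMessageStreams(topicCountMap);
    ConsumerIterator<byte[], byte[]> it = streams.get(topic).get(0).iterator();

    DbClient db = new DbClient(); // hypothetical destination-system client
    while (it.hasNext()) {
      // Each message from the output topic becomes a write against the
      // destination system.
      db.insert(it.next().message());
    }
  }
}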
