Hi Samza devs, users, and enthusiasts,

I've kept an eye on the Samza project for a while and I think it's super cool! I hope it continues to mature and expand, as it seems very promising (:
One thing I've been wondering for a while is: how do people serve the data they compute with Samza? More specifically (see the P.S. below for a rough sketch of the kind of job I have in mind):

1. How do you expose the output of Samza jobs to online applications that need low-latency reads?
2. Are these online apps mostly internal (i.e. analytics, dashboards, etc.) or public/user-facing?
3. What systems do you currently use (or plan to use in the short term) to host the data generated by Samza? HBase? Cassandra? MySQL? Druid? Others?
4. Are you satisfied with the write throughput these storage/serving systems support, or are you facing challenges? What about read throughput?
5. Are there situations where you want to re-process all historical data after improving your Samza job, which means re-ingesting all of the Samza output into your online serving system (as described in the Kappa Architecture<http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>)? Is this easy breezy or painful? Do you need to throttle it lest your serving system fall over?
6. If there were a highly optimized and reliable way of ingesting partitioned streams quickly into your online serving system, would that help you leverage Samza more effectively?

Your insights would be much appreciated! Thanks (: -- Felix
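P.S. For concreteness, here's a minimal sketch of the kind of job I'm imagining, using the low-level StreamTask API; the "member-page-view-counts" topic and the PageViewCountTask class are just made-up names for illustration, not anyone's actual job. It emits a keyed running count that a downstream serving store could ingest as upserts for low-latency reads:

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical job: keeps a running page-view count per member and emits each
// update, keyed by member, to an output stream that a serving system ingests.
public class PageViewCountTask implements StreamTask {
  // Placeholder output stream; a real job would configure this, and would use
  // Samza's changelog-backed key-value state rather than an in-memory map.
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "member-page-view-counts");

  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) throws Exception {
    String memberId = (String) envelope.getKey();
    Integer current = counts.get(memberId);
    int updated = (current == null ? 0 : current) + 1;
    counts.put(memberId, updated);
    // The output keeps the input's partitioning key, so a Kappa-style
    // re-processing run simply replays the same keyed upserts downstream.
    collector.send(new OutgoingMessageEnvelope(OUTPUT, memberId, updated));
  }
}

Questions 5 and 6 are basically about how painful it is to replay an output stream like this into the serving system at full speed.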