Stream processing systems are very powerful and yes – we probably don't need so much power.
On the other hand, Oozie and Azkaban are somewhat limited or built for a different purpose.
Azkaban has nice visual workflow views. But as far as I understand, how data flows between tasks is not defined.
We don't use Hadoop or Map/Reduce, so it seems we must use something like Kafka to have durable data streams.
Oozie looks complex and again more oriented toward Hadoop and Map/Reduce.
Oozie seems to support other tasks too, but again we would have to develop our own durable data streams.

As I understand, Samza uses only YARN from Hadoop?

I like Kafka a lot. But some workflow task handling is also needed: start/stop/replace tasks, monitoring, metrics, etc.
I would like to be able to rerun failed tasks, or just take some data from a stream and rerun it in a development environment for testing and bug fixing.
Kafka offers the wonderful capability to consume data as many times as you like and from any consumer you want.
We will probably extend the retention period from a week to a month.
Kafka also makes it possible to run tasks on different machines. When some of them fail we still have the data available and can start tasks on another machine without much complicated preparation.
So Kafka gives a data-driven approach and less complicated control structures.

Possible solutions:
1. Try to use Samza; more or less custom development is needed. YARN has a notion of DAG (Directed Acyclic Graph), but it seems there aren't any DAG implementations yet.
2. Use Kafka and develop a custom workflow engine. That's possible, but I hate to implement it all myself.
3. Kafka plus parts of a workflow engine from some other framework, and some custom code.

What else?

Thanks
toivo

2013/12/10 Jay Kreps <[email protected]>

> Are you sure you are looking for a stream processing system and not a
> workflow scheduler like Oozie, Azkaban, etc?
>
> -Jay
>
>
> On Tue, Dec 10, 2013 at 2:44 AM, toivo a <[email protected]> wrote:
>
> > Samza looks really good.
> > Many concepts are exactly what is needed.
> > But we have some specific requirements and I am not sure how to implement
> > them with Samza.
> >
> > Background
> > ---------------
> > We have components which correspond to a Samza Job/Task.
> > Some are universal components, for example FileReader, JMSListener,
> > SCPSender, PDFCreator, etc.
> > Such reusable components are used in many different business processes.
> > And we have business-process-specific components, for example
> > InvoiceGenerator, ClientsImporter, ProductListFilter, etc.
> > Components can be either message sources or transformers.
> > Message source components don't have inputs, only outputs.
> > Transformers have inputs and outputs.
> >
> > We have a notion of business process, very similar to BPMN.
> > A business process definition has a name and a flow/topology description,
> > and every component in a business process can also have process-specific
> > configuration.
> > For example:
> > process InvoiceExport might have the flow: InvoiceGenerator -> PDFCreator ->
> > SCPSender
> > And
> > process OverduePaymentExport, flow: OverduePaymentGenerator -> PDFCreator
> > -> SCPSender
> >
> > Both processes use the same components PDFCreator and SCPSender, but the
> > component configurations may differ.
> >
> > Questions
> > ------------
> > As I understand, Samza doesn't have something like a Business Process.
> > We would like to run different Business Processes at the same time with
> > Samza, isolated from each other.
> > So must we use different input stream names to isolate the processes' data
> > flows from each other?
> > For example process InvoiceExport uses streams named:
> > InvoiceExport.PDFCreator.stream
> > and process OverduePaymentExport uses streams named:
> > OverduePaymentExport.PDFCreator.stream
> >
> > And it also seems that we can't start a Business Process as a whole, but
> > must instead start components one by one.
> > And shutting down a process also means shutting down components one by one.
> > Any ideas how this can be automated?
> >
> > Currently our business process configuration file includes configuration
> > for all components which are used in the process.
> > For example, different business processes have different queue names for
> > the JMSListener component.
> > I am confused about how to translate this to Samza job configurations.
> > Create some sort of custom translator which generates Samza job
> > configurations?
> >
> > We don't actually have a lot of messages and initially we don't need
> > speedy processing.
> > But reliability and not losing a single message are required.
> > What is the current state of Samza?
> > Can we use it in production?
> > Can you give some hint of when a stable release will be out?
> >
> > Please can anyone help me?
> >
> > Thank you for your time
> > toivo
> >
>
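To make the "custom translator" idea from the quoted questions more concrete, here is a rough Python sketch. It takes one business process definition and emits one property set per component, using the <Process>.<Component>.stream naming from the example above to keep processes isolated. Everything here is an assumption for illustration only: the BusinessProcess class, the com.example.components package, the "kafka" system name and the component.* keys are made up. The job.name, task.class and task.inputs keys are modeled on Samza's property style as I understand it and should be checked against the Samza configuration docs.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessProcess:
    """Hypothetical process definition: a name, a linear flow, per-component settings."""
    name: str
    flow: list  # component names in order, message source first
    settings: dict = field(default_factory=dict)  # component name -> {key: value}


def stream_name(process, component):
    # Input stream of `component` inside `process`, e.g. InvoiceExport.PDFCreator.stream
    return f"{process}.{component}.stream"


def job_configs(process):
    """Return one config dict per component, in flow order."""
    configs = []
    for i, component in enumerate(process.flow):
        cfg = {
            "job.name": f"{process.name}.{component}",
            # Made-up package; the real task class names would go here.
            "task.class": f"com.example.components.{component}",
        }
        if i > 0:
            # Transformers read their own process-scoped input stream.
            cfg["task.inputs"] = "kafka." + stream_name(process.name, component)
        if i < len(process.flow) - 1:
            # Hypothetical key: tells the component where to write, which is
            # the input stream of the next component in the flow.
            cfg["component.output.stream"] = stream_name(process.name, process.flow[i + 1])
        # Process-specific overrides, e.g. a different queue name per process.
        for key, value in process.settings.get(component, {}).items():
            cfg[f"component.{key}"] = value
        configs.append(cfg)
    return configs
```

Because the configs come back in flow order, a small wrapper script could loop over them to start (or, in reverse, stop) all components of one process, which is one possible way to automate the one-by-one start and shutdown. The sketch only handles linear flows; a branching topology would need a graph instead of a list.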
