Hi Eric,

Thanks for the additional clarification. I now have a better understanding of your use case. Your use case can be accomplished using various approaches.
1. One Samza application that consumes all of your input streams (A and D) as well as your intermediate output stream (B) and routes it to HDFS or ElasticSearch. With this approach:
   1. You may have to scale your entire job up or down to react to changes in any of your inputs (A, D, and B as well), which can result in underutilization of resources.
   2. A failure in one component could impact other components. For example, there may be other writers to output stream B that you don't want to disrupt because of a bug in the transformation logic for input stream A.
   3. Along the same lines as 2, it is also possible that during an HDFS or ElasticSearch outage, back pressure causes output stream B to grow, which might increase the time to catch up and also impact your job's performance (e.g., this can happen if you have priorities set up across your streams).

2. Another approach is to split them into multiple jobs: one that consumes the input sources (A and D) and routes them to the appropriate destinations (B and C), and another that consumes the output of the previous job (B) and routes it to HDFS or ElasticSearch. With this approach:
   1. You isolate the common/shared logic that writes to HDFS or ElasticSearch into its own job, which allows you to manage it independently, including scaling it up or down.
   2. During HDFS/ElasticSearch outages, other components are not impacted; the back pressure causes your stream B to grow, which Kafka can handle well.

Our recommended API for writing Samza applications is Apache Beam <https://beam.apache.org/>. Examples of how to write a sample application using the Beam API and run it on Samza can be found at https://github.com/apache/samza-beam-examples.

Please reach out to us if you have more questions.

Thanks,
Bharath

On Wed, Oct 16, 2019 at 7:31 AM Eric Shieh <datosl...@gmail.com> wrote:

> Thank you Bharath. Regarding my 2nd question, perhaps the following
> scenario can help to illustrate what I am looking to achieve:
>
> Input stream A -> Job 1 -> Output stream B (Kafka Topic B)
> Input stream A -> Job 2 -> Output stream C
> Input stream D -> Job 3 -> Output stream B (Kafka Topic B)
> Input stream B (Kafka Topic B) -> Elasticsearch (or write to HDFS)
>
> In the case of "Input stream B (Kafka Topic B) -> Elasticsearch (or write
> to HDFS)", this is what I was referring to as "common/shared system
> services": it has no transformation logic, just a message sink to either
> Elasticsearch or HDFS using Samza's systems/connectors. In other words,
> Job 1 and Job 3 both output to "Output stream B" expecting messages will
> be persisted in Elasticsearch or HDFS; would I need to specify the
> system/connector configuration separately in Job 1 and Job 3? Is there a
> way to have "Input stream B (Kafka Topic B) -> Elasticsearch (or write to
> HDFS)" as its own stand-alone job so I can have the following:
>
> RESTful web services (or other non-Samza services/applications) as Kafka
> producer -> Input stream B (Kafka Topic B) -> Elasticsearch (or write to
> HDFS)
>
> Regards,
>
> Eric
>
> On Mon, Oct 14, 2019 at 8:35 PM Bharath Kumara Subramanian <
> codin.mart...@gmail.com> wrote:
>
> > Hi Eric,
> >
> > Answers to your questions are as follows.
> >
> > *Can I, or is it recommended to, package multiple jobs as 1 deployment
> > with 1 properties file or keep each app separated? Based on the
> > documentation, it appears to support 1 app/job within a single
> > configuration, as there is no mechanism to assign multiple app classes
> > and give each a name unless I am mistaken.*
> >
> > *app.class* is a single-valued configuration, and your understanding of
> > it based on the documentation is correct.
> >
> > *If only 1 app per config+deployment, what is the best way to handle
> > requirement #3 - common/shared system services, as there is no app or
> > job per se, I just need to specify the streams and output system (ie
> > org.apache.samza.system.hdfs.writer)*
> >
> > There are a couple of options to achieve your #3 requirement.
> >
> > 1. If there is enough commonality between your jobs, you could have one
> > application class that describes the logic and have different
> > configurations modify the behavior of the application logic. This does
> > come with some of the following considerations:
> >    1. Your deployment system needs to support deploying the same
> >    application with different configs.
> >    2. Potential duplication of configuration if your configuration
> >    system doesn't support hierarchies and overrides.
> >    3. Potentially unmanageable for evolution, since a change in the
> >    application affects multiple jobs and requires extensive testing
> >    across different configurations.
> > 2. You could have libraries that perform pieces of the business logic
> > and have your different jobs leverage them using composition. Some
> > things to consider with this option:
> >    1. Your application and configuration stay isolated.
> >    2. You could still leverage some of the common configurations if
> >    your configuration system supports hierarchies and overrides.
> >    3. It alleviates concerns over evolution and testing as long as the
> >    changes are application specific.
> >
> > I am still unclear about the second part of your 2nd question.
> > Do you mean to say all your jobs consume from the same sources and write
> > to the same destinations, and only your processing logic is different?
> >
> > *common/shared system services as there is no app or job per se, I just
> > need to specify the streams and output system*
> >
> > Also, I am not sure I follow what you mean by "*there is no app or
> > job*". You still have 1 app per config + deployment, right?
> >
> > Thanks,
> > Bharath
> >
> > On Thu, Oct 10, 2019 at 9:46 AM Eric Shieh <datosl...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am new to Samza; I am evaluating Samza as the backbone for my
> > > streaming CEP requirement. I have:
> > >
> > > 1. Multiple data enrichment and ETL jobs
> > > 2. Multiple domain-specific CEP rulesets
> > > 3. Common/shared system services, like consuming topics/streams and
> > > persisting the messages in ElasticSearch and HDFS.
> > >
> > > My questions are:
> > >
> > > 1. Can I, or is it recommended to, package multiple jobs as 1
> > > deployment with 1 properties file or keep each app separated? Based on
> > > the documentation, it appears to support 1 app/job within a single
> > > configuration, as there is no mechanism to assign multiple app classes
> > > and give each a name unless I am mistaken.
> > > 2. If only 1 app per config+deployment, what is the best way to handle
> > > requirement #3 - common/shared system services, as there is no app or
> > > job per se, I just need to specify the streams and output system (ie
> > > org.apache.samza.system.hdfs.writer.BinarySequenceFileHdfsWriter or
> > > org.apache.samza.system.elasticsearch.indexrequest.DefaultIndexRequestFactory).
> > > Given it's a common shared system service not tied to specific jobs,
> > > can it be deployed without an app?
> > >
> > > Thank you in advance for your help; looking forward to learning more
> > > about Samza and developing this critical feature using Samza!
> > >
> > > Regards,
> > >
> > > Eric
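As a rough illustration of the second approach and the Beam recommendation above, here is a minimal sketch of what the stand-alone "Input stream B (Kafka Topic B) -> Elasticsearch" job could look like with the Beam API on the Samza runner. The class name, broker address, topic name, and Elasticsearch index/type below are placeholders, and the sketch assumes the messages on topic B are already JSON documents and that Beam's KafkaIO and ElasticsearchIO modules are on the classpath; the samza-beam-examples repository linked above has complete, tested examples.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StreamBToElasticsearch {
  public static void main(String[] args) {
    // Pass --runner=SamzaRunner (with the beam-runners-samza dependency on the classpath)
    // to execute this pipeline on Samza, as in the samza-beam-examples repo.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Consume the shared topic B. Job 1, Job 3, and any non-Samza producer keep writing
        // to it; this job owns the sink logic and can be scaled or restarted independently.
        .apply("ReadStreamB",
            KafkaIO.<String, String>read()
                .withBootstrapServers("localhost:9092")        // placeholder broker address
                .withTopic("stream-B")                         // placeholder topic name
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
        // Keep only the message payload; assumes the payload is already a JSON document.
        .apply("ExtractPayload",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, String> record) -> record.getValue()))
        // Index each document into Elasticsearch; address, index, and type are placeholders.
        .apply("WriteToElasticsearch",
            ElasticsearchIO.write().withConnectionConfiguration(
                ElasticsearchIO.ConnectionConfiguration.create(
                    new String[] {"http://localhost:9200"}, "stream-b", "_doc")));

    pipeline.run().waitUntilFinish();
  }
}

Because the sink lives in its own pipeline, an Elasticsearch or HDFS outage only backs up topic B (which Kafka handles well), and Jobs 1 and 3 remain unaffected, as discussed above.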