> Seems like they're (usually) going to be used by the framework, are pretty simple to write, and could probably be written as a common Util method if we find them repetitive. Hmm.. To be honest, I didn't see the value of it too. When Xinyu suggested this change, I assumed it was the newly accepted pattern going forward. Perhaps, it makes sense when the API is user-facing. In this case, a util class is simpler to use. Let's get a consensus on this pattern and then, I will change it.
@Xinyu: Is there any specific advantage of using the static interface method pattern for class loading and instance creation? Cheers! Navina On Fri, Mar 24, 2017 at 11:06 AM, Prateek Maheshwari < pmaheshw...@linkedin.com> wrote: > Hi Navina, > > 1. Assuming the environment can put the processor ID in the config, > ProcessorIdGenerator#generateProcessorId(Config config) makes sense. > Passing all of Config is rather broad, but I don't think we have an > environment specific subset class for config yet, so should be OK. > > 2. I don't yet see the value of putting the class-loading helper default > methods in multiple public interfaces. Seems like they're (usually) going > to be used by the framework, are pretty simple to write, and could probably > be written as a common Util method if we find them repetitive. Maybe skip > this method for now and add this once we have some clarity on this new > pattern? > > If we keep it, let's name it "ProcessorIdGenerator#fromConfig(Config > config)" to be consistent with "ApplicationRunner#fromConfig(Config > config)"? > > 3. I think we're in agreement that the Processor doesn't need to have > access to the ProcessorIdGenerator, only the JobCoordinator does. With > JobCoordinator only exposing #getProcessorId I think we're good. > > +1 from me with these minor changes. Thanks for the proposal! > > Best, > Prateek > > On Thu, Mar 23, 2017 at 11:35 PM, Navina R <navi.trin...@gmail.com> wrote: > >> Hi Prateek, >> > 1. Do you have any examples of custom processor IDs? Wondering what >> information/classes ProcessorIdGenerator would need to be able to generate >> one. >> Yeah. When I was trying to implement the proposal, I was wondering the >> same thing as well. However, it might end up being specific to the >> environment. In case of Yarn, I would expect the AppMaster to still specify >> the processorID as an environment variable. In case of Rain (which is a >> Linkedin specific deployment framework),it will probably be a combination >> of sliceId and instanceId. Given these variations, I am not sure what can >> be generic enough to encompass all these information. Maybe we can pass the >> config to the instance factory. But, let me know if you have other ideas. >> >> >> > 2. The default "static" getProcessorIdGeneratorFromConfig should be >> on the ProcessorIdGenerator interface, not the JobCoordinator. Also, prefer >> removing the fromConfig suffix from the method name and calling it create >> instead of get? Not sure what the convention here is. >> You are right. It should be in ProcessorIdGenerator. I can remove the >> fromConfig suffix. Should I just call it createInstance, similar to the >> Java apis ? I am not sure what the convention is either. >> >> > I think the JobCoordinator should still only have a #getProcessorId >> method instead of #getProcessorIdGenerator. >> JobCoordinator still has getProcessorId. If I move the >> getProcessorIdGenerator helper to ProcessorIdGenerator, it makes sense? >> >> > Theoretically, a processor/container doesn't need to generate multiple >> IDs, it only needs to know its own >> I believe the reasoning originated from SAMZA-881, where we identified >> the requirements for running Samza in distributed mode and the >> responsibilities of each component in a homogenous stream processing mode. >> Theoretically speaking, it is the job coordinator that will assign >> identifiers to its processor. Practically, since it is bound to the runtime >> environment, it seems appropriate for the job coordinator to generate the >> id. If you haven't read SAMZA-881, you should give it a read. If we go by >> the assumption that a leader processor (in simpler terms, a central >> authority) generates the JobModel, it needs to "know" the identifiers of >> all processors. >> An alternative model is be the one where the leader spawns the processors >> and also, "assigns" identifiers for all processors (for example, in Yarn >> today). The latter model is restrictive in that: >> 1. it expects the leader to know the number of processors required in the >> job >> 2. leader processor is different from other processors, thus, making a >> Samza job a more heterogenous set of processors >> >> Hope these points make sense. Jagadish can add more, and even correct me >> if I got anything wrong here :) >> >> > 1. The SEP uses both ProcessorIdentifier and ProcessorIdGenerator as >> synonyms. Let's update to use ProcessorIdGenerator consistently. >> Will do >> >> > 2. Minor: 'processor.id' configuration: I'm assuming this still needs >> to be unique for each processor in the job? If so, probably worth calling >> out in the SEP or configuration docs. We can also document it as deprecated >> and a candidate for removal in near future (maybe 0.14?). >> Yes. That is still a requirement. I think I updated the document >> regarding deprecating and removing it. I will re-check. >> >> Thanks for your comments. >> >> Cheers! >> Navina >> >> On Thu, Mar 23, 2017 at 12:04 PM, Prateek Maheshwari < >> pmaheshw...@linkedin.com> wrote: >> >>> Hi Navina, >>> >>> Thanks for SEP-1, looks pretty good to me. A few questions/comments: >>> >>> Implementation/Interface related: >>> 1. Do you have any examples of custom processor IDs? Wondering what >>> information/classes ProcessorIdGenerator would need to be able to generate >>> one. >>> 2. The default "static" getProcessorIdGeneratorFromConfig should be on >>> the ProcessorIdGenerator interface, not the JobCoordinator. Also, prefer >>> removing the fromConfig suffix from the method name and calling it create >>> instead of get? Not sure what the convention here is. @Xinyu, any >>> preferences? >>> 3. +1 for removing the constructor parameter, but I think the >>> JobCoordinator should still only have a #getProcessorId method instead of >>> #getProcessorIdGenerator. Theoretically, a processor/container doesn't need >>> to generate multiple IDs, it only needs to know its own. @Jagadish, prefer >>> the more restrictive API unless you have a use case for the more general >>> one in mind. >>> >>> Documentation/SEP related: >>> 1. The SEP uses both ProcessorIdentifier and ProcessorIdGenerator as >>> synonyms. Let's update to use ProcessorIdGenerator consistently. >>> 2. Minor: 'processor.id' configuration: I'm assuming this still needs >>> to be unique for each processor in the job? If so, probably worth calling >>> out in the SEP or configuration docs. We can also document it as deprecated >>> and a candidate for removal in near future (maybe 0.14?). >>> >>> Thanks, >>> Prateek >>> >>> >>> On Tue, Mar 21, 2017 at 2:34 PM, Navina Ramesh (Apache) < >>> nav...@apache.org> wrote: >>> >>>> Hi everyone, >>>> I have updated the SEP >>>> <https://cwiki.apache.org/confluence/display/SAMZA/SEP-1%3A+ >>>> Semantics+of+ProcessorId+in+Samza> >>>> based >>>> on all the feedback. Feel free to comment. >>>> >>>> I will start the [vote] mail thread, if there are no further questions >>>> within the next 24 hours. >>>> >>>> Thanks! >>>> Navina >>>> >>>> On Tue, Mar 21, 2017 at 10:33 AM, Navina Ramesh (Apache) < >>>> nav...@apache.org> >>>> wrote: >>>> >>>> > Hi Jagadish, >>>> > Thanks for the suggestion. You are right in that it should be the >>>> > responsibility of the JobCoordinator to assign identifiers. >>>> > >>>> > > 'm only wondering if this logic could instead reside inside the >>>> > Job Coordinator (which is internal to the StreamProcessor) instead of >>>> > relying on something external to it? >>>> > >>>> > I think this is a consequence of our initial StandaloneJobCoordinator, >>>> > which is pretty much a pass-through. I didn't see any usage for >>>> > getProcessorId() and was wondering why we put it in the JobCoordinator >>>> > interface. I think I should keep your design proposal from last year >>>> handy >>>> > :) Thanks for pitching in! >>>> > >>>> > >>>> > @All: >>>> > Yesterday, there was a discussion on naming of the configuration used >>>> in >>>> > this SEP - whether it should be within the "job" scope or "app" scope >>>> > (introduced by SAMZA-1041 >>>> > <https://issues.apache.org/jira/browse/SAMZA-1041>). Multi-stage >>>> feature >>>> > and fluent-api for Samza introduces the notion of "application". >>>> Since the >>>> > processorId generator config applies to all jobs within a Samza >>>> > application, we decided to add the config for generator under "app" >>>> scope. >>>> > Further details on config scope changes can be found in SAMZA-1120. >>>> > <https://issues.apache.org/jira/browse/SAMZA-1120> >>>> > >>>> > I will send out an update once I change the SEP based on yesterday's >>>> > meeting and Jagadish's idea. >>>> > >>>> > Thanks! >>>> > Navina >>>> > >>>> > On Mon, Mar 20, 2017 at 5:22 PM, Jagadish Venkatraman < >>>> > jagadish1...@gmail.com> wrote: >>>> > >>>> >> Thanks for writing this SEP! >>>> >> >>>> >> Here's an alternate approach instead of taking the "String >>>> processorId" as >>>> >> a parameter in the constructor. In my view, the "processorId" could >>>> be >>>> >> generated by the StreamProcessor internally (instead of being >>>> generated >>>> >> up-stream and passed in). The Job Coordinator API could be as >>>> follows: >>>> >> >>>> >> >>>> >> public interface JobCoordinator { >>>> >> >>>> >> ProcessorIdGenerator getProcessorIDGenerator(); >>>> >> >>>> >> // could be String getProcessorID() >>>> >> >>>> >> JobModel getJobModel(); >>>> >> >>>> >> } >>>> >> >>>> >> public interface ProcessorIDGenerator { >>>> >> >>>> >> String getProcessorID(); >>>> >> } >>>> >> >>>> >> >>>> >> For instance, an Yarn job coordinator can merely parse the ID from >>>> config, >>>> >> and return it. A Zk backed implementation of the Job coordinator can >>>> agree >>>> >> on IDs using coordination leveraging Zk. One nice property with this >>>> >> approach is that it keeps all logic related to coordination, >>>> agreement on >>>> >> the Job Model, leader election (with potentially pluggable >>>> components for >>>> >> each) inside the JobCoordinator. >>>> >> >>>> >> To be clear, I'm all for pluggability for ID generation logic that >>>> this >>>> >> SEP >>>> >> advocates. I'm only wondering if this logic could instead reside >>>> inside >>>> >> the >>>> >> Job Coordinator (which is internal to the StreamProcessor) instead of >>>> >> relying on something external to it? >>>> >> >>>> >> Of course, there may be other considerations around the way the >>>> current >>>> >> code is structured that may prevent this. Let me know if you agree >>>> with >>>> >> this change. >>>> >> >>>> >> Thanks, >>>> >> Jag >>>> >> >>>> >> >>>> >> On Thu, Mar 16, 2017 at 5:21 PM, Navina Ramesh >>>> >> <nram...@linkedin.com.invalid >>>> >> > wrote: >>>> >> >>>> >> > > I am working on the ApplicationRunner SEP right now. Will send >>>> out the >>>> >> > discussion email once I am done. >>>> >> > >>>> >> > Perfect! :) >>>> >> > >>>> >> > On Thu, Mar 16, 2017 at 5:13 PM, xinyu liu <xinyuliu...@gmail.com> >>>> >> wrote: >>>> >> > >>>> >> > > Right, the static factory is very simple as you said. It's pretty >>>> >> > > convenient for the client to use. >>>> >> > > >>>> >> > > I am working on the ApplicationRunner SEP right now. Will send >>>> out the >>>> >> > > discussion email once I am done. >>>> >> > > >>>> >> > > Thanks, >>>> >> > > Xinyu >>>> >> > > >>>> >> > > On Thu, Mar 16, 2017 at 4:50 PM, Navina Ramesh (Apache) < >>>> >> > nav...@apache.org >>>> >> > > > >>>> >> > > wrote: >>>> >> > > >>>> >> > > > > One minor thing I found is that the name of the config is >>>> camel >>>> >> case >>>> >> > > > (*processor.idGenerator.class*). Seems Samza's practice is to >>>> use >>>> >> all >>>> >> > > > lower >>>> >> > > > case configs with "." delimiter. Do you think we should stick >>>> to >>>> >> this >>>> >> > > > convention? >>>> >> > > > >>>> >> > > > I am always torn between the "convention" we have and the >>>> better >>>> >> way of >>>> >> > > > doing things. But I don't have strong opinions about it. I can >>>> >> change >>>> >> > it. >>>> >> > > > >>>> >> > > > > One more suggestion is to have a static factory method in the >>>> >> > > > ProcessorIdGenerator (Like what we have in ApplicationRunner): >>>> >> > > > >>>> >> > > > I couldn't grasp these requirements from the ApplicationRunner >>>> >> design. >>>> >> > It >>>> >> > > > will be great if you can put it out in an SEP :) >>>> >> > > > >>>> >> > > > I can add the static factory method for it. Just to clarify, >>>> the >>>> >> static >>>> >> > > > method simply class loads the ProcessorIdGenerator ? It uses >>>> >> reflection >>>> >> > > to >>>> >> > > > create the instance ? >>>> >> > > > >>>> >> > > > Thanks! >>>> >> > > > Navina >>>> >> > > > >>>> >> > > > >>>> >> > > > >>>> >> > > > On Thu, Mar 16, 2017 at 4:31 PM, xinyu liu < >>>> xinyuliu...@gmail.com> >>>> >> > > wrote: >>>> >> > > > >>>> >> > > > > The proposal looks great to me! Changing the id type to >>>> string >>>> >> will >>>> >> > > make >>>> >> > > > > sure this can work with other types of cluster which doesn't >>>> >> support >>>> >> > > > > integer id. The interface and config provides a pluggable >>>> way to >>>> >> have >>>> >> > > > > different id generators for different use cases. One minor >>>> thing I >>>> >> > > found >>>> >> > > > is >>>> >> > > > > that the name of the config is camel case >>>> >> > > (*processor.idGenerator.class* >>>> >> > > > ). >>>> >> > > > > Seems Samza's practice is to use all lower case configs with >>>> "." >>>> >> > > > delimiter. >>>> >> > > > > Do you think we should stick to this convention? >>>> >> > > > > >>>> >> > > > > One more suggestion is to have a static factory method in >>>> >> > > > > the ProcessorIdGenerator (Like what we have in >>>> ApplicationRunner): >>>> >> > > > > >>>> >> > > > > static ProcessIdGenerator fromConfig(Config config) { ... }. >>>> >> > > > > >>>> >> > > > > With this, It will be more convenient for the >>>> ApplicationRunner to >>>> >> > > > > construct the generator. What do you think? >>>> >> > > > > >>>> >> > > > > Thanks, >>>> >> > > > > Xinyu >>>> >> > > > > >>>> >> > > > > On Wed, Mar 15, 2017 at 10:59 PM, Navina Ramesh (Apache) < >>>> >> > > > > nav...@apache.org> >>>> >> > > > > wrote: >>>> >> > > > > >>>> >> > > > > > Hi everyone, >>>> >> > > > > > I created a proposal for SAMZA-1126, which addresses the >>>> >> semantics >>>> >> > of >>>> >> > > > > > ProcessorId in Samza. For most purposes, ProcessorId is >>>> same as >>>> >> the >>>> >> > > > > logical >>>> >> > > > > > id that Samza assigns for each Yarn container. It is >>>> primarily >>>> >> used >>>> >> > > in >>>> >> > > > > > JobModel as a key for the corresponding ContainerModel and >>>> >> also, in >>>> >> > > > > > container-level metrics. We are expanding the >>>> applicability of >>>> >> > > > > processorId >>>> >> > > > > > to be beyond a fixed set of processors. >>>> >> > > > > > >>>> >> > > > > > Please review and comment on this SEP. >>>> >> > > > > > >>>> >> > > > > > For those who are not actively following the master >>>> branch, you >>>> >> may >>>> >> > > > have >>>> >> > > > > > more questions than others. Feel free to ask them here. >>>> >> > > > > > >>>> >> > > > > > @Xinyu: Since you are working on SAMZA-1067 and other >>>> related >>>> >> > > > integration >>>> >> > > > > > APIs, can you please add an SEP for SAMZA-1067 ? This will >>>> help >>>> >> > > others >>>> >> > > > > (adn >>>> >> > > > > > me as well) get on the same page with your design/code. >>>> Let me >>>> >> know >>>> >> > > if >>>> >> > > > > > SEP-1 will work per your design for ApplicationRunner. >>>> >> > > > > > >>>> >> > > > > > Thanks! >>>> >> > > > > > Navina >>>> >> > > > > > >>>> >> > > > > >>>> >> > > > >>>> >> > > >>>> >> > >>>> >> > >>>> >> > >>>> >> > -- >>>> >> > Navina R. >>>> >> > >>>> >> >>>> >> >>>> >> >>>> >> -- >>>> >> Jagadish V, >>>> >> Graduate Student, >>>> >> Department of Computer Science, >>>> >> Stanford University >>>> >> >>>> > >>>> > >>>> >>> >>> >> >