Re: [DISCUSS] SEP-1: Semantics of ProcessorId in Samza

Navina R Fri, 24 Mar 2017 11:23:01 -0700

>  Seems like they're (usually) going to be used by the framework, are
pretty simple to write, and could probably be written as a common Util
method if we find them repetitive.
Hmm.. To be honest, I didn't see the value of it too.  When Xinyu suggested
this change, I assumed it was the newly accepted pattern going forward.
Perhaps, it makes sense when the API is user-facing. In this case, a util
class is simpler to use. Let's get a consensus on this pattern and then, I
will change it.


@Xinyu: Is there any specific advantage of using the static interface
method pattern for class loading and instance creation?

Cheers!
Navina




On Fri, Mar 24, 2017 at 11:06 AM, Prateek Maheshwari <
pmaheshw...@linkedin.com> wrote:

> Hi Navina,
>
> 1. Assuming the environment can put the processor ID in the config,
> ProcessorIdGenerator#generateProcessorId(Config config) makes sense.
> Passing all of Config is rather broad, but I don't think we have an
> environment specific subset class for config yet, so should be OK.
>
> 2. I don't yet see the value of putting the class-loading helper default
> methods in multiple public interfaces. Seems like they're (usually) going
> to be used by the framework, are pretty simple to write, and could probably
> be written as a common Util method if we find them repetitive. Maybe skip
> this method for now and add this once we have some clarity on this new
> pattern?
>
> If we keep it, let's name it "ProcessorIdGenerator#fromConfig(Config
> config)" to be consistent with "ApplicationRunner#fromConfig(Config
> config)"?
>
> 3. I think we're in agreement that the Processor doesn't need to have
> access to the ProcessorIdGenerator, only the JobCoordinator does. With
> JobCoordinator only exposing #getProcessorId I think we're good.
>
> +1 from me with these minor changes. Thanks for the proposal!
>
> Best,
> Prateek
>
> On Thu, Mar 23, 2017 at 11:35 PM, Navina R <navi.trin...@gmail.com> wrote:
>
>> Hi Prateek,
>> > 1. Do you have any examples of custom processor IDs? Wondering what
>> information/classes ProcessorIdGenerator would need to be able to generate
>> one.
>> Yeah. When I was trying to implement the proposal, I was wondering the
>> same thing as well. However, it might end up being specific to the
>> environment. In case of Yarn, I would expect the AppMaster to still specify
>> the processorID as an environment variable. In case of Rain (which is a
>> Linkedin specific deployment framework),it will probably be a combination
>> of sliceId and instanceId. Given these variations, I am not sure what can
>> be generic enough to encompass all these information. Maybe we can pass the
>> config to the instance factory. But, let me know if you have other ideas.
>>
>>
>>  > 2. The default "static" getProcessorIdGeneratorFromConfig should be
>> on the ProcessorIdGenerator interface, not the JobCoordinator. Also, prefer
>> removing the fromConfig suffix from the method name and calling it create
>> instead of get? Not sure what the convention here is.
>> You are right. It should be in ProcessorIdGenerator. I can remove the
>> fromConfig suffix. Should I just call it createInstance, similar to the
>> Java apis ? I am not sure what the convention is either.
>>
>> > I think the JobCoordinator should still only have a #getProcessorId
>> method instead of #getProcessorIdGenerator.
>> JobCoordinator still has getProcessorId. If I move the
>> getProcessorIdGenerator helper to ProcessorIdGenerator, it makes sense?
>>
>> > Theoretically, a processor/container doesn't need to generate multiple
>> IDs, it only needs to know its own
>> I believe the reasoning originated from SAMZA-881, where we identified
>> the requirements for running Samza in distributed mode and the
>> responsibilities of each component in a homogenous stream processing mode.
>> Theoretically speaking, it is the job coordinator that will assign
>> identifiers to its processor. Practically, since it is bound to the runtime
>> environment, it seems appropriate for the job coordinator to generate the
>> id. If you haven't read SAMZA-881, you should give it a read. If we go by
>> the assumption that a leader processor (in simpler terms, a central
>> authority) generates the JobModel, it needs to "know" the identifiers of
>> all processors.
>> An alternative model is be the one where the leader spawns the processors
>> and also, "assigns" identifiers for all processors (for example, in Yarn
>> today). The latter model is restrictive in that:
>> 1. it expects the leader to know the number of processors required in the
>> job
>> 2. leader processor is different from other processors, thus, making a
>> Samza job a more heterogenous set of processors
>>
>> Hope these points make sense. Jagadish can add more, and even correct me
>> if I got anything wrong here :)
>>
>> > 1. The SEP uses both ProcessorIdentifier and ProcessorIdGenerator as
>> synonyms. Let's update to use ProcessorIdGenerator consistently.
>> Will do
>>
>> > 2. Minor: 'processor.id' configuration: I'm assuming this still needs
>> to be unique for each processor in the job? If so, probably worth calling
>> out in the SEP or configuration docs. We can also document it as deprecated
>> and a candidate for removal in near future (maybe 0.14?).
>> Yes. That is still a requirement. I think I updated the document
>> regarding deprecating and removing it. I will re-check.
>>
>> Thanks for your comments.
>>
>> Cheers!
>> Navina
>>
>> On Thu, Mar 23, 2017 at 12:04 PM, Prateek Maheshwari <
>> pmaheshw...@linkedin.com> wrote:
>>
>>> Hi Navina,
>>>
>>> Thanks for SEP-1, looks pretty good to me. A few questions/comments:
>>>
>>> Implementation/Interface related:
>>> 1. Do you have any examples of custom processor IDs? Wondering what
>>> information/classes ProcessorIdGenerator would need to be able to generate
>>> one.
>>> 2. The default "static" getProcessorIdGeneratorFromConfig should be on
>>> the ProcessorIdGenerator interface, not the JobCoordinator. Also, prefer
>>> removing the fromConfig suffix from the method name and calling it create
>>> instead of get? Not sure what the convention here is. @Xinyu, any
>>> preferences?
>>> 3. +1 for removing the constructor parameter, but I think the
>>> JobCoordinator should still only have a #getProcessorId method instead of
>>> #getProcessorIdGenerator. Theoretically, a processor/container doesn't need
>>> to generate multiple IDs, it only needs to know its own. @Jagadish, prefer
>>> the more restrictive API unless you have a use case for the more general
>>> one in mind.
>>>
>>> Documentation/SEP related:
>>> 1. The SEP uses both ProcessorIdentifier and ProcessorIdGenerator as
>>> synonyms. Let's update to use ProcessorIdGenerator consistently.
>>> 2. Minor: 'processor.id' configuration: I'm assuming this still needs
>>> to be unique for each processor in the job? If so, probably worth calling
>>> out in the SEP or configuration docs. We can also document it as deprecated
>>> and a candidate for removal in near future (maybe 0.14?).
>>>
>>> Thanks,
>>> Prateek
>>>
>>>
>>> On Tue, Mar 21, 2017 at 2:34 PM, Navina Ramesh (Apache) <
>>> nav...@apache.org> wrote:
>>>
>>>> Hi everyone,
>>>> I have updated the SEP
>>>> <https://cwiki.apache.org/confluence/display/SAMZA/SEP-1%3A+
>>>> Semantics+of+ProcessorId+in+Samza>
>>>> based
>>>> on all the feedback. Feel free to comment.
>>>>
>>>> I will start the [vote] mail thread, if there are no further questions
>>>> within the next 24 hours.
>>>>
>>>> Thanks!
>>>> Navina
>>>>
>>>> On Tue, Mar 21, 2017 at 10:33 AM, Navina Ramesh (Apache) <
>>>> nav...@apache.org>
>>>> wrote:
>>>>
>>>> > Hi Jagadish,
>>>> > Thanks for the suggestion. You are right in that it should be the
>>>> > responsibility of the JobCoordinator to assign identifiers.
>>>> >
>>>> > > 'm only wondering if this logic could instead reside inside the
>>>> > Job Coordinator (which is internal to the StreamProcessor) instead of
>>>> > relying on something external to it?
>>>> >
>>>> > I think this is a consequence of our initial StandaloneJobCoordinator,
>>>> > which is pretty much a pass-through. I didn't see any usage for
>>>> > getProcessorId() and was wondering why we put it in the JobCoordinator
>>>> > interface. I think I should keep your design proposal from last year
>>>> handy
>>>> > :) Thanks for pitching in!
>>>> >
>>>> >
>>>> > @All:
>>>> > Yesterday, there was a discussion on naming of the configuration used
>>>> in
>>>> > this SEP - whether it should be within the "job" scope or "app" scope
>>>> > (introduced by SAMZA-1041
>>>> > <https://issues.apache.org/jira/browse/SAMZA-1041>).  Multi-stage
>>>> feature
>>>> > and fluent-api for Samza introduces the notion of "application".
>>>> Since the
>>>> > processorId generator config applies to all jobs within a Samza
>>>> > application, we decided to add the config for generator under "app"
>>>> scope.
>>>> > Further details on config scope changes can be found in SAMZA-1120.
>>>> > <https://issues.apache.org/jira/browse/SAMZA-1120>
>>>> >
>>>> > I will send out an update once I change the SEP based on yesterday's
>>>> > meeting and Jagadish's idea.
>>>> >
>>>> > Thanks!
>>>> > Navina
>>>> >
>>>> > On Mon, Mar 20, 2017 at 5:22 PM, Jagadish Venkatraman <
>>>> > jagadish1...@gmail.com> wrote:
>>>> >
>>>> >> Thanks for writing this SEP!
>>>> >>
>>>> >> Here's an alternate approach instead of taking the "String
>>>> processorId" as
>>>> >> a parameter in the constructor. In my view, the "processorId" could
>>>> be
>>>> >> generated by the StreamProcessor internally (instead of being
>>>> generated
>>>> >> up-stream and passed in). The Job Coordinator API could be as
>>>> follows:
>>>> >>
>>>> >>
>>>> >> public interface JobCoordinator {
>>>> >>
>>>> >>  ProcessorIdGenerator getProcessorIDGenerator();
>>>> >>
>>>> >> // could be String getProcessorID()
>>>> >>
>>>> >>  JobModel getJobModel();
>>>> >>
>>>> >> }
>>>> >>
>>>> >> public interface ProcessorIDGenerator {
>>>> >>
>>>> >>  String getProcessorID();
>>>> >> }
>>>> >>
>>>> >>
>>>> >> For instance, an Yarn job coordinator can merely parse the ID from
>>>> config,
>>>> >> and return it. A Zk backed implementation of the Job coordinator can
>>>> agree
>>>> >> on IDs using coordination leveraging Zk. One nice property with this
>>>> >> approach is that it keeps all logic related to coordination,
>>>> agreement on
>>>> >> the Job Model, leader election (with potentially pluggable
>>>> components for
>>>> >> each) inside the JobCoordinator.
>>>> >>
>>>> >> To be clear, I'm all for pluggability for ID generation logic that
>>>> this
>>>> >> SEP
>>>> >> advocates. I'm only wondering if this logic could instead reside
>>>> inside
>>>> >> the
>>>> >> Job Coordinator (which is internal to the StreamProcessor) instead of
>>>> >> relying on something external to it?
>>>> >>
>>>> >> Of course, there may be other considerations around the way the
>>>> current
>>>> >> code is structured that may prevent this. Let me know if you agree
>>>> with
>>>> >> this change.
>>>> >>
>>>> >> Thanks,
>>>> >> Jag
>>>> >>
>>>> >>
>>>> >> On Thu, Mar 16, 2017 at 5:21 PM, Navina Ramesh
>>>> >> <nram...@linkedin.com.invalid
>>>> >> > wrote:
>>>> >>
>>>> >> > > I am working on the ApplicationRunner SEP right now. Will send
>>>> out the
>>>> >> > discussion email once I am done.
>>>> >> >
>>>> >> > Perfect! :)
>>>> >> >
>>>> >> > On Thu, Mar 16, 2017 at 5:13 PM, xinyu liu <xinyuliu...@gmail.com>
>>>> >> wrote:
>>>> >> >
>>>> >> > > Right, the static factory is very simple as you said. It's pretty
>>>> >> > > convenient for the client to use.
>>>> >> > >
>>>> >> > > I am working on the ApplicationRunner SEP right now. Will send
>>>> out the
>>>> >> > > discussion email once I am done.
>>>> >> > >
>>>> >> > > Thanks,
>>>> >> > > Xinyu
>>>> >> > >
>>>> >> > > On Thu, Mar 16, 2017 at 4:50 PM, Navina Ramesh (Apache) <
>>>> >> > nav...@apache.org
>>>> >> > > >
>>>> >> > > wrote:
>>>> >> > >
>>>> >> > > > > One minor thing I found is that the name of the config is
>>>> camel
>>>> >> case
>>>> >> > > > (*processor.idGenerator.class*). Seems Samza's practice is to
>>>> use
>>>> >> all
>>>> >> > > > lower
>>>> >> > > > case configs with "." delimiter. Do you think we should stick
>>>> to
>>>> >> this
>>>> >> > > > convention?
>>>> >> > > >
>>>> >> > > > I am always torn between the "convention" we have and the
>>>> better
>>>> >> way of
>>>> >> > > > doing things. But I don't have strong opinions about it. I can
>>>> >> change
>>>> >> > it.
>>>> >> > > >
>>>> >> > > > > One more suggestion is to have a static factory method in the
>>>> >> > > > ProcessorIdGenerator (Like what we have in ApplicationRunner):
>>>> >> > > >
>>>> >> > > > I couldn't grasp these requirements from the ApplicationRunner
>>>> >> design.
>>>> >> > It
>>>> >> > > > will be great if you can put it out in an SEP :)
>>>> >> > > >
>>>> >> > > > I can add the static factory method for it. Just to clarify,
>>>> the
>>>> >> static
>>>> >> > > > method simply class loads the ProcessorIdGenerator ? It uses
>>>> >> reflection
>>>> >> > > to
>>>> >> > > > create the instance ?
>>>> >> > > >
>>>> >> > > > Thanks!
>>>> >> > > > Navina
>>>> >> > > >
>>>> >> > > >
>>>> >> > > >
>>>> >> > > > On Thu, Mar 16, 2017 at 4:31 PM, xinyu liu <
>>>> xinyuliu...@gmail.com>
>>>> >> > > wrote:
>>>> >> > > >
>>>> >> > > > > The proposal looks great to me! Changing the id type to
>>>> string
>>>> >> will
>>>> >> > > make
>>>> >> > > > > sure this can work with other types of cluster which doesn't
>>>> >> support
>>>> >> > > > > integer id. The interface and config provides a pluggable
>>>> way to
>>>> >> have
>>>> >> > > > > different id generators for different use cases. One minor
>>>> thing I
>>>> >> > > found
>>>> >> > > > is
>>>> >> > > > > that the name of the config is camel case
>>>> >> > > (*processor.idGenerator.class*
>>>> >> > > > ).
>>>> >> > > > > Seems Samza's practice is to use all lower case configs with
>>>> "."
>>>> >> > > > delimiter.
>>>> >> > > > > Do you think we should stick to this convention?
>>>> >> > > > >
>>>> >> > > > > One more suggestion is to have a static factory method in
>>>> >> > > > > the ProcessorIdGenerator (Like what we have in
>>>> ApplicationRunner):
>>>> >> > > > >
>>>> >> > > > > static ProcessIdGenerator fromConfig(Config config) { ... }.
>>>> >> > > > >
>>>> >> > > > > With this, It will be more convenient for the
>>>> ApplicationRunner to
>>>> >> > > > > construct the generator. What do you think?
>>>> >> > > > >
>>>> >> > > > > Thanks,
>>>> >> > > > > Xinyu
>>>> >> > > > >
>>>> >> > > > > On Wed, Mar 15, 2017 at 10:59 PM, Navina Ramesh (Apache) <
>>>> >> > > > > nav...@apache.org>
>>>> >> > > > > wrote:
>>>> >> > > > >
>>>> >> > > > > > Hi everyone,
>>>> >> > > > > > I created a proposal for SAMZA-1126, which addresses the
>>>> >> semantics
>>>> >> > of
>>>> >> > > > > > ProcessorId in Samza. For most purposes, ProcessorId is
>>>> same as
>>>> >> the
>>>> >> > > > > logical
>>>> >> > > > > > id that Samza assigns for each Yarn container. It is
>>>> primarily
>>>> >> used
>>>> >> > > in
>>>> >> > > > > > JobModel as a key for the corresponding ContainerModel and
>>>> >> also, in
>>>> >> > > > > > container-level metrics. We are expanding the
>>>> applicability of
>>>> >> > > > > processorId
>>>> >> > > > > > to be beyond a fixed set of processors.
>>>> >> > > > > >
>>>> >> > > > > > Please review and comment on this SEP.
>>>> >> > > > > >
>>>> >> > > > > > For those who are not actively following the master
>>>> branch, you
>>>> >> may
>>>> >> > > > have
>>>> >> > > > > > more questions than others. Feel free to ask them here.
>>>> >> > > > > >
>>>> >> > > > > > @Xinyu: Since you are working on SAMZA-1067 and other
>>>> related
>>>> >> > > > integration
>>>> >> > > > > > APIs, can you please add an SEP for SAMZA-1067 ? This will
>>>> help
>>>> >> > > others
>>>> >> > > > > (adn
>>>> >> > > > > > me as well) get on the same page with your design/code.
>>>> Let me
>>>> >> know
>>>> >> > > if
>>>> >> > > > > > SEP-1 will work per your design for ApplicationRunner.
>>>> >> > > > > >
>>>> >> > > > > > Thanks!
>>>> >> > > > > > Navina
>>>> >> > > > > >
>>>> >> > > > >
>>>> >> > > >
>>>> >> > >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Navina R.
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Jagadish V,
>>>> >> Graduate Student,
>>>> >> Department of Computer Science,
>>>> >> Stanford University
>>>> >>
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>

Re: [DISCUSS] SEP-1: Semantics of ProcessorId in Samza

Reply via email to