[jira] [Commented] (SAMZA-123) Move topic partition grouping to the AM and generalize

Jakob Homan (JIRA) Tue, 29 Apr 2014 10:44:59 -0700

    [ 
https://issues.apache.org/jira/browse/SAMZA-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984562#comment-13984562
 ]


Jakob Homan commented on SAMZA-123:
-----------------------------------

First, a couple meta notes with my Champion hat on, since this is Samza's first 
reasonably sized code debate and we have lots of new-to-ASF community members:
* This type of involved, heavily quoted, polite but emphatic discussion is 
healthy and encouraged.  Everyone should feel free to get involved; don't be 
intimidated by the length of the discussion thus far.
* It may take a while, but we'll reach consensus on whatever the implementation 
is.  These types of decisions need to be consensus rather than majority-vote to 
ensure there are no losing sides (or any sides, really).  Consensus means 
everyone can live with the decision (not that they all love it though), whereas 
voting means there would be a group who got shut out entirely, and that's not 
healthy for the community.

Now, I think there are four broad groups of issues in play in this JIRA:
# Those we agree on, mainly that this is a useful feature to have and that the 
broad approach as detailed in the design doc is the correct one for the moment.
# Those about which questions have been raised and answered, such as those 
raised by [~jkreps], [~sriramsub] and [~criccomini].  Guys, are you ok with 
the answers provided above, except those still under discussion as described 
below?
# Those that have been modified as part of the discussion (and I'll update the 
design doc with the new description):
** GroupIntoNSets is confusing/not very useful as defined.  Chris had a good 
idea to set the N there to be the same as the number of containers.  This is 
likely its most common use case.  This will move the code into some package 
that's aware of YARN since it requires a YARN-specific config (or we need to 
change the yarn container count config to be more generic, but that's out of 
scope here).
** The bookkeeping necessary to support state and these new features should be 
kept away from the state log itself and moved into some central location with 
other per-job info (whatever form that takes).
# Finally, those questions on which we still do not have consensus:
** *Terminology: cohort.*  Looks like the participants thus far are numerically 
even, with those leaning against cohort more vehement than those for.  I'm not 
wild about shard, as I think that's a pretty key term for databases and may 
lead people to think this grouping functions the same way.  Task Name has a 
pretty big flaw: it conflates these tasks with map-reduce tasks, when in fact 
our closest analog is the Samza container.  Since Samza is pitched as 
map-reduce for streams, it's worth keeping our kinda-the-same-but-not-really 
concepts as separate as possible lexically from map-reduce's. 
** *Pluggable: yea or nay.*  This feature dramatically increases the power of 
the framework.  It's a reasonable generalization of a previously hard-coded 
assumption about how to group inputs together.  The provided interface is clean 
and gives a Config for supporting per-implementation values.  It manipulates 
bedrock classes and concepts (SSPs and tasks/TaskInstances), which are very 
unlikely to change as Samza progresses.  Having been heavily involved in 
creating, maintaining, deprecating and 
[codifying|https://issues.apache.org/jira/browse/HADOOP-5073] APIs in Hadoop, 
which had quite a few production uses at the time, and in hashing out new APIs 
in [ASF podlings|https://issues.apache.org/jira/browse/GIRAPH-83], I'd strongly 
recommend a policy of openness now, clamping that down as the project grows 
(particularly after a 1.0 release).  Further, Samza is already pluggable in all 
its significant functionality (serdes, state stores, lifecycle listeners, 
checkpoint managers, metrics, stream job type, etc.), so keeping this 
functionality closed would be a big omission.  The closest Hadoop analog class is the 
[Partitioner|http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Partitioner.html],
 which plays a key role in making that framework extensible and useful.  
Samza's version should similarly be open and pluggable.  Making it so now 
allows us to advertise the fact and invite people to play with it - where they 
can find and report its limitations if they exist - rather than bury it in the 
code.  Essentially, it's a chicken-egg thing - people won't use this feature 
and find its limitations/build cooler things with Samza, if it's not a feature 
we publicize and make available.
** *Where to store the extra info needed for this feature.*  Non-Kafka-log 
based solutions like ZK may be a good idea, but would be a huge change that 
would blow up the size of the patch and ignite lots more good discussion.  
Creating a general 
purpose log (ConfigLog sounds good to me; it would hold the total config 
necessary for any SC to be started and would be written by the AM and read by 
each SC on startup) sounds like a good start.  Immediate subsequent JIRAs can 
determine the fate of the checkpoint log, what else should go into the 
ConfigLog, etc.  This approach seems like the smallest necessary change to 
implement the current JIRA and doesn't expose anything that can't be changed in 
the future, absent more discussion.
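
To make the pluggable-grouper point above concrete, here is a minimal sketch of what a swappable grouping function could look like. It is written in Python purely for brevity and is not Samza's actual interface: the names SSP, group_by_partition, group_into_n_sets and GROUPERS, and the config keys "grouper" and "container.count", are all hypothetical. The round-robin distribution in group_into_n_sets is likewise an assumption; the discussion above only settles that N should match the number of containers.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple, Set


class SSP(NamedTuple):
    """Hypothetical stand-in for a system/stream/partition triple."""
    system: str
    stream: str
    partition: int


Groups = Dict[str, Set[SSP]]
Grouper = Callable[[Iterable[SSP], Dict[str, str]], Groups]


def group_by_partition(ssps: Iterable[SSP], config: Dict[str, str]) -> Groups:
    """Today's fixed behavior per the issue description: one group per partition id."""
    groups = defaultdict(set)
    for ssp in ssps:
        groups[f"partition-{ssp.partition}"].add(ssp)
    return dict(groups)


def group_into_n_sets(ssps: Iterable[SSP], config: Dict[str, str]) -> Groups:
    """Spread SSPs over N groups, with N read from a (hypothetical) container count key."""
    n = int(config["container.count"])
    if n < 1:
        raise ValueError("container.count must be at least 1")
    groups = {f"group-{i}": set() for i in range(n)}
    # Sort first so the assignment is deterministic across restarts.
    for i, ssp in enumerate(sorted(ssps)):
        groups[f"group-{i % n}"].add(ssp)
    return groups


# Hypothetical registry: the implementation is chosen by a config value.
GROUPERS: Dict[str, Grouper] = {
    "by-partition": group_by_partition,
    "into-n-sets": group_into_n_sets,
}


def group(ssps: Iterable[SSP], config: Dict[str, str]) -> Groups:
    """Look up the configured grouper and apply it, defaulting to by-partition."""
    grouper = GROUPERS[config.get("grouper", "by-partition")]
    return grouper(ssps, config)
```

In this sketch each grouper receives the full config, mirroring the point that the interface hands implementations a Config for their own settings, and a job would opt into a different strategy by changing one config value rather than patching the container code.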

> Move topic partition grouping to the AM and generalize
> ------------------------------------------------------
>
>                 Key: SAMZA-123
>                 URL: https://issues.apache.org/jira/browse/SAMZA-123
>             Project: Samza
>          Issue Type: Sub-task
>          Components: container
>    Affects Versions: 0.6.0
>            Reporter: Jakob Homan
>            Assignee: Jakob Homan
>         Attachments: SAMZA-123-design-doc.md, SAMZA-123-design-doc.pdf
>
>
> Currently the AM sends a set of all the topics and partitions to the 
> container, which then groups them by partition and assigns each set to a task 
> instance. By moving the grouping to the AM, we can assign arbitrary groups to 
> task instances, which will allow more partitioning strategies, as discussed 
> in SAMZA-71.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
