[
https://issues.apache.org/jira/browse/SAMZA-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984562#comment-13984562
]
Jakob Homan commented on SAMZA-123:
-----------------------------------
First, a couple meta notes with my Champion hat on, since this is Samza's first
reasonably sized code debate and we have lots of new-to-ASF community members:
* This type of involved, heavily quoted, polite but emphatic discussion is
healthy and encouraged. Everyone should feel free to get involved; don't be
intimidated by the length of the discussion thus far.
* It may take a while, but we'll reach consensus on whatever the implementation
is. These types of decisions need to be consensus rather than majority-vote to
ensure there are no losing sides (or any sides, really). Consensus means
everyone can live with the decision (not that they all love it though), whereas
voting means there would be a group who got shut out entirely, and that's not
healthy for the community.
Now, I think there are four broad groups of issues in play in this JIRA:
# Those we agree on, mainly that this is a useful feature to have and that the
broad approach as detailed in the design doc is the correct one for the moment.
# Those about which questions have been raised and answered, such as
those raised by [~jkreps], [~sriramsub] and [~criccomini]. Guys, are you ok
with the answers provided above, except those still under discussion as
described below?
# Those that have been modified as part of the discussion (and I'll update the
design doc with the new description):
** GroupIntoNSets is confusing/not very useful as defined. Chris had a good
idea to set the N there to be the same as the number of containers. This is
likely its most common use case. This will move the code into some package
that's aware of YARN since it requires a YARN-specific config (or we need to
change the yarn container count config to be more generic, but that's out of
scope here).
** The bookkeeping necessary to support state and these new features should be
kept away from the state log itself and moved into some central location with
other per-job info (whatever form that takes).
# Finally, those questions on which we still do not have consensus:
** *Terminology: cohort.* Looks like the participants thus far are numerically
even, with those leaning against cohort more vehement than those for. I'm not
wild about shard, as I think that's a pretty key term for databases and may
lead people to think this grouping functions the same way. Task Name has a
pretty big flaw: it conflates these tasks with map-reduce tasks, when in fact
our closest analog is the Samza container. Since Samza is pitched as
map-reduce for streams, it's worth keeping our kinda-the-same-but-not-really
concepts as separate as possible lexically from map-reduce's.
** *Pluggable: yea or nay.* This feature dramatically increases the power of
the framework. It's a reasonable generalization of a previously hard-coded
assumption about how to group inputs together. The provided interface is clean
and gives a Config for supporting per-implementation values. It manipulates
bedrock classes and concepts (SSPs and tasks/TaskInstances), which are very
unlikely to change as Samza progresses. Having been heavily involved in
creating, maintaining, deprecating and
[codifying|https://issues.apache.org/jira/browse/HADOOP-5073] APIs in Hadoop,
which had quite a few production uses at the time, and in hashing out new APIs
in [ASF podlings|https://issues.apache.org/jira/browse/GIRAPH-83], I'd strongly
recommend a policy of openness now and clamping that down as the project grows
(particularly after a 1.0 release). Further, Samza is already pluggable in all
its significant functionality - serdes, state stores, lifecycle listeners,
checkpoint managers, metrics, stream job type, etc. - so keeping this
functionality closed would be a big omission. The closest Hadoop analog class is the
[Partitioner|http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Partitioner.html],
which plays a key role in making that framework extensible and useful.
Samza's version should similarly be open and pluggable. Making it so now
allows us to advertise the fact and invite people to play with it - where they
can find and report its limitations if they exist - rather than bury it in the
code. Essentially, it's a chicken-egg thing - people won't use this feature
and find its limitations/build cooler things with Samza, if it's not a feature
we publicize and make available.
** *Where to store the extra info needed for this feature.* Non-Kafka-log
based solutions like ZK may be a good idea, but they would be a huge change
that blows up the size of the patch and ignites lots more good discussion.
Creating a general purpose log (ConfigLog sounds good to me; it would hold the
total config necessary for any SC to be started and would be written by the AM
and read by each SC on startup) sounds like a good start. Immediate subsequent
JIRAs can
determine the fate of the checkpoint log, what else should go into the
ConfigLog, etc. This approach seems like the smallest necessary change to
implement the current JIRA and doesn't expose anything that can't be changed in
the future, absent more discussion.
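To make the pluggable-grouper point above a bit more concrete, here is a rough
Java sketch of what such a plug-in point could look like. It is not the
interface from the patch or the design doc: the names Ssp, SspGrouper,
GroupByPartition and the "container.count" key are all hypothetical, and a
plain Map stands in for Samza's Config. GroupIntoNSets only borrows its name
from the discussion; the round-robin spreading is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a system/stream/partition triple (an "SSP").
final class Ssp {
    final String system;
    final String stream;
    final int partition;

    Ssp(String system, String stream, int partition) {
        this.system = system;
        this.stream = stream;
        this.partition = partition;
    }

    @Override
    public String toString() {
        return system + "." + stream + "#" + partition;
    }
}

// Hypothetical plug-in point: maps the job's input SSPs to named groups,
// one group per task instance. The config map stands in for Samza's Config.
interface SspGrouper {
    Map<String, List<Ssp>> group(List<Ssp> ssps, Map<String, String> config);
}

// Mirrors the behavior described in the issue: one group per partition id.
final class GroupByPartition implements SspGrouper {
    @Override
    public Map<String, List<Ssp>> group(List<Ssp> ssps, Map<String, String> config) {
        Map<String, List<Ssp>> groups = new TreeMap<>();
        for (Ssp ssp : ssps) {
            groups.computeIfAbsent("partition-" + ssp.partition, k -> new ArrayList<>()).add(ssp);
        }
        return groups;
    }
}

// Spreads SSPs round-robin over N groups, with N read from config
// (e.g. set to the number of containers, as suggested in the discussion).
final class GroupIntoNSets implements SspGrouper {
    // Assumed key for illustration only; not a real Samza config name.
    static final String CONTAINER_COUNT_KEY = "container.count";

    @Override
    public Map<String, List<Ssp>> group(List<Ssp> ssps, Map<String, String> config) {
        int n = Integer.parseInt(config.getOrDefault(CONTAINER_COUNT_KEY, "1"));
        if (n < 1) {
            throw new IllegalArgumentException("group count must be at least 1, got " + n);
        }
        Map<String, List<Ssp>> groups = new TreeMap<>();
        for (int i = 0; i < ssps.size(); i++) {
            groups.computeIfAbsent("set-" + (i % n), k -> new ArrayList<>()).add(ssps.get(i));
        }
        return groups;
    }
}
```

Under these assumptions, a job would name one SspGrouper implementation in its
config, the AM would call group() once, and each resulting group would be
handed to a task instance. Swapping strategies then means swapping a class
name rather than changing container code.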
> Move topic partition grouping to the AM and generalize
> ------------------------------------------------------
>
> Key: SAMZA-123
> URL: https://issues.apache.org/jira/browse/SAMZA-123
> Project: Samza
> Issue Type: Sub-task
> Components: container
> Affects Versions: 0.6.0
> Reporter: Jakob Homan
> Assignee: Jakob Homan
> Attachments: SAMZA-123-design-doc.md, SAMZA-123-design-doc.pdf
>
>
> Currently the AM sends a set of all the topics and partitions to the
> container, which then groups them by partition and assigns each set to a task
> instance. By moving the grouping to the AM, we can assign arbitrary groups to
> task instances, which will allow more partitioning strategies, as discussed
> in SAMZA-71.