Hello all, as part of investigation
https://issues.apache.org/jira/browse/APEXCORE-332 I dove pretty deep into the
present implementation of the Partitioner, and the classes that leverage it
including the PartitionAwareSink and DefaultPartition.
Given that this is a rather core utility of the Apex code as well as something
that external implements interface with, I fully understand both the difficulty
and trepidation around making changes. I would particularly love feedback from
folks most familiar with it.
Current Implementation (my understanding)
The current approach to partitioning data is to define two objects: 1) A set of
partition keys and 2) an integer mask. The basic idea is that there is an
overarching set of partition keys, only some of which may be valid for a given
port. Within the PartitionAwareSink which handles routing of the data, the
class applies this mask to a “code” generated by an individual piece of data
via an & operation and checks whether the set of partitions associated with a
particular port will accept this data:
partitions.contains(serde.getPartition(payload) & mask);
Problem:
The partitioning implementation is intimately tied in with the rest of the
code. For example, the PartitionAwareSink presently needs to know about both
the Set of partitions it is associated with and the keys it manages (via the
mask) in order to accept a tuple. The StreamCodecWrapper also needs to know
about both of these. The StreamingContainer stores and passes both these
objects to the sub-classes that use it, e.g. Sinks.
Proposal:
Given that there is a need to implement different partitioning schemes, some of
which may not be doable with a mask-based approach, e.g. Range partitioning to
allow for partitioning data not in powers of 2, I propose abstracting away the
implementation of the actual partitioning scheme from the components that use
it. I believe this can be done with a fairly simple interface. I believe we
can add a class (e.g. PartitionAssigner) that is aware of the internal
structure of a given Partition implementation but that provide a single method
to outside classes “boolean accept(T object)”. In the current implementation,
it would use the mask + partitionKeys approach to compute:
partitions.contains(serde.getPartition(payload) & mask); In an alternative
implementation, e.g. Range-based, it could check: for(Range r: partitions) {
r.contains(serde.getPartition(payload)) }.
This would necessitate updating the Partition class to add something akin to a
“getPartitionAssigner” and remove the accessors for partitionKeys (which
includes a mask), and would thus affect existing implementations of the
Partition class. Consequently, it may make sense to do this as part of the
upcoming major 4.0 release. The advantage of doing this is that the code
becomes much easier to parse and understand, easier to maintain because now in
addition to having pluggable implementations of the Partitioner, we now also
have pluggable implementations for Partition, and gives us the ability to deal
with complex partitioning schemes. For example, we could easily create
partitioning schemes where one partition sees 25% of the data, another sees
50%, and a third sees 100%. This can be useful for evaluating performance of an
algorithm under different load within the same application. Or, we have the
ability to evenly split data across multiple partitions even if we have a
number of partitions not equal to a power of two.
I would appreciate any feedback, and I’d be happy to whip together a PR
demonstrating my proposal. My current PR here:
https://github.com/apache/incubator-apex-core/pull/347/files demonstrates what
a Range-based Partition implementation would look like but also serves to show
how many touch points there are with adding a different underlying
implementation.
________________________________________________________
The information contained in this e-mail is confidential and/or proprietary to
Capital One and/or its affiliates and may only be used solely in performance of
work or services for Capital One. The information transmitted herewith is
intended only for use by the individual or entity to which it is addressed. If
the reader of this message is not the intended recipient, you are hereby
notified that any review, retransmission, dissemination, distribution, copying
or other use of, or taking of any action in reliance upon this information is
strictly prohibited. If you have received this communication in error, please
contact the sender and delete the material from your computer.