Proposed update to partitioning logic

Ganelin, Ilya Fri, 27 May 2016 14:07:32 -0700

Hello all, as part of investigation 
https://issues.apache.org/jira/browse/APEXCORE-332 I dove pretty deep into the 
present implementation of the Partitioner, and the classes that leverage it 
including the PartitionAwareSink and DefaultPartition.


Given that this is a rather core utility of the Apex code as well as something 
that external implements interface with, I fully understand both the difficulty 
and trepidation around making changes. I would particularly love feedback from 
folks most familiar with it.

Current Implementation (my understanding)
The current approach to partitioning data is to define two objects: 1) A set of 
partition keys and 2) an integer mask. The basic idea is that there is an 
overarching set of partition keys, only some of which may be valid for a given 
port. Within the PartitionAwareSink which handles routing of the data, the 
class applies this mask to a “code” generated by an individual piece of data 
via an & operation and checks whether the set of partitions associated with a 
particular port will accept this data: 
partitions.contains(serde.getPartition(payload) & mask);

Problem:
The partitioning implementation is intimately tied in with the rest of the 
code. For example, the PartitionAwareSink presently needs to know about both 
the Set of partitions it is associated with and the keys it manages (via the 
mask) in order to accept a tuple. The StreamCodecWrapper also needs to know 
about both of these. The StreamingContainer stores and passes both these 
objects to the sub-classes that use it, e.g. Sinks.

Proposal:
Given that there is a need to implement different partitioning schemes, some of 
which may not be doable with a mask-based approach, e.g. Range partitioning to 
allow for partitioning data not in powers of 2, I propose abstracting away the 
implementation of the actual partitioning scheme from the components that use 
it. I believe this can be done with a fairly simple interface.  I believe we 
can add a class (e.g. PartitionAssigner) that is aware of the internal 
structure of a given Partition implementation but that provide a single method 
to outside classes “boolean accept(T object)”. In the current implementation, 
it would use the mask + partitionKeys approach to compute: 
partitions.contains(serde.getPartition(payload) & mask); In an alternative 
implementation, e.g. Range-based, it could check: for(Range r: partitions) { 
r.contains(serde.getPartition(payload)) }.

This would necessitate updating the Partition class to add something akin to a 
“getPartitionAssigner” and remove the accessors for partitionKeys (which 
includes a mask), and would thus affect existing implementations of the 
Partition class. Consequently, it may make sense to do this as part of the 
upcoming major 4.0 release. The advantage of doing this is that the code 
becomes much easier to parse and understand, easier to maintain because now in 
addition to having pluggable implementations of the Partitioner, we now also 
have pluggable implementations for Partition, and gives us the ability to deal 
with complex partitioning schemes. For example, we could easily create 
partitioning schemes where one partition sees 25% of the data, another sees 
50%, and a third sees 100%. This can be useful for evaluating performance of an 
algorithm under different load within the same application. Or, we have the 
ability to evenly split data across multiple partitions even if we have a 
number of partitions not equal to a power of two.

I would appreciate any feedback, and I’d be happy to whip together a PR 
demonstrating my proposal. My current PR here: 
https://github.com/apache/incubator-apex-core/pull/347/files demonstrates what 
a Range-based Partition implementation  would look like but also serves to show 
how many touch points there are with adding a different underlying 
implementation.


________________________________________________________

The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.

Proposed update to partitioning logic

Reply via email to