Ivan Bessonov created IGNITE-19133:
--------------------------------------
Summary: Increase partitions count upper bound
Key: IGNITE-19133
URL: https://issues.apache.org/jira/browse/IGNITE-19133
Project: Ignite
Issue Type: Improvement
Reporter: Ivan Bessonov
h3. Problem
Data partitioning is used to distribute data (hopefully) evenly across the
cluster and to provide the necessary operation parallelism for the end user. As
a rule of thumb, one may allocate 256 partitions per Ignite node to achieve
that.
This rule only scales up to a certain point. Imagine a cluster of 1000 nodes,
with a table that has 3 replicas of each partition (the ability to lose 1
backup). With the current limit of 65500 partitions, the maximum number of
partitions per node would be {{65500 * 3 / 1000 ~= 196}}. This is the limit of
our scalability, according to the aforementioned rule. To provide 256
partitions per node, the user would have to:
* either increase the number of backups, which proportionally increases the
required storage space (and thus the cost),
* or increase the total number of partitions to about 85 thousand
({{256 * 1000 / 3 ~= 85333}}). This is not possible right now.
h3. What's the reason for the current limit
Disclaimer: I'm not the one who designed it, so my thoughts may be speculative
in some sense.
The short answer is: the number of partitions needs to fit into 2 bytes.
The long answer: in the current implementation we have a 1-to-1 correspondence
between the logical partition id and the physical partition id. We use the same
value both in affinity and in the physical file name. This makes the system
simpler, and I believe that simplicity is the real explanation for the
restriction.
Why does it have to be 2 bytes, and not 3, for example? The key is the
structure of page identifiers in data regions:
{code:java}
+---------------------------+----------------+---------------------+-----------------+
| rotation/item id (1 byte) | flags (1 byte) | partition (2 bytes) | index (4 bytes) |
+---------------------------+----------------+---------------------+-----------------+{code}
The idea was to fit it into 8 bytes. Good idea, in my opinion. Making it bigger
doesn't feel right.
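For illustration, here is a minimal sketch of how such an 8-byte id could be
packed and unpacked; the shifts and method names are assumptions made for this
example, not Ignite's actual internals:
{code:java}
// Hypothetical packing of the 8-byte page id layout from the diagram above.
static long pageId(int itemId, int flags, int partId, int pageIdx) {
    return ((long) (itemId & 0xFF) << 56)   // rotation/item id, 1 byte
         | ((long) (flags & 0xFF) << 48)    // flags, 1 byte
         | ((long) (partId & 0xFFFF) << 32) // partition, 2 bytes
         | (pageIdx & 0xFFFF_FFFFL);        // index, 4 bytes
}

static int partitionId(long pageId) {
    // Only 2 bytes are available for the partition,
    // hence the 65500 (< 65536) upper bound.
    return (int) ((pageId >>> 32) & 0xFFFF);
}
{code}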
h3. Proposed solution
As mentioned, there are two components to the problem:
# The one-to-one correspondence between logical and physical partition ids.
# The hard limit within a single data region, caused by the page id layout.
There's not much we can do about component #2, because the implications are
unpredictable and the amount of code we would need to fix is astonishing.
h4. More reasonable restrictions
This leads us to the following problem: no single Ignite node can have more
than 65500 partitions for a table (or distribution zone). So, imagine the
situation:
* the user has a cluster with 3 nodes
* the user tries to create a distribution zone with 3 nodes, 3 replicas of each
partition and 100000 partitions
While this is absurd, the configuration is still "valid", yet it leads to 100k
partitions per node, which is impossible.
Such zone configurations must be banned. This restriction doesn't seem
unreasonable: if a user wants to start that many partitions on such a small
cluster, they really don't understand what they're doing.
This naturally gives us a minimum number of nodes for a given number of
partitions, as the following formula (assuming that you can't have 2 replicas
of the same partition on the same Ignite node):
{code:java}
nodes >= max(replicas, ceil(partitions * replicas / 65500))
{code}
This estimation is imprecise, because it assumes a perfect distribution. In
reality, rendezvous affinity is uneven, so the real value must be checked when
the user configures the number of nodes for a specific distribution zone.
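As a sketch, such a validation could look like the following; the method and
constant names are hypothetical, and a real check would also account for the
unevenness of rendezvous affinity:
{code:java}
// Hypothetical zone validation based on the formula above.
static final int MAX_PARTITIONS_PER_NODE = 65_500;

static void validateZone(int nodes, int partitions, int replicas) {
    int minNodes = Math.max(
        replicas,
        (int) Math.ceil((double) partitions * replicas / MAX_PARTITIONS_PER_NODE));

    if (nodes < minNodes) {
        throw new IllegalArgumentException("At least " + minNodes + " nodes are"
            + " required for " + partitions + " partitions with " + replicas
            + " replicas, but only " + nodes + " are configured");
    }
}

// validateZone(3, 100_000, 3) fails: at least 5 nodes are required.
{code}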
h4. Ties to rebalance
For this question I will probably need some assistance. During affinity
reassignment, each node may store more partitions than stated in any single
distribution. What do I mean by this:
* imagine a node having partitions 1, 2, and 3
* after the reassignment, the node has partitions 3, 4 and 5
Each individual distribution states that the node only has 3 partitions, while
during the rebalance it may store all 5: sending 1 and 2 to some node, and
receiving 4 and 5 from some other node.
Multiply that by a big factor, and it is possible to have a situation where the
local number of partitions exceeds 65500. The only way to beat it, in my
opinion, is to lower the hard limit in the affinity function to 32xxx per node,
leaving space for partitions in a MOVING state.
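To put a rough number on it: during rebalance a node may hold partitions from
both the old and the new assignment, so capping each assignment at half of the
physical limit keeps their union within bounds. A back-of-the-envelope sketch
(the constants are assumptions):
{code:java}
// A node may temporarily hold partitions from the old and the new assignment.
static final int PHYSICAL_LIMIT = 65_500;              // page id layout limit
static final int AFFINITY_LIMIT = PHYSICAL_LIMIT / 2;  // 32750, the "32xxx" above

static boolean fitsDuringRebalance(int oldAssignment, int newAssignment) {
    // If both assignments are capped at AFFINITY_LIMIT,
    // their union never exceeds PHYSICAL_LIMIT.
    return oldAssignment + newAssignment <= PHYSICAL_LIMIT;
}
{code}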
h4. Mapping partition ids
With that being said, all that's left is to map logical partition ids from the
range 0..N (where N is unlimited) to physical ids in the range 0..65500.
Such a mapping is a local entity, encapsulated deep inside the storage engine.
The simplest way to do this is to have a HashMap \{ logical -> physical } and
to increase the physical partition id by 1 every time a new value is inserted.
If the {{values()}} set is not contiguous, one may occupy a gap; it's not too
hard to implement.
Of course, this correspondence must be persisted with the data. When we read a
physical partition from the storage, we must be able to know its logical id.
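A minimal sketch of such a mapper, assuming an in-memory map; persistence and
concurrency are omitted, and all names are hypothetical:
{code:java}
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Maps unbounded logical partition ids to physical ids in 0..65500,
// reusing gaps left by dropped partitions.
class PartitionIdMapper {
    private static final int MAX_PHYSICAL_PARTITIONS = 65_500;

    private final Map<Integer, Integer> logicalToPhysical = new HashMap<>();
    private final BitSet usedPhysicalIds = new BitSet(MAX_PHYSICAL_PARTITIONS);

    int physicalId(int logicalId) {
        return logicalToPhysical.computeIfAbsent(logicalId, id -> {
            int physical = usedPhysicalIds.nextClearBit(0); // occupies a gap, if any
            if (physical >= MAX_PHYSICAL_PARTITIONS)
                throw new IllegalStateException("No free physical partition ids");
            usedPhysicalIds.set(physical);
            return physical;
        });
    }

    void drop(int logicalId) {
        Integer physical = logicalToPhysical.remove(logicalId);
        if (physical != null)
            usedPhysicalIds.clear(physical); // this gap can be reused later
    }
}
{code}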
--
This message was sent by Atlassian Jira
(v8.20.10#820010)