Thanks for the comments and inputs, Ted, Josef and Kezhu. The discussions
are great.

> The invariant is that the value should be increasing except in
failure modes. You found a somewhat surprising failure mode.
Does this mean the value is supposed to be always increasing, never
decreasing or remaining constant, otherwise it's a failure mode and we need
to fix it?

> Please compute how long it would take for a 64-bit counter to overflow
if incremented a million times per second. (hint, half a million years).
Haha, it would take way too long (> 0.3M years) for a 64-bit counter to
overflow, so we don't need to worry about it.

> It would be better to use a longer integer.
>   1.  Increasing to a 64-bit counter certainly solves the problem, but
this might require conversion of zk data when the current counter is stored
as 32-bit
>   I support this, but it demands massive work and probably relates to
a long term goal[6].  The "int" fact of "cversion" is exposed both
in API(Stat) and storage(StatPersisted).
I agree. Using a 64-bit counter is a better way but the scope of the change
is big. If we have a consensus on this is the way to go, maybe we can scope
it out and break it into sub-tasks so we can have more people contribute to
it if there is a need.

> I wrote a test[1] and found that overflow of cversion results
in NODEEXISTS[2]. I think this is better than what the doc[3] says.
> There is no wrap-around on the client side.
NODEEXISTS is thrown because the same prefix is used in the test. If you
use different prefixes, you will get the wrap-around, as the child path is
different even if the sequence is the same.

> I think you could take a look at ZOOKEEPER-4332[4]. ZooKeeper does not
work well with massive children
Yes, to work around that, we only keep the most recent x number of children
and have a paginated way of listing the children.

> This breaks "monotonically increasing" and gains no"uniqueness". The
cversion will wrap around again given your cases.
It breaks "monotonically increasing" only at the point of overflow, it is
"monotonically increasing" before overflow and after overflow.
It gains more ""uniqueness" as the range of "uniqueness" is increased from
[0, 2147483647] to [-2147483648, 2147483647].
This is not ideal and I agree using a 64 counter would be better if we can.

Thanks,

Li

On Thu, Jun 15, 2023 at 9:31 PM Kezhu Wang <kez...@gmail.com> wrote:

> Hi,
>
> > Note: the counter used to store the next sequence number is a signed int
> (4bytes) maintained by the parent node, the counter will overflow when
> incremented beyond 2147483647 (resulting in a name "-2147483648").
>
> I wrote a test[1] and found that overflow of cversion results in
> NODEEXISTS[2]. I think this is better than what the doc[3] says. There
> is no wrap-around on the client side.
>
> > In our use case, we create *persistent* *sequential* nodes. We store the
> sequence id in the client application and use it as a globally unique id.
>
> Given that you have overflowed the cversion, I think you could take a
> look at ZOOKEEPER-4332[4]. ZooKeeper does not work well with massive
> children when listing. BookKeeper's HierarchicalLedgerManager[5] is a
> real world example for this.
>
>
> > New
> > ===
> > if (parentCVersion > currentParentCVersion
> >                 *|| parentCVersion == Integer.MIN_VALUE &&
> > currentParentCVersion == Integer.MAX_VALUE) *{
> >                 parent.stat.setCversion(parent
> > CVersion);
> >                 parent.stat.setPzxid(zxid);
> >           }
>
> This breaks "monotonically increasing" and gains no "uniqueness". The
> cversion will wrap around again given your cases.
>
> > It would be better to use a longer integer.
> >   1.  Increasing to a 64-bit counter certainly solves the problem, but
> this might require conversion of zk data when the current counter is stored
> as 32-bit
>
> I support this, but it demands massive work and probably relates to a
> long term goal[6].  The "int" fact of "cversion" is exposed both in
> API(Stat) and storage(StatPersisted).
>
>
> [1]:
> https://github.com/kezhuw/zookeeper/commit/755b1168156c28e4fc2813be593ac67514e8bdc7#diff-1af986ce48b5d4bb4b8e51374a70cc6e109a04c70d9f450be3df8f302010341cR59
> [2]:
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L675
> [3]:
> https://zookeeper.apache.org/doc/r3.8.1/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming
> [4]: https://issues.apache.org/jira/browse/ZOOKEEPER-4332
> [5]:
> https://bookkeeper.apache.org/docs/getting-started/concepts/#hierarchical-ledger-manager
> [6]: https://issues.apache.org/jira/browse/ZOOKEEPER-102
>
> Kezhu Wang
>
>
> On Fri, Jun 16, 2023 at 11:33 AM Josef Roehrl
> <josef.roe...@fuseforward.com.invalid> wrote:
> >
> > I wanted to add 2 things.
> >
> >
> >   1.  Increasing to a 64-bit counter certainly solves the problem, but
> this might require conversion of zk data when the current counter is stored
> as 32-bit
> >   2.  A client that relies on a unique version that it uses as a
> reference outside of zk should verify that a version that it receives does
> not already exist outside zk. This applies even if 1. is considered, should
> a zk quorum be reset or lose its data.
> >
> > Josef Roehrl
> > FuseForward | Senior Architect - Professional Services
> > [
> https://fuseforward.atlassian.net/wiki/download/attachments/512327681/image001.png?version=1&modificationDate=1537397840684&cacheVersion=1&api=v2
> ]
> > Website<
> https://fuseforward.com/?utm_source=Email%20Signature&utm_medium=email%20signature&utm_campaign=email%20signature>
> | Newsletter<
> https://fuseforward.com/subscribe-to-our-newsletter/?utm_source=Email%20Signature&utm_medium=Email%20Signature&utm_campaign=Email%20Signature>
> | Twitter<https://twitter.com/fuseforward> | LinkedIn<
> https://www.linkedin.com/company/fuseforward/?originalSubdomain=ca>
> >
> > ________________________________
> > From: Ted Dunning <ted.dunn...@gmail.com>
> > Sent: Thursday, June 15, 2023 6:53 PM
> > To: dev@zookeeper.apache.org <dev@zookeeper.apache.org>
> > Subject: Re: Name of sequence node is not unique
> >
> > The invariant is that the value should be increasing except in failure
> > modes. You found a somewhat surprising failure mode.
> >
> > Please compute how long it would take for a 64-bit counter to overflow if
> > incremented a million times per second. (hint, half a million years).
> > Remember that zk only does things at less than 100,000 per second
> >
> > On Thu, Jun 15, 2023, 17:03 Li Wang <li4w...@gmail.com> wrote:
> >
> > > Thanks a lot for your inputs, Ted.
> > >
> > > On Thu, Jun 15, 2023 at 2:52 PM Ted Dunning <ted.dunn...@gmail.com>
> wrote:
> > >
> > > > Breaking a semantic invariant is a dangerous solution here.
> > >
> > > Totally agree. We should not break a semantic invariant if there is
> one.
> > > What's the semantic invariant here and how ZK is supposed to behave in
> the
> > > overflow case?
> > >
> > >
> > > > It would be better to use a longer integer.
> > > >
> > > Yes, I thought about this too. Longer integer will overflow too, so the
> > > issue of not generating unique numbers will still exist.
> > >
> > > Best,
> > >
> > > Li
> > >
> > > >
> > > >
> > > >
> > > > On Thu, Jun 15, 2023, 13:35 Li Wang <li4w...@gmail.com> wrote:
> > > >
> > > > > Thanks for the response, Enrico.
> > > > >
> > > > > Please see comments below.
> > > > >
> > > > > On Thu, Jun 15, 2023 at 5:47 AM Enrico Olivelli <
> eolive...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Li,
> > > > > > thanks for reporting your problem.
> > > > > >
> > > > > > Most likely you have found a bug.
> > > > > >
> > > > > > I have one question, related to your use case,
> > > > > > is the problem that the numbers are not "unique" or that the
> number
> > > is
> > > > > > not monotonically increasing ?
> > > > > >
> > > > >
> > > > > Technically speaking, monotonically increasing means either always
> > > > > increasing or *remaining constant. *With tha*t, * the problem is
> only
> > > > > the numbers are not "unique'. In this case, the parent cversion
> > > > > remains 2147483647
> > > > > after reaching Integer.MAX_VALUE, not unique any more.
> > > > >
> > > > > >
> > > > > > Do you have 2147483647 concurrent sessions and you found that two
> > > > > > sessions got the same sequenceId ?
> > > > > > or are you storing the sequenceId somewhere and you use it as a
> > > > > > globally unique id, not only among the connected sessions but
> also
> > > > > > among all the sessions that are ever connected to the cluster ?
> > > > > >
> > > > >
> > > > > In our use case, we create *persistent* *sequential* nodes. We
> store
> > > the
> > > > > sequence id in the client application and use it as a globally
> unique
> > > id.
> > > > > Currently Zookeeper guarantees the following non-overflow case but
> not
> > > > > after overflow.
> > > > >
> > > > > 1. Monotonically increasing
> > > > > 2. Uniqueness
> > > > > 3. Sequentially increase by 1
> > > > >
> > > > > For customers who have an overflow use case and can handle the
> sequence
> > > > > number cycling in a circular fashion, how about having a simple
> patch
> > > > > to support it and handle the overflow case better?  The change is
> > > adding
> > > > a
> > > > > condition to allow the wraparound when it flows to negative. We can
> > > also
> > > > > have a property to control whether to add the additional condition
> if
> > > > > needed.
> > > > >
> > > > > New
> > > > > ===
> > > > > if (parentCVersion > currentParentCVersion
> > > > >                 *|| parentCVersion == Integer.MIN_VALUE &&
> > > > > currentParentCVersion == Integer.MAX_VALUE) *{
> > > > >                 parent.stat.setCversion(parentCVersion);
> > > > >                 parent.stat.setPzxid(zxid);
> > > > >           }
> > > > >
> > > > > Current
> > > > > =====
> > > > > if (parentCVersion > parent.stat.getCversion()) {
> > > > >                 parent.stat.setCversion(parentCVersion);
> > > > >                 parent.stat.setPzxid(zxid);
> > > > >             }
> > > > >
> > > > > Please let me know what you think. Any input or feedback would be
> > > > > appreciated.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Li
> > > > >
> > > > >
> > > > > > Enrico
> > > > > >
> > > > > > Il giorno ven 9 giu 2023 alle ore 21:10 Li Wang <
> li4w...@gmail.com>
> > > ha
> > > > > > scritto:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > We are running 3.7.1 in production and running into an "issue"
> that
> > > > the
> > > > > > > names of sequence nodes are not unique after the counter hits
> the
> > > max
> > > > > int
> > > > > > > (i.e 2147483647) and overflows.  I would like to start a
> thread to
> > > > > > discuss
> > > > > > > the following
> > > > > > >
> > > > > > > 1. Is this a bug or "expected" behavior?
> > > > > > > 2. Is ZK supposed to support the overflow scenario and need to
> make
> > > > > sure
> > > > > > > the name is unique when overflow happens?
> > > > > > >
> > > > > > > The name is not unique after hitting the max int value because
> of
> > > we
> > > > > > > have the following in zk  code base:
> > > > > > >
> > > > > > > 1.  The cversion of parent znode is used to build the child
> name in
> > > > > > > PrepRequestProcessor
> > > > > > >
> > > > > > >         int parentCVersion = parentRecord.stat.getCversion();
> > > > > > >         if (createMode.isSequential()) {
> > > > > > >             path = path + String.format(Locale.ENGLISH,
> "%010d",
> > > > > > > parentCVersion);
> > > > > > >         }
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/
> > > > > > >
> > > java/org/apache/zookeeper/server/PrepRequestProcessor.java#L668-L671
> > > > > > >
> > > > > > >
> > > > > > > 2. The parent znode is read from either
> > > zks.outstandingChangesForPath
> > > > > map
> > > > > > > or zk database/datatree.
> > > > > > >
> > > > > > >            lastChange =
> zks.outstandingChangesForPath.get(path);
> > > > > > >             if (lastChange == null) {
> > > > > > >                 DataNode n = zks.getZKDatabase().getNode(path);
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L168-L170
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 3. The cversion of the parent node in
> outstandingChangesForPath map
> > > > is
> > > > > > > always updated  but not in zk database as we added the
> following
> > > code
> > > > > in
> > > > > > 3.6
> > > > > > >
> > > > > > >             if (parentCVersion > parent.stat.getCversion()) {
> > > > > > >                 parent.stat.setCversion(parentCVersion);
> > > > > > >                 parent.stat.setPzxid(zxid);
> > > > > > >             }
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/DataTree.java#L477-L480
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-3249
> > > > > > >
> > > > > > >
> > > > > > > When overflow happens, the new parentCversion is changed to
> > > > > -2147483648.
> > > > > > > It's updated in the outstandingChangesForPath map. It's not
> updated
> > > > in
> > > > > > > DataTree and the value stays as 2147483647  because
> -2147483648 is
> > > > less
> > > > > > > than 2147483647, so the cVerson is inconsistent in  ZK.
> > > > > > >
> > > > > > > Due to the inconsistent cVersion, when the next request comes
> in
> > > > after
> > > > > > > overflow, the sequence number is non-deterministic and not
> unique
> > > > > > depending
> > > > > > > on where the parent node is read from.  It can be 2147483647
> if the
> > > > > > > parent node is read from DataTree or -2147483648,  -2147483647
> and
> > > so
> > > > > on
> > > > > > if
> > > > > > > it's from the outstandingChangesForPath map.
> > > > > > >
> > > > > > > We have the following doc about unique naming but no info on
> > > > > "expected"
> > > > > > > behavior after overflow.
> > > > > > >
> > > > > > > Sequence Nodes -- Unique Naming
> > > > > > >
> > > > > > >
> > > > > > > When creating a znode you can also request that ZooKeeper
> append a
> > > > > > > monotonically increasing counter to the end of path. This
> counter
> > > is
> > > > > > unique
> > > > > > > to the parent znode. The counter has a format of %010d -- that
> is
> > > 10
> > > > > > digits
> > > > > > > with 0 (zero) padding (the counter is formatted in this way to
> > > > simplify
> > > > > > > sorting), i.e. "0000000001". See Queue Recipe for an example
> use of
> > > > > this
> > > > > > > feature. Note: the counter used to store the next sequence
> number
> > > is
> > > > a
> > > > > > > signed int (4bytes) maintained by the parent node, the counter
> will
> > > > > > > overflow when incremented beyond 2147483647 (resulting in a
> name
> > > > > > > "-2147483648").
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Please let me know if you have any comments or inputs.
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > >
> > > > > > > Li
> > > > > >
> > > > >
> > > >
> > >
>

Reply via email to