I can't remember if the API in question returns the int in question directly or as part of a node name.
If the value is embedded in a string node name, then no protocol difference would be required. You would still get a unique string. It would just not be parseable back as a valid int. A doc warning to that effect would probably be sufficient. If the value is returned as an int, we could add a new call that returns a long. The int version could be *slightly* changed so that it throws an error if the counter has exceeded 2^31. Users who never could hit the limit (essentially everybody using ZK normally) would never hit the corner condition. People who were on the way to hit the error condition will get an informative error. And people who understand the problem will use the long version and never have the problem. In the end, however, I really do think that this is not something that is going to impact any realistic use of the software. A sustained average creation rate of >1 node per second is something that I have never seen and I would strong suggest that if I were to see such a use that it is substantially in error. If you want to create unique values, it is much better to have a fast process that generates 128 bit id's and just use ZK to update the next starting point each time the generating process restarts. This can easily decrease the ZK load by a factor of 10^6 or more and gives much higher performance than Zk alone ever could. For instance, on process restart, increment the next starting point by 10^9 and start generating unique values. If you have 1000 such generators, each will increment the starting point and overall, the starting point will be incremented by 10^12 \approx 2^40. With a 128 bit result, that leaves you with 2^80 restarts before you have a risk of collision. This configuration could probably generate 10^9 unique values per second for the remaining life of the earth. On Fri, Jun 16, 2023 at 3:43 AM Enrico Olivelli <eolive...@gmail.com> wrote: > Li, > > Il giorno ven 16 giu 2023 alle ore 11:43 Li Wang <li4w...@gmail.com> ha > scritto: > > > > Thanks for the comments and inputs, Ted, Josef and Kezhu. The discussions > > are great. > > > > > The invariant is that the value should be increasing except in > > failure modes. You found a somewhat surprising failure mode. > > Does this mean the value is supposed to be always increasing, never > > decreasing or remaining constant, otherwise it's a failure mode and we > need > > to fix it? > > > > > Please compute how long it would take for a 64-bit counter to overflow > > if incremented a million times per second. (hint, half a million years). > > Haha, it would take way too long (> 0.3M years) for a 64-bit counter to > > overflow, so we don't need to worry about it. > > > > > It would be better to use a longer integer. > > > 1. Increasing to a 64-bit counter certainly solves the problem, but > > this might require conversion of zk data when the current counter is > stored > > as 32-bit > > > I support this, but it demands massive work and probably relates to > > a long term goal[6]. The "int" fact of "cversion" is exposed both > > in API(Stat) and storage(StatPersisted). > > I agree. Using a 64-bit counter is a better way but the scope of the > change > > is big. If we have a consensus on this is the way to go, maybe we can > scope > > it out and break it into sub-tasks so we can have more people contribute > to > > it if there is a need. > > > > > I wrote a test[1] and found that overflow of cversion results > > in NODEEXISTS[2]. I think this is better than what the doc[3] says. > > > There is no wrap-around on the client side. > > NODEEXISTS is thrown because the same prefix is used in the test. If you > > use different prefixes, you will get the wrap-around, as the child path > is > > different even if the sequence is the same. > > > > > I think you could take a look at ZOOKEEPER-4332[4]. ZooKeeper does not > > work well with massive children > > Yes, to work around that, we only keep the most recent x number of > children > > and have a paginated way of listing the children. > > > > > This breaks "monotonically increasing" and gains no"uniqueness". The > > cversion will wrap around again given your cases. > > It breaks "monotonically increasing" only at the point of overflow, it is > > "monotonically increasing" before overflow and after overflow. > > It gains more ""uniqueness" as the range of "uniqueness" is increased > from > > [0, 2147483647] to [-2147483648, 2147483647]. > > This is not ideal and I agree using a 64 counter would be better if we > can. > > I may be fine with a proposal that allows you to extend the range of > values, but I am not sure that allowing it > would make much difference or introduce more problems: we are seeing > that the sequence will no longer be monotonically increasing, so we. > lose a feature > > I am supporting the initiative to move the counter to be 64 bits, as > far as we are able to > ensure 100% disk and wire protocol compatibility with clusters and > applications that do not overflow. > > I am still a little bit worried about your use case. > I may be wrong that the ids generated by ZooKeeper are not meant to be > globally unique outside of the scope of "coordination" of the clients > connected to the cluster. > For instance it is not expected that you use ZooKeeper to generate a > "customer id" or a "session id" (where the session is a session with a > web browser). > > It would be better to store a "counter" and use "compare and set", > but, again, zookeeper is not meant to be used on "hot paths", there > are many systems built over zookeeper > that provide higher level features with high performance (like > BookKeeper, or let me cite HerdDB, a small but powerful project I > worked on some time ago). > > So in the end my suggestions are: > - work together on the 64 bits counter > - rework your application to not use ZK to generate globally unique > ids that are not used for coordination > > Enrico > > > > > Thanks, > > > > Li > > > > On Thu, Jun 15, 2023 at 9:31 PM Kezhu Wang <kez...@gmail.com> wrote: > > > > > Hi, > > > > > > > Note: the counter used to store the next sequence number is a signed > int > > > (4bytes) maintained by the parent node, the counter will overflow when > > > incremented beyond 2147483647 (resulting in a name "-2147483648"). > > > > > > I wrote a test[1] and found that overflow of cversion results in > > > NODEEXISTS[2]. I think this is better than what the doc[3] says. There > > > is no wrap-around on the client side. > > > > > > > In our use case, we create *persistent* *sequential* nodes. We store > the > > > sequence id in the client application and use it as a globally unique > id. > > > > > > Given that you have overflowed the cversion, I think you could take a > > > look at ZOOKEEPER-4332[4]. ZooKeeper does not work well with massive > > > children when listing. BookKeeper's HierarchicalLedgerManager[5] is a > > > real world example for this. > > > > > > > > > > New > > > > === > > > > if (parentCVersion > currentParentCVersion > > > > *|| parentCVersion == Integer.MIN_VALUE && > > > > currentParentCVersion == Integer.MAX_VALUE) *{ > > > > parent.stat.setCversion(parent > > > > CVersion); > > > > parent.stat.setPzxid(zxid); > > > > } > > > > > > This breaks "monotonically increasing" and gains no "uniqueness". The > > > cversion will wrap around again given your cases. > > > > > > > It would be better to use a longer integer. > > > > 1. Increasing to a 64-bit counter certainly solves the problem, > but > > > this might require conversion of zk data when the current counter is > stored > > > as 32-bit > > > > > > I support this, but it demands massive work and probably relates to a > > > long term goal[6]. The "int" fact of "cversion" is exposed both in > > > API(Stat) and storage(StatPersisted). > > > > > > > > > [1]: > > > > https://github.com/kezhuw/zookeeper/commit/755b1168156c28e4fc2813be593ac67514e8bdc7#diff-1af986ce48b5d4bb4b8e51374a70cc6e109a04c70d9f450be3df8f302010341cR59 > > > [2]: > > > > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L675 > > > [3]: > > > > https://zookeeper.apache.org/doc/r3.8.1/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming > > > [4]: https://issues.apache.org/jira/browse/ZOOKEEPER-4332 > > > [5]: > > > > https://bookkeeper.apache.org/docs/getting-started/concepts/#hierarchical-ledger-manager > > > [6]: https://issues.apache.org/jira/browse/ZOOKEEPER-102 > > > > > > Kezhu Wang > > > > > > > > > On Fri, Jun 16, 2023 at 11:33 AM Josef Roehrl > > > <josef.roe...@fuseforward.com.invalid> wrote: > > > > > > > > I wanted to add 2 things. > > > > > > > > > > > > 1. Increasing to a 64-bit counter certainly solves the problem, > but > > > this might require conversion of zk data when the current counter is > stored > > > as 32-bit > > > > 2. A client that relies on a unique version that it uses as a > > > reference outside of zk should verify that a version that it receives > does > > > not already exist outside zk. This applies even if 1. is considered, > should > > > a zk quorum be reset or lose its data. > > > > > > > > Josef Roehrl > > > > FuseForward | Senior Architect - Professional Services > > > > [ > > > > https://fuseforward.atlassian.net/wiki/download/attachments/512327681/image001.png?version=1&modificationDate=1537397840684&cacheVersion=1&api=v2 > > > ] > > > > Website< > > > > https://fuseforward.com/?utm_source=Email%20Signature&utm_medium=email%20signature&utm_campaign=email%20signature > > > > > | Newsletter< > > > > https://fuseforward.com/subscribe-to-our-newsletter/?utm_source=Email%20Signature&utm_medium=Email%20Signature&utm_campaign=Email%20Signature > > > > > | Twitter<https://twitter.com/fuseforward> | LinkedIn< > > > https://www.linkedin.com/company/fuseforward/?originalSubdomain=ca> > > > > > > > > ________________________________ > > > > From: Ted Dunning <ted.dunn...@gmail.com> > > > > Sent: Thursday, June 15, 2023 6:53 PM > > > > To: dev@zookeeper.apache.org <dev@zookeeper.apache.org> > > > > Subject: Re: Name of sequence node is not unique > > > > > > > > The invariant is that the value should be increasing except in > failure > > > > modes. You found a somewhat surprising failure mode. > > > > > > > > Please compute how long it would take for a 64-bit counter to > overflow if > > > > incremented a million times per second. (hint, half a million years). > > > > Remember that zk only does things at less than 100,000 per second > > > > > > > > On Thu, Jun 15, 2023, 17:03 Li Wang <li4w...@gmail.com> wrote: > > > > > > > > > Thanks a lot for your inputs, Ted. > > > > > > > > > > On Thu, Jun 15, 2023 at 2:52 PM Ted Dunning <ted.dunn...@gmail.com > > > > > wrote: > > > > > > > > > > > Breaking a semantic invariant is a dangerous solution here. > > > > > > > > > > Totally agree. We should not break a semantic invariant if there is > > > one. > > > > > What's the semantic invariant here and how ZK is supposed to > behave in > > > the > > > > > overflow case? > > > > > > > > > > > > > > > > It would be better to use a longer integer. > > > > > > > > > > > Yes, I thought about this too. Longer integer will overflow too, > so the > > > > > issue of not generating unique numbers will still exist. > > > > > > > > > > Best, > > > > > > > > > > Li > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jun 15, 2023, 13:35 Li Wang <li4w...@gmail.com> wrote: > > > > > > > > > > > > > Thanks for the response, Enrico. > > > > > > > > > > > > > > Please see comments below. > > > > > > > > > > > > > > On Thu, Jun 15, 2023 at 5:47 AM Enrico Olivelli < > > > eolive...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > Li, > > > > > > > > thanks for reporting your problem. > > > > > > > > > > > > > > > > Most likely you have found a bug. > > > > > > > > > > > > > > > > I have one question, related to your use case, > > > > > > > > is the problem that the numbers are not "unique" or that the > > > number > > > > > is > > > > > > > > not monotonically increasing ? > > > > > > > > > > > > > > > > > > > > > > Technically speaking, monotonically increasing means either > always > > > > > > > increasing or *remaining constant. *With tha*t, * the problem > is > > > only > > > > > > > the numbers are not "unique'. In this case, the parent cversion > > > > > > > remains 2147483647 > > > > > > > after reaching Integer.MAX_VALUE, not unique any more. > > > > > > > > > > > > > > > > > > > > > > > Do you have 2147483647 concurrent sessions and you found > that two > > > > > > > > sessions got the same sequenceId ? > > > > > > > > or are you storing the sequenceId somewhere and you use it > as a > > > > > > > > globally unique id, not only among the connected sessions but > > > also > > > > > > > > among all the sessions that are ever connected to the > cluster ? > > > > > > > > > > > > > > > > > > > > > > In our use case, we create *persistent* *sequential* nodes. We > > > store > > > > > the > > > > > > > sequence id in the client application and use it as a globally > > > unique > > > > > id. > > > > > > > Currently Zookeeper guarantees the following non-overflow case > but > > > not > > > > > > > after overflow. > > > > > > > > > > > > > > 1. Monotonically increasing > > > > > > > 2. Uniqueness > > > > > > > 3. Sequentially increase by 1 > > > > > > > > > > > > > > For customers who have an overflow use case and can handle the > > > sequence > > > > > > > number cycling in a circular fashion, how about having a simple > > > patch > > > > > > > to support it and handle the overflow case better? The change > is > > > > > adding > > > > > > a > > > > > > > condition to allow the wraparound when it flows to negative. > We can > > > > > also > > > > > > > have a property to control whether to add the additional > condition > > > if > > > > > > > needed. > > > > > > > > > > > > > > New > > > > > > > === > > > > > > > if (parentCVersion > currentParentCVersion > > > > > > > *|| parentCVersion == Integer.MIN_VALUE && > > > > > > > currentParentCVersion == Integer.MAX_VALUE) *{ > > > > > > > parent.stat.setCversion(parentCVersion); > > > > > > > parent.stat.setPzxid(zxid); > > > > > > > } > > > > > > > > > > > > > > Current > > > > > > > ===== > > > > > > > if (parentCVersion > parent.stat.getCversion()) { > > > > > > > parent.stat.setCversion(parentCVersion); > > > > > > > parent.stat.setPzxid(zxid); > > > > > > > } > > > > > > > > > > > > > > Please let me know what you think. Any input or feedback would > be > > > > > > > appreciated. > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > Li > > > > > > > > > > > > > > > > > > > > > > Enrico > > > > > > > > > > > > > > > > Il giorno ven 9 giu 2023 alle ore 21:10 Li Wang < > > > li4w...@gmail.com> > > > > > ha > > > > > > > > scritto: > > > > > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > We are running 3.7.1 in production and running into an > "issue" > > > that > > > > > > the > > > > > > > > > names of sequence nodes are not unique after the counter > hits > > > the > > > > > max > > > > > > > int > > > > > > > > > (i.e 2147483647) and overflows. I would like to start a > > > thread to > > > > > > > > discuss > > > > > > > > > the following > > > > > > > > > > > > > > > > > > 1. Is this a bug or "expected" behavior? > > > > > > > > > 2. Is ZK supposed to support the overflow scenario and > need to > > > make > > > > > > > sure > > > > > > > > > the name is unique when overflow happens? > > > > > > > > > > > > > > > > > > The name is not unique after hitting the max int value > because > > > of > > > > > we > > > > > > > > > have the following in zk code base: > > > > > > > > > > > > > > > > > > 1. The cversion of parent znode is used to build the child > > > name in > > > > > > > > > PrepRequestProcessor > > > > > > > > > > > > > > > > > > int parentCVersion = > parentRecord.stat.getCversion(); > > > > > > > > > if (createMode.isSequential()) { > > > > > > > > > path = path + String.format(Locale.ENGLISH, > > > "%010d", > > > > > > > > > parentCVersion); > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/ > > > > > > > > > > > > > > > java/org/apache/zookeeper/server/PrepRequestProcessor.java#L668-L671 > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. The parent znode is read from either > > > > > zks.outstandingChangesForPath > > > > > > > map > > > > > > > > > or zk database/datatree. > > > > > > > > > > > > > > > > > > lastChange = > > > zks.outstandingChangesForPath.get(path); > > > > > > > > > if (lastChange == null) { > > > > > > > > > DataNode n = > zks.getZKDatabase().getNode(path); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L168-L170 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. The cversion of the parent node in > > > outstandingChangesForPath map > > > > > > is > > > > > > > > > always updated but not in zk database as we added the > > > following > > > > > code > > > > > > > in > > > > > > > > 3.6 > > > > > > > > > > > > > > > > > > if (parentCVersion > > parent.stat.getCversion()) { > > > > > > > > > parent.stat.setCversion(parentCVersion); > > > > > > > > > parent.stat.setPzxid(zxid); > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/DataTree.java#L477-L480 > > > > > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-3249 > > > > > > > > > > > > > > > > > > > > > > > > > > > When overflow happens, the new parentCversion is changed to > > > > > > > -2147483648. > > > > > > > > > It's updated in the outstandingChangesForPath map. It's not > > > updated > > > > > > in > > > > > > > > > DataTree and the value stays as 2147483647 because > > > -2147483648 is > > > > > > less > > > > > > > > > than 2147483647, so the cVerson is inconsistent in ZK. > > > > > > > > > > > > > > > > > > Due to the inconsistent cVersion, when the next request > comes > > > in > > > > > > after > > > > > > > > > overflow, the sequence number is non-deterministic and not > > > unique > > > > > > > > depending > > > > > > > > > on where the parent node is read from. It can be > 2147483647 > > > if the > > > > > > > > > parent node is read from DataTree or -2147483648, > -2147483647 > > > and > > > > > so > > > > > > > on > > > > > > > > if > > > > > > > > > it's from the outstandingChangesForPath map. > > > > > > > > > > > > > > > > > > We have the following doc about unique naming but no info > on > > > > > > > "expected" > > > > > > > > > behavior after overflow. > > > > > > > > > > > > > > > > > > Sequence Nodes -- Unique Naming > > > > > > > > > > > > > > > > > > > > > > > > > > > When creating a znode you can also request that ZooKeeper > > > append a > > > > > > > > > monotonically increasing counter to the end of path. This > > > counter > > > > > is > > > > > > > > unique > > > > > > > > > to the parent znode. The counter has a format of %010d -- > that > > > is > > > > > 10 > > > > > > > > digits > > > > > > > > > with 0 (zero) padding (the counter is formatted in this > way to > > > > > > simplify > > > > > > > > > sorting), i.e. "0000000001". See Queue Recipe for an > example > > > use of > > > > > > > this > > > > > > > > > feature. Note: the counter used to store the next sequence > > > number > > > > > is > > > > > > a > > > > > > > > > signed int (4bytes) maintained by the parent node, the > counter > > > will > > > > > > > > > overflow when incremented beyond 2147483647 (resulting in a > > > name > > > > > > > > > "-2147483648"). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please let me know if you have any comments or inputs. > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > > Li > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >