Thanks!
On Sat, Mar 2, 2024 at 2:23 PM freakyzoidberg (via GitHub) <[email protected]> wrote:
>
> freakyzoidberg commented on code in PR #163:
> URL: https://github.com/apache/datasketches-website/pull/163#discussion_r1510085541
>
> ##########
> docs/Architecture/LargeScale.md:
> ##########
> @@ -21,20 +21,47 @@ layout: doc_page
> -->
> ## Designed for Large-scale Computing Systems
>
> +#### Multiple Languages
> +
> +* The DataSketches library is now available in three languages, Java, C++, and Python. A forth language, GoLang, is in development.
> +
> +
> +### Compatibility Across Languages, Software Versions And Binary Serialization Versions
> +Large-scale computing environments may have a mix of various platforms utilizing different programming languages each with the possiblity of using different Software Versions of our DataSketches library. Cross version compatibility of software is a challenge that all platforms face in general, and it is up to the platform maintainers to keep their software up-to-date. This not new and not different with the DataSketches library.
> +
> +Nonetheless, it our goal to strive to make it as easy as practically possible to serialize our sketches in one of our supported languages on one platform and to be deserialized in a different supported language, potentially on a different, even remote platform, and perhaps much later in time.
> +
> +With this goal in mind, here are some of the key strategic decisions we have made in the development of the DataSketches library.
> +
> +#### Two levels of versioning.
> +
> +* **Software Version:** This is the release version, published via Apache.org and specified in the POM file or equivalent. This can change relatively frequently based on bug fixes and introduction of new capabilities. We follow the principles of *Semantic Versioning* as specified by [semver.org](https://semver.org).
> +
> +* **Serialization Version:** (*SerVer*) This is a small integer placed in the preamble of the serialized byte array that indicates the version of the serialized structure for the sketch. This is very similar to Java's [*Class File Format Version*](https://en.wikipedia.org/wiki/Java_class_file). A single *SerVer* may represent multiple structures all based on the same sketch when stored in different states, e.g., *Single Item*, *Compact*, *Updatable*, etc). This *SerVer* changes very rarely, if at all. Of all of our sketches, only a few, e.g., Theta, KLL and Sampling, have had more than one *SerVer* over time. There are and will be many *Software Versions* of the same sketch that still use the same *SerVer*. When we have to update the *SerVer*, we provide the capability in the *Software Version* of the code associated with the new *SerVer* the ability to read and convert the old *SerVer* to the new *SerVer*. This is why our newest *Software Versions* can still read and interpret older *SerVer* serialized sketches that go back to when our project was started at Yahoo (2012), and before we went open-source (2015). Technically speaking this can be described as *Backward-Transient* compatibility [Schema Evolution and Compatibility](https://docs.confluent.io/platform/current/schema-registry/fundamentals/schema-evolution.html) and [Schema Evolution](https://en.wikipedia.org/wiki/Schema_evolution).
> +
> +From the user's perspective, as long as the *SerVer* is the same, older *Software Versions* should be able to read sketch images created by newer *Software Versions*. But the APIs may be different, obviously. An older *Software Version* will not be able to take advantage of new features introduced in new *Software Versions*, but it should be able to do what it did before. In other words, there will be no loss of access to the serialized sketch and the older *Software Version* capabilities. A user should not need to access the *SerVer*, nonetheless it is always stored in index one of the serialized image. If a sketch is presented with a *SerVer* that it is not compatible with, the sketch should throw an exception and say what the problem is, just like Java does with its *Class File Format Versions*.
> +
> +#### The Serialized Image of a Sketch
> +* The structure (or image) of a serialized sketch is independent of the language from which it was created.
> +* The sketch image only contains little-endian primitives, such as int64, int32, int16, int8, double-64, float-32, UTF-8 strings, and simple array structures of those, which can be easily interpreted in many languages on modern CPUs. We do not support big-endian serialization.
> +* The sketch image is unique for each type of sketch.
> +* Simply speaking, a sketch image can be viewed as a blob of bytes, which is easily stored and easily transported using many different protocols, including Protobuf, Avro, Thrift, Byte64, etc.
> +
>
> Review Comment:
> Should we clarify that for some sketches (FrequencyItemSketch iirc) the serialised form between language may not be strictly equal but still be logically equivalent?
>
>
> ##########
> docs/Architecture/LargeScale.md:
> ##########
> @@ -21,20 +21,47 @@ layout: doc_page
> -->
> ## Designed for Large-scale Computing Systems
>
> +#### Multiple Languages
> +
> +* The DataSketches library is now available in three languages, Java, C++, and Python. A forth language, GoLang, is in development.
>
> Review Comment:
> `A fourth language, Go, is in development.`
>
> Typo in forth/fourth
> GoLang is the former name of Go https://go.dev/doc/faq#go_or_golang
>
> --
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on to GitHub and use the
> URL above to go to the specific comment.
