leerho commented on issue #10: URL: https://github.com/apache/datasketches-rust/issues/10#issuecomment-3736016330
**Concern:** Having a central store of ".sk" files creates a risk of staleness and is, in effect, a liability, because it would not continually check that each language generates its sk files correctly.

Our current approach, using Java and C++ as an example:

- The Cross-Language Tests (CLT) are managed by specific GitHub Actions workflows.
- The local CLT workflow associated with Java loads the remote C++ library and calls a C++ script to generate the ".sk" files into a local directory at the root, "serialization_test_data".
- The local CLT workflow then calls the local language (Java), which runs local tests against those generated files. This tests both the remote language's sk-file generation code and the local language's test code and sketch code. A failure could come from either the local or the remote side, so it will need to be diagnosed. Thus the sk files are always fresh from the latest version of the remote language, and the test code and sketch code are fresh from the local language.
- Because all of this work occurs on GHA runners, the generated files are deleted at the end of the GH action once the tests have completed.

We have also developed some conventions to help make this work, especially since not every language implements all the same sketches.

- The names of the sk files follow a pattern: "[sketch abbreviation]_[test abbreviation]_[source language].sk". This way, the local language only needs to load the *.sk files and test against the sketches it knows about.
- Because of the probabilistic nature of sketches, and because floating-point precision can differ between platforms and languages, we do not require that an sk file generated by one language be bit-for-bit identical to the same sk file generated by a different language. The tests allow for a given accuracy tolerance. This is by design.
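The naming and tolerance conventions above can be sketched roughly as follows. This is only an illustration in Python, not the project's actual test code. The function names, the example file names, and the tolerance value are all hypothetical. It also assumes that the first underscore-separated token is the sketch abbreviation, the last is the source language, and everything in between is the test abbreviation.

```python
import math
from pathlib import Path
from typing import NamedTuple, Optional


class SkName(NamedTuple):
    sketch: str
    test: str
    source_lang: str


def parse_sk_name(filename: str) -> Optional[SkName]:
    """Split '<sketch>_<test>_<lang>.sk' into parts; None if it does not match."""
    path = Path(filename)
    if path.suffix != ".sk":
        return None
    parts = path.stem.split("_")
    if len(parts) < 3 or not all(parts):
        return None
    # Assumption: first token = sketch, last token = source language,
    # anything in between = test abbreviation.
    return SkName(parts[0], "_".join(parts[1:-1]), parts[-1])


def select_known(filenames, known_sketches):
    """Keep only files whose sketch abbreviation the local language implements."""
    selected = []
    for name in filenames:
        parsed = parse_sk_name(name)
        if parsed is not None and parsed.sketch in known_sketches:
            selected.append((name, parsed))
    return selected


def within_tolerance(estimate: float, expected: float, rel_tol: float) -> bool:
    """Accept an estimate within a relative tolerance rather than exact equality."""
    return math.isclose(estimate, expected, rel_tol=rel_tol)


if __name__ == "__main__":
    # Hypothetical directory listing; the local language only knows "hll".
    files = ["hll_n1000_cpp.sk", "kll_n1000_cpp.sk", "README.md"]
    for name, parsed in select_known(files, {"hll"}):
        print(name, parsed.test, parsed.source_lang)
    # Hypothetical 1% tolerance on an estimate read back from a sketch.
    print(within_tolerance(1003.0, 1000.0, rel_tol=0.01))
```

Files for sketches the local language does not implement are simply never selected, so they need no special handling in the test code itself.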
To make this work across more than two languages, every language would need to provide some common capabilities and conventions that can be implemented in a GHA workflow.

- A script that generates all the sk files for the sketches it implements.
- A script that loads the sk files it knows about and tests against them.
- A means to skip testing of sketches that are not present in the generated files. This has yet to be implemented.
- Use the same local root directory name: "serialization_test_data".
- Use the same convention in naming the sk files: "[sketch abbreviation]_[test abbreviation]_[source language].sk"

Because all the work is done using GHA, it is possible that all of the GHA "cross-language-testing" workflows could be centrally located, either in one chosen language or in a separate repo (released or not released). This would avoid forcing every language to create another release every time a new sketch appears or changes in a different language.

I don't believe we need to test the full N x N combination of languages; perhaps we choose, for each language, one or at most two other languages to test against.

With a separate repo for the GHA cross-language workflows, this repo would likely need to be updated every time a language does a new release, but I don't think it is necessary that the repo be "released".

Thoughts?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
