Great, I'll be submitting the pull request shortly. The codebase I'm working with doesn't have any of the changes made in the past week or so, hopefully that isn't too much of a hassle to merge.
As an aside, my employer encourages us to contribute code to libraries like this, so I'm happy to work on additional features for the Python interface as needed. Michael ________________________________ From: Jon Malkin <[email protected]> Sent: Thursday, May 14, 2020 6:56 PM To: [email protected] <[email protected]> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software We've been polishing things up for a release, so that was one of several things that we fixed over the last several days. Thank you for finding it! Anyway, if you're generally happy with the state of things (and are allowed to under any employment terms), I'd encourage you to create pull request to merge your changes into the main repo. It doesn't need to be perfect as we can always make changes as part of the PR review or post-merge. Thanks, jon On Mon, May 11, 2020 at 2:25 PM Michael Himes <[email protected]<mailto:[email protected]>> wrote: Thanks for taking a look, Jon. I pushed an update that address 2 & 4. #3 is actually something I had a question about. I've tested passing numpy.nan into the update function, and it doesn't appear to break anything (min, max, etc all still work correctly). However, the reported number of items per sketch counts the nan entries. Is this the expected behavior, or should the get_n() method return a number that does not count the nans it has seen? I expected the latter, so I'm worried that numpy's nan is being treated differently. Michael ________________________________ From: Jon Malkin <[email protected]<mailto:[email protected]>> Sent: Monday, May 11, 2020 4:32 PM To: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software I didn't look in super close detail, but the code overall looks pretty good. Comments are below. Note that not all of these necessarily need changes or replies. I'm just trying to document things we'll want to think about for keeping the library general-purpose (and we can always make changes after merging, of course). 1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe vector_kll_sketches? But if there's a way to extend KLL in the future to operate on an entire vector at a time (vs treating each dimension independently) that'd become confusing. I think an inherently vectorized version would be a very different beast, but I always worry I'm not being imaginative enough. If merging into the Apache codebase, I'd probably wait to see what the file looks like with the renaming before a final decision on moving to its own file. 2. What happens if the input to update() has >2 dimensions? If that'd be invalid, we should explicitly check and complain. If it'll Do The Right Thing by operating on the first 2 dimensions (meaning correct indices) that's fine, but otherwise should probably complain. 3. Can this handle sparse input vectors? Not sure how important that is in general, even if your project doesn't require it. kll_sketch will ignore NaNs, so those appearing would mean the number of items per sketch can already differ. 4. I'd probably eat the very slightly increased space and go with 32 bits for the number of dimensions (aka number of sketches). If trying to look at a distribution of values for some machine learning application, it'd be easy to overflow 65k dimensions for some tasks. 5. I imagine you've realized that it's easiest to do unit tests from python in this case. That's another advantage of having this live in the wrapper. 6. Finally, that assert issue is already obsolete :). Asserts were converted if/throw exceptions late last week. It'll be flagged as a conflict in merging, so no worries for now. Looking good at this point. And as I said, not all of these need changes or comments from you. jon On Mon, May 11, 2020 at 7:09 AM Michael Himes <[email protected]<mailto:[email protected]>> wrote: Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- I'll leave it to you to decide if it's better as its own file. Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws errors regarding the assert calls in kll_sketch_impl.hpp. I added an include of assert.h there and then it compiled without issue. It's possible that other compilers will also complain about that, so maybe this is a good update to the main branch. Michael ________________________________ From: Jon Malkin <[email protected]<mailto:[email protected]>> Sent: Sunday, May 10, 2020 10:47 PM To: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software My only comment without having looked at actual code is that the new class would be more appropriate in the python wrapper. Maybe even drop it in as it's own file, as that would decrease recompile time a bit when debugging (that's pybind's suggestion, anyway). Probably not a huge difference with how light these wrappers are. If this is something that becomes widely used, to where we look at pushing it into the base library, we'd look at whether we could share any data across sketches. But we're far from that point currently. It'd be nice to need to consider that. jon On Sun, May 10, 2020, 7:33 PM leerho <[email protected]<mailto:[email protected]>> wrote: Michael, this has been a great interchange and certainly will allow us to move forward more quickly. Thank you for working on this on a Mother's Day Sunday! I'm sure Alex and Jon may have more questions, when they get a chance to look at it starting tomorrow. Cheers, and be safe and well! Lee. On Sun, May 10, 2020 at 6:25 PM Michael Himes <[email protected]<mailto:[email protected]>> wrote: Re: testing, so far I've just done glorified unit tests for uniform and normal distributions of varying sizes. I plan to do some timing tests vs the existing single-sketch Python class to see how it compares for 1, 10, and 100 streams. 1. That makes sense. One option to allow full Numpy compatibility but without requiring a Python user to use Numpy would be to return everything as lists, rather than Numpy arrays. Numpy users could then convert those lists into arrays, and non-Numpy users would be unaffected (aside from needing the pybind11/numpy.h header). Alternatively, some flag could be set when instantiating the object that would control whether things are returned as lists or arrays, though this still requires the numpy.h header file. 2. I didn't change the kll_sketch code, I only defined a new (wrapper) class called kll_sketches, which spawns a user-specified number of sketches. Each of those sketches are kll_sketch objects and uses all of the existing code for that. For fast execution in Python, the parallel sketches must be spawned in C++, but the existing Python object could only spawn a single sketch since it wraps the kll_sketch class. Perhaps the kll_sketches class would be better placed in the python/src/kll_wrapper.cpp file? I suppose you wouldn't need this class if you weren't using Python. 3. Yes, SerDe is very straight-forward here. I've marked some stuff as todo's, and that is one of them -- the plan is to do like you described and call the relevant kll_sketch method on each of the sketches and return that to Python in a sensible format. For deserialization, it would just iterate through them and load them into the kll_sketches object. I don't require it for my project, so I didn't bother to wrap that yet -- I'll take a look sometime this week after I finish my work for the day, shouldn't take long to do. 4. That makes sense. Does using Numpy complicate that at all? My thought is that since under the hood everything is using the existing kll_sketch class, it would have full compatibility with the rest of the library (once SerDe is added in). Michael ________________________________ From: leerho <[email protected]<mailto:[email protected]>> Sent: Sunday, May 10, 2020 8:42 PM To: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software Thanks for the link to your code. My colleagues, Jon and Alex, will take a closer look this next week. They wrote this code so they are much closer to it than I. What you have done so far makes sense for you as you want to get this working in the NumPy environment as quickly as possible. As soon as we start thinking about incorporating this into our library other concerns become important. 1. Adding API calls is the recommended way to add functionality (like NumPy) to a library. We cannot change API calls in a way that is only useful with NumPy, because it would seriously impact other users of the library that don't need NumPy. If both sets of calls cannot simultaneously exist in the same sketch API, then we need to consider other alternatives. 2. Based on our previous discussions, I didn't envision that you would have to change the kll_sketch code itself other than perhaps a "wrapper" class that enables vectorized input to a vector of sketches and a vectorized get result that creates a vector result from a vector of sketches. This would isolate the changes you need for NumPy from the sketch itself. This is also much easier to support, maintain and debug. 3. If you don't change the internals of the sketch then SerDe becomes pretty straightforward. I don't know if you need a single serialization that represents a full vector of sketches, but if you do, then I would just iterate over the individual serdes and figure out how to package it. I really don't think you want to have to rewrite this low-level stuff. 4. Binary compatibility is critically important for us and I think will be important for you as well. There are two dimensions of binary compatibility: history and language. This means that a kll sketch serialized from Java, can be successfully read by C++ and visa versa. Similarly, a kll sketch serialized today will be able to be read many years from now. Another aspect of this would mean being able to collect, say, 100 sketches that were not created using the NumPy version, and being able to put them together in a NumPy vector; and visa versa. I hope all of this make sense to you. Cheers, Lee. On Sun, May 10, 2020 at 4:21 PM leerho <[email protected]<mailto:[email protected]>> wrote: Michael, This is great! What testing have you been able to do so far? On Sun, May 10, 2020 at 3:31 PM Michael Himes <[email protected]<mailto:[email protected]>> wrote: Lee, Thanks for all of that information, it's quite helpful to get a better understanding of things. I've put the code on Github if you'd like to take a look: https://github.com/mdhimes/incubator-datasketches-cpp<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdhimes%2Fincubator-datasketches-cpp&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C87f6dcd63e044eb182d608d7f85a1bf1%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637250938225387903&sdata=srvRz99iYMtAVoOCi51WCTOTFFzA4zJT%2BAU78fFKHcE%3D&reserved=0> Changes are - new class in kll/include/kll_sketch.hpp, w/ associated constructor in kll/include/kll_sketch_impl.hpp. This class spawns a specified number of sketches. - new Python interface functions in python/src/kll_wrapper.cpp The only new dependency introduced is the pybind11/numpy.h header file. The new Numpy-compatible Python classes retain identical functionality to the existing classes (with minor changes to method names, e.g., get_min_value --> get_min_values), except that I have not yet implemented merging or (de)serialization. These would be straight-forward to implement, if needed. Re: characterization tests, I'll take a look at those tests you linked to and see about running them, time and compute resources permitting. Michael ________________________________ From: leerho <[email protected]<mailto:[email protected]>> Sent: Sunday, May 10, 2020 5:32 PM To: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software Michael, Is there a place on GitHub somewhere where I could look at your code so far? The reason I ask, is before you do a PR, we would like to determine where a contribution such as this should be placed. Our library is split up among different repositories, determined by language and dependencies. This keeps the user downloads smaller and more focused. We have two library repos for the core sketch algorithms, one for Java and one for C++/Python, where the dependencies are very lean, which simplifies integration into other systems. We have separate repos for adaptors, which depend on one of the core repos. On the Java side, we have separate repos for adaptors for Apache Hive and Apache Pig, as the dependencies for each of these are quite large. For C++, we have a dedicated repo for the adaptors for PostgreSQL. Some of our adaptors are hosted with the target system. For example, our Druid adaptors were contributed directly into Apache Druid. I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It is not clear to me at the moment whether we should create a separate repo for this or have a separate group of directories in our cpp repo. **** We have a separate repo for our characterization code, which is not formally "released" as an Apache release. It exists because we want others to be able to reproduce (or challenge) our claims of speed performance or accuracy. It is the one repo where we have all languages and many different dependencies. The coding style is not as rigorous or as well documented as our repos that do have formal releases. Characterization testing is distinctly different from Unit Tests, which basically checks all the main code paths and makes sure that the program works as it should. The key metric is code coverage and Unit Tests should be fast as it is run on every check-in of new code. Characterization is also different from Integration Testing, which is testing how well the code works when integrated into larger systems. Characterization tests are unique to our kind of library. Because our algorithms are probabilistic in nature, in order to verify accuracy or speed performance we need to run many thousands of trials to eliminate statistical noise in the results. And when the data is large, this can take a long time. You can peruse our website for many examples as all the plots result from various characterization studies. What appears on the website is but a small fraction of all the testing we have done. There are no "standard" tests as every sketch is different so we have to decide what is important to measure for a particular sketch, but the basic groups are speed and accuracy. For speed there are many possible measurements, but the basic ones are update speed, merge speed, Serialization / Deserialization speed, get estimate or get result speeds. For accuracy we want to validate that the sketch is performing within the bounds of the theoretical error distribution. We want to measure this accuracy in the context of a stand-alone, purely streaming sketch and also in the context of merging many sketches together. We also try to do these same tests comparing the results against other alternatives users might have. We have performed these same characterizations on other publically available sketches as well as against traditional, brute-force approaches to solving the same problem. For the solution you have developed, we would depend on you to decide what properties would be most important to characterize for users of this solution. It should be very similar to what you would write in a paper describing this solution; you want to convince the reader that this is very useful and why. Since the first sketch you have leveraged is the KLL quantiles sketch, I would think you would want some characterizations similar to what we did for our studies<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C87f6dcd63e044eb182d608d7f85a1bf1%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637250938225397900&sdata=E1M4tMTmRum%2BXcUnv5ikiIRTGEIOMRoSygkdyC%2FZlbM%3D&reserved=0> comparing our older quantiles sketch and the KLL sketch. **** For the Java characterization tests, we have "standardized" on having small configuration files which define the key parameters of the test. These are simple text files<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C87f6dcd63e044eb182d608d7f85a1bf1%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637250938225397900&sdata=LjYOjw8GQo6vUIdgwNCi2GLZUZiHdimrggzgZUsZW40%3D&reserved=0> of key-value pairs. We don't have any centralized definition of these pairs, just that they are human readable and intelligible. They are different for each type of sketch. For the C++ tests, we don't have a collection of config files yet (this is one of our TODOs), but the same kind of parameters are set in the code itself. We will likely want to set up a separate directory for your characterization tests. I hope you find this helpful. Cheers, Lee. On Sun, May 10, 2020 at 10:05 AM Michael Himes <[email protected]<mailto:[email protected]>> wrote: The code is in a good state now. It can take individual values, lists, or Numpy arrays as input, and it returns back Numpy arrays. There are some additional features, like being able to specify which sketches the user wants to, e.g., get quantiles for. But, I have only done minor testing with uniform and normal distributions. I'd like to put it through more extensive testing (and some documentation) before releasing it, and it sounds like your characterization tests are the way to go -- it's not science if it's not reproducible! Is there a standard set of tests for this purpose? If not, are there standard tests that have been used for the existing codebase? Michael ________________________________ From: leerho <[email protected]<mailto:[email protected]>> Sent: Saturday, May 9, 2020 7:21 PM To: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software This is great. The first step is to get your project working! Once you think you are ready, it would be really useful if you could do some characterization testing in the NumPy environment. Characterization tests are what we run to fully understand how a sketch performs over a range of parameters and using thousands to millions of trials. You can see some of the accuracy and speed performance plots of various sketches on our website. Sometimes these can take hours to run. We typically use synthetic data to drive our characterization tests to make them reproducible. Real data can also be used and one comparison test I would recommend is comparing how long it takes to get approximate results using sketches versus how long it would take to get exact results using brute force methods. The bigger the data set is the better :) We don't have much experience with NumPy so this will be a new environment for us. But before you get too deep into this please get us involved. We have been characterizing these streaming algorithms for a number of years, and would like to help you. Cheers, Lee. On Sat, May 9, 2020 at 2:18 PM Michael Himes <[email protected]<mailto:[email protected]>> wrote: I'm not quite sure what being a committer entails, but yeah I'm happy to contribute. I can't commit a lot of time to working on it, but with how things went for KLL I don't think it will take a lot of time for the other sketches if they are formatted in a similar manner. Getting this library integrated into numpy/scipy would be awesome, I'm sure I could get some others in my field to begin using it. Michael ________________________________ From: Lee Rhodes <[email protected]<mailto:[email protected]>> Sent: Saturday, May 9, 2020 5:06 PM To: Michael Himes <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software This is just awesome! Would you be interested in becoming a committer on our project? It is not automatic, but we could work with you to bring you up to speed on the other sketches in the library. If you could help us integrate DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it would be a very significant contribution and we would definitely want you to be part of our community! Thanks, Lee. On Sat, May 9, 2020 at 1:41 PM Michael Himes <[email protected]<mailto:[email protected]>> wrote: Hi Lee, Thanks for the notice, I went ahead and subscribed to the list. As for Jon's email, this is actually what I have currently implemented! Once I finish ironing out a couple improvements, I'm going to move some code around to follow the existing coding style, put it on Github, and submit a pull request. Michael ________________________________ From: Lee Rhodes <[email protected]<mailto:[email protected]>> Sent: Saturday, May 9, 2020 4:22 PM To: Michael Himes <[email protected]<mailto:[email protected]>> Subject: Fwd: Permission to use KLL streaming-quantiles code in free open-source academic software Hi Michael, I don't think you saw this email as I doubt you are subscribed to our [email protected]<mailto:[email protected]> email list. We would like to have you as part of our larger community, as others might also have suggestions on how to move your project forward. You can subscribe by sending an empty email to [email protected]<mailto:[email protected]>. Lee. ---------- Forwarded message --------- From: Jon Malkin <[email protected]<mailto:[email protected]>> Date: Thu, May 7, 2020 at 4:11 PM Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software To: <[email protected]<mailto:[email protected]>> Cc: Lee Rhodes <[email protected]<mailto:[email protected]>>, Edo Liberty <[email protected]<mailto:[email protected]>>, [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> We're using pybind11 to get a C++ interface with python (vs raw C). The wrappers themselves are quite thin, but they do have examples of calling functions defined in the wrapper as opposed to only the sketch object. I believe the easiest way to do this will be to define a pretty simple C++ object and create a pybind wrapper for it. That object would contain a std::vector<kll_sketch>. Then you'd define an update method for your custom object that iterates through a numpy array and calls update() on the appropriate sketch. You'd also want to define something similar for get_quantile() or whatever other methods you need that iterates through that vector of sketches and returns the result in a numpy array. That's a pretty lightweight object. And then you'd use a similar thin pybind wrapper around it to make it play nicely with python. Since our C++ library is just templates, you'd end up with a free-standing library, with no requirement that the base datasketches library be involved. jon On Thu, May 7, 2020 at 1:08 PM Michael Himes <[email protected]<mailto:[email protected]>> wrote: I would be happy to share whatever I come up with (if anything). The lack of a Numpy/Scipy implementation is what led me to the DataSketches library, it would be very useful to myself and others if it were a part of Numpy/Scipy. For what it's worth, passing in a Numpy array and manipulating it from the C++ side is quite easy. On the other hand, figuring out how to spawn m sketches and pass the values along to that looks like it'll be more challenging, there is a lot of code here and it'll take some time for me to familiarize myself with it. Michael ________________________________ From: Lee Rhodes <[email protected]<mailto:[email protected]>> Sent: Thursday, May 7, 2020 12:00 PM To: Michael Himes <[email protected]<mailto:[email protected]>> Cc: Edo Liberty <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software If you do figure out how to do this, it would be great if you could share it with us. We would like to extend it to other sketches and submit it as an added functionality to NumPy. I have been looking at the NomPy and SciPy libraries and have not found anything close to what we have. Lee. On Thu, May 7, 2020 at 7:08 AM Michael Himes <[email protected]<mailto:[email protected]>> wrote: Hi Lee, Jon, Thanks for the information. I tried to vectorize things this morning and ran into that exact problem -- since the offsets can differ, it leads to slices of different lengths, which wouldn't be possible to store as a single Numpy array. Lee, your understanding of my problem is spot on. n vectors of size m, where all m elements of each vector are a float (no NaNs or missing values). I am interested in quantiles at rank r for each of the m streams. Only 1 sketch will operate simultaneously, saving/loading the sketch is not required (though it would be a nice feature), and sketches would not need to be merged (no serialization/deserialization). Not surprisingly, it looks like your original suggestion of handling this on the C++ side is the way to go. Once I have time to dive into the code, my plan is to write something that implements what you described in the earlier email. Thanks, Michael ________________________________ From: Lee Rhodes <[email protected]<mailto:[email protected]>> Sent: Wednesday, May 6, 2020 10:43 PM To: Michael Himes <[email protected]<mailto:[email protected]>> Cc: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>>; Edo Liberty <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software Michael, One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work for another reason and that is for each dimension, choosing whether to delete the odd or even values in the compactor must be random and independent of the other dimensions. Otherwise you might get unwanted correlation effects between the dimensions. This is another argument that you should have independent compactors for each dimension. So you might as well stick with individual sketches for each dimension. Lee. On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]<mailto:[email protected]>> wrote: Michael, Allow me to back up for a moment to make sure I understand your problem. You have a large number of large vectors of the form V_n = {x_i}: n vectors of size m, where x is a number and x_i is the ith element, or equivalently, the ith dimension. Assumptions: * All vectors, V, are of the same size m. * All elements, x_i, are valid numbers of the same type. No missing values, and if you are using floats, this means no NaNs. In aggregate, the n vectors represent m independent distributions of values. Your task is to be able to obtain m quantiles at rank r in a single query. **** To do this, using your idea, would require vectorization of the entire sketch and not just the compactors. The inputs are vectors, the result of operations such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are also vectors. This sketch will be a large data structure, which leads to more questions ... * Do you anticipate having many of these vectorized sketches operating simultaneously? * Is there any requirement to store and later retrieve this sketch? * Or, the nearly equivalent question: Do you require merging of these sketches (across clusters, for example)? Which also means serialization and deserialization. I am concerned that this vector-quantiles sketch would be limited in the sense that it may not be as widely applicable as it could be. Our experience with real data is that it is ugly with missing values, NaN, nulls, etc. Which means we would not be able to vectorize the compactor. Each dimension i would need a separate independent compactor because the compaction times will vary depending on missing values or NaNs in the data. Spacewise, I don't think having separate independent sketches for each dimension would be much smaller than vectorizing the entire sketch, because the internals of the existing sketch are already quite space efficient leveraging compact arrays, etc. As a first step I would favor figuring out how to access the NumPy data structure on the C++ side, having individual sketches for each dimension, and doing the iterations updating the sketches in C++. It also has the advantage of leveraging code that exists and it would automatically be able to leverage any improvements to the sketch code over time. In addition, it could be a prototype of how to integrate other sketches into the NumPy ecosystem. A fully vectorized sketch would be a separate implementation and would not be able to take advantage of these points. Lee. On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]<mailto:[email protected]>> wrote: Hi Lee, I don't think there is a problem with the DataSketches library, just that it doesn't support what I am trying to do -- looking in the documentation, it only supports streams of ints or floats, and those situations work fine for me. Here's what I did: - began with the KLL test .py file: https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C87f6dcd63e044eb182d608d7f85a1bf1%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637250938225407892&sdata=AAyQLO7aJVSOsZTMzfF84WhXjDyM%2B8ZWd2pXRdDOdUE%3D&reserved=0> - replaced line 30 with kll.update(np.ones(10) * randn()) to have a Numpy array of 10 identical values. - ran the code This leads to the following error, as expected: TypeError: update(): incompatible function arguments. The following argument types are supported: 1. (self: datasketches.kll_floats_sketch, item: float) -> None Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424]) It's not coded to support Numpy arrays, therefore it complains. What I would ideally like to have happen in this scenario is it would treat each element in the array as a separate stream. Then, later when getting a given quantile, it would give 10 values, one for each stream. I don't see an easy approach to implementing this on the Python side besides a very slow iterative approach, and admittedly my C++ is quite rusty so I haven't looked into the codebase to see how I might modify things there to support this functionality. Re: the streaming-quantiles code being easily modified, I believe the only necessary changes would be changing the Compactor class to be a subclass of numpy.ndarray, rather than list, and implementing methods for the list-specific methods that are used, like .append(). Then, it isn't necessary to loop over the streams since we can make use of Numpy's broadcasting, which will handle the looping in its C++ code, as you mentioned. I'll work on this and see if it really is as straight-forward as it seems. If you have any advice on how to use DataSketches for my problem, I'm certainly open to that. Thanks, Michael ________________________________ From: Lee Rhodes <[email protected]<mailto:[email protected]>> Sent: Wednesday, May 6, 2020 4:37 PM To: Michael Himes <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Cc: Edo Liberty <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software Michael, Thank you for considering the DataSketches library. I am adding this thread to our [email protected]<mailto:[email protected]> so that our whole team can contribute to finding a solution for you. WRT the error you experienced, please help us help you by sharing with us what the exact error was. We are about to release a major upgrade to the DataSketches C++/Python product in the next few weeks. We have fixed a number of stability issues and bugs, which may solve the problem. Nonetheless, we want to work with you to get your problem solved. Updating 1e5 sketches in a system is not a problem in Java or C++. We have real-time systems today that generate and process over 1e9 sketches every day. Unfortunately our experience tells us that looping in Python code will be 10 to 100 times slower than Java or C++. This is because the code would have to switch from Python to C++ for every vector element. By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors. I would like to understand more about what you have in mind that would be "easily modified". NumPy achieves its speed performance by doing all of the matrix operations in pre-compiled C++ code. To achieve best performance, we would want to read and loop through the NumPy data structure on the C++ side leveraging the C++ DataSketches library directly. I am not sure what would be involved to actually accomplish that. But first we need to get your Python + NumPy code working correctly with our library so we can find out what its actual performance is. Cheers, Lee. On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]<mailto:[email protected]>> wrote: Hi Edo, Lee, Thanks for the prompt response. I looked at the datasketches library, and while it seems to have a lot more features, it looks like it'll be a lot more difficult to get it to work for my desired use case. My problem is that I need quantiles for each element of a vector (length on the order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 1e8). I tried using datasketches's KLL with Numpy arrays, but it throws an error, so it doesn't seem like datasketches handles this situation currently. To use datasketches, I think I would need to instantiate 1 object per vector element, and I suspect this will slow things down considerably due to iterating over the objects when each vector is processed. By comparison, the streaming-quantiles code could be easily modified to use Numpy arrays and operate on vectors. I ran a few unit tests on both codes and found equivalent behavior, as expected. Do you have any recommendation(s) for this situation? Are there known limitations of the streaming-quantiles code that would cause issues for my use case? Are the other methods offered in datasketches 'better' than the KLL implemented in streaming-quantiles? I'm quite out of my area of expertise, so I appreciate any advice you can offer, and I will of course acknowledge it in the publication. Best, Michael ________________________________ From: Edo Liberty <[email protected]<mailto:[email protected]>> Sent: Tuesday, May 5, 2020 8:09 PM To: Lee Rhodes <[email protected]<mailto:[email protected]>>; Michael Himes <[email protected]<mailto:[email protected]>> Cc: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Subject: Re: Permission to use KLL streaming-quantiles code in free open-source academic software +Lee Hi Michael, Thanks for reaching out. While you can certainly do that, I recommend using the python-Binded datasketches library. It will be more robust, faster, and bug free than my code :) On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]<mailto:[email protected]>> wrote: Hi Edo, I'm currently working on a Python package for machine-learning-accelerated exoplanet modeling. It is free and open source (see here if you're curious https://github.com/exosports/HOMER<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C87f6dcd63e044eb182d608d7f85a1bf1%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637250938225407892&sdata=s2C1taJAzPJmVhheL3if8KWCIDb8fSyr6559GTGdHpU%3D&reserved=0>), and it's meant purely for reproducible academic research. I'm adding some new features to the software, and one of them requires computing quantiles for a data set that cannot fit into memory. After searching around for different methods to do this, your KLL method seemed to be a good option in terms of speed and space requirements. Rather than reinvent the wheel and code my own implementation of the method from scratch, I was wondering if you'd be willing to allow me to use your code? I don't see a license, so I wanted to make sure you're okay with this. I could implement it as a submodule within my repo, or I could only include the kll.py file and add some additional comments pointing to your repo and such, whichever you prefer. Best, Michael -- >From my cell phone.
