Michael, This is great! What testing have you been able to do so far?
On Sun, May 10, 2020 at 3:31 PM Michael Himes <[email protected]> wrote: > Lee, > > Thanks for all of that information, it's quite helpful to get a better > understanding of things. > > I've put the code on Github if you'd like to take a look: > https://github.com/mdhimes/incubator-datasketches-cpp > > Changes are > - new class in kll/include/kll_sketch.hpp, w/ associated constructor in > kll/include/kll_sketch_impl.hpp. This class spawns a specified number of > sketches. > - new Python interface functions in python/src/kll_wrapper.cpp > > The only new dependency introduced is the pybind11/numpy.h header file. > The new Numpy-compatible Python classes retain identical functionality to > the existing classes (with minor changes to method names, e.g., > get_min_value --> get_min_values), except that I have not yet implemented > merging or (de)serialization. These would be straight-forward to > implement, if needed. > > Re: characterization tests, I'll take a look at those tests you linked to > and see about running them, time and compute resources permitting. > > Michael > ------------------------------ > *From:* leerho <[email protected]> > *Sent:* Sunday, May 10, 2020 5:32 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Michael, > > Is there a place on GitHub somewhere where I could look at your code so > far? The reason I ask, is before you do a PR, we would like to determine > where a contribution such as this should be placed. > > Our library is split up among different repositories, determined by > language and dependencies. This keeps the user downloads smaller and more > focused. We have two library repos for the core sketch algorithms, one > for Java and one for C++/Python, where the dependencies are very lean, > which simplifies integration into other systems. We have separate repos > for adaptors, which depend on one of the core repos. On the Java side, we > have separate repos for adaptors for Apache Hive and Apache Pig, as the > dependencies for each of these are quite large. For C++, we have a > dedicated repo for the adaptors for PostgreSQL. > > Some of our adaptors are hosted with the target system. For example, our > Druid adaptors were contributed directly into Apache Druid. > > I assume your code has dependencies on Python, NumPy and DataSketches-cpp. > It is not clear to me at the moment whether we should create a separate > repo for this or have a separate group of directories in our cpp repo. > > **** > We have a separate repo for our characterization code, which is not > formally "released" as an Apache release. It exists because we want others > to be able to reproduce (or challenge) our claims of speed performance or > accuracy. It is the one repo where we have all languages and many > different dependencies. The coding style is not as rigorous or as well > documented as our repos that do have formal releases. > > Characterization testing is distinctly different from Unit Tests, which > basically checks all the main code paths and makes sure that the program > works as it should. The key metric is code coverage and Unit Tests should > be fast as it is run on every check-in of new code. Characterization is > also different from Integration Testing, which is testing how well the code > works when integrated into larger systems. > > Characterization tests are unique to our kind of library. Because our > algorithms are probabilistic in nature, in order to verify accuracy or > speed performance we need to run many thousands of trials to eliminate > statistical noise in the results. And when the data is large, this can > take a long time. You can peruse our website for many examples as all the > plots result from various characterization studies. What appears on the > website is but a small fraction of all the testing we have done. > > There are no "standard" tests as every sketch is different so we have to > decide what is important to measure for a particular sketch, but the basic > groups are *speed* and *accuracy*. > > For speed there are many possible measurements, but the basic ones are > update speed, merge speed, Serialization / Deserialization speed, get > estimate or get result speeds. > > For accuracy we want to validate that the sketch is performing within the > bounds of the theoretical error distribution. We want to measure this > accuracy in the context of a stand-alone, purely streaming sketch and also > in the context of merging many sketches together. > > We also try to do these same tests comparing the results against other > alternatives users might have. We have performed these same > characterizations on other publically available sketches as well as against > traditional, brute-force approaches to solving the same problem. > > For the solution you have developed, we would depend on you to decide what > properties would be most important to characterize for users of this > solution. It should be very similar to what you would write in a paper > describing this solution; you want to convince the reader that this is > very useful and why. > > Since the first sketch you have leveraged is the KLL quantiles sketch, I > would think you would want some characterizations similar to what we did > for our studies > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C6e6a337d40964f33d20008d7f529c5fc%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637247432083763402&sdata=8hoD1LWVL857LCe0nJtGFeF%2FHNk4yVKm%2BYhEuWAtaTc%3D&reserved=0> > comparing our older quantiles sketch and the KLL sketch. > > **** > For the Java characterization tests, we have "standardized" on having > small configuration files which define the key parameters of the test. > These are simple text files > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C6e6a337d40964f33d20008d7f529c5fc%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637247432083773400&sdata=%2FbROUq%2B1hItSCprW8BziTzQbM3%2BCHAWInQnXHSDFWeM%3D&reserved=0> > of key-value pairs. We don't have any centralized definition of these > pairs, just that they are human readable and intelligible. They are > different for each type of sketch. > > For the C++ tests, we don't have a collection of config files yet (this is > one of our TODOs), but the same kind of parameters are set in the code > itself. > > We will likely want to set up a separate directory for your > characterization tests. > > I hope you find this helpful. > > Cheers, > > Lee. > > On Sun, May 10, 2020 at 10:05 AM Michael Himes <[email protected]> > wrote: > > The code is in a good state now. It can take individual values, lists, or > Numpy arrays as input, and it returns back Numpy arrays. There are some > additional features, like being able to specify which sketches the user > wants to, e.g., get quantiles for. > > But, I have only done minor testing with uniform and normal > distributions. I'd like to put it through more extensive testing (and some > documentation) before releasing it, and it sounds like your > characterization tests are the way to go -- it's not science if it's not > reproducible! Is there a standard set of tests for this purpose? If not, > are there standard tests that have been used for the existing codebase? > > Michael > ------------------------------ > *From:* leerho <[email protected]> > *Sent:* Saturday, May 9, 2020 7:21 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > This is great. The first step is to get your project working! Once you > think you are ready, it would be really useful if you could do some > characterization testing in the NumPy environment. Characterization tests > are what we run to fully understand how a sketch performs over a range of > parameters and using thousands to millions of trials. You can see some of > the accuracy and speed performance plots of various sketches on our > website. Sometimes these can take hours to run. We typically use > synthetic data to drive our characterization tests to make them > reproducible. > > Real data can also be used and one comparison test I would recommend is > comparing how long it takes to get approximate results using sketches > versus how long it would take to get exact results using brute force > methods. The bigger the data set is the better :) > > We don't have much experience with NumPy so this will be a new environment > for us. But before you get too deep into this please get us involved. We > have been characterizing these streaming algorithms for a number of years, > and would like to help you. > > Cheers, > > Lee. > > On Sat, May 9, 2020 at 2:18 PM Michael Himes <[email protected]> > wrote: > > I'm not quite sure what being a committer entails, but yeah I'm happy to > contribute. I can't commit a lot of time to working on it, but with how > things went for KLL I don't think it will take a lot of time for the other > sketches if they are formatted in a similar manner. Getting this library > integrated into numpy/scipy would be awesome, I'm sure I could get some > others in my field to begin using it. > > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Saturday, May 9, 2020 5:06 PM > *To:* Michael Himes <[email protected]>; [email protected] > <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > This is just awesome! Would you be interested in becoming a committer on > our project? It is not automatic, but we could work with you to bring you > up to speed on the other sketches in the library. If you could help us > integrate DataSketches into NumPy and possibly SciPy (not sure if this is > necessary) it would be a very significant contribution and we would > definitely want you to be part of our community! > > Thanks, > > Lee. > > On Sat, May 9, 2020 at 1:41 PM Michael Himes <[email protected]> > wrote: > > Hi Lee, > > Thanks for the notice, I went ahead and subscribed to the list. > > As for Jon's email, this is actually what I have currently implemented! > Once I finish ironing out a couple improvements, I'm going to move some > code around to follow the existing coding style, put it on Github, and > submit a pull request. > > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Saturday, May 9, 2020 4:22 PM > *To:* Michael Himes <[email protected]> > *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Hi Michael, > I don't think you saw this email as I doubt you are subscribed to our > [email protected] email list. > > We would like to have you as part of our larger community, as others might > also have suggestions on how to move your project forward. > You can subscribe by sending an empty email to > [email protected]. > > Lee. > > ---------- Forwarded message --------- > From: *Jon Malkin* <[email protected]> > Date: Thu, May 7, 2020 at 4:11 PM > Subject: Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > To: <[email protected]> > Cc: Lee Rhodes <[email protected]>, Edo Liberty < > [email protected]>, [email protected] <[email protected]> > > > We're using pybind11 to get a C++ interface with python (vs raw C). The > wrappers themselves are quite thin, but they do have examples of calling > functions defined in the wrapper as opposed to only the sketch object. > > I believe the easiest way to do this will be to define a pretty simple C++ > object and create a pybind wrapper for it. That object would contain a > std::vector<kll_sketch>. Then you'd define an update method for your > custom object that iterates through a numpy array and calls update() on the > appropriate sketch. You'd also want to define something similar for > get_quantile() or whatever other methods you need that iterates through > that vector of sketches and returns the result in a numpy array. > > That's a pretty lightweight object. And then you'd use a similar thin > pybind wrapper around it to make it play nicely with python. Since our C++ > library is just templates, you'd end up with a free-standing library, with > no requirement that the base datasketches library be involved. > > jon > > On Thu, May 7, 2020 at 1:08 PM Michael Himes <[email protected]> > wrote: > > I would be happy to share whatever I come up with (if anything). The lack > of a Numpy/Scipy implementation is what led me to the DataSketches library, > it would be very useful to myself and others if it were a part of > Numpy/Scipy. > > For what it's worth, passing in a Numpy array and manipulating it from the > C++ side is quite easy. On the other hand, figuring out how to spawn m > sketches and pass the values along to that looks like it'll be more > challenging, there is a lot of code here and it'll take some time for me to > familiarize myself with it. > > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Thursday, May 7, 2020 12:00 PM > *To:* Michael Himes <[email protected]> > *Cc:* Edo Liberty <[email protected]>; [email protected] < > [email protected]>; [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > If you do figure out how to do this, it would be great if you could share > it with us. We would like to extend it to other sketches and submit it as > an added functionality to NumPy. I have been looking at the NomPy and > SciPy libraries and have not found anything close to what we have. > > Lee. > > > On Thu, May 7, 2020 at 7:08 AM Michael Himes <[email protected]> > wrote: > > Hi Lee, Jon, > > Thanks for the information. I tried to vectorize things this morning and > ran into that exact problem -- since the offsets can differ, it leads to > slices of different lengths, which wouldn't be possible to store as a > single Numpy array. > > Lee, your understanding of my problem is spot on. n vectors of size m, > where all m elements of each vector are a float (no NaNs or missing > values). I am interested in quantiles at rank r for each of the m > streams. Only 1 sketch will operate simultaneously, saving/loading the > sketch is not required (though it would be a nice feature), and sketches > would not need to be merged (no serialization/deserialization). > > Not surprisingly, it looks like your original suggestion of handling this > on the C++ side is the way to go. Once I have time to dive into the code, > my plan is to write something that implements what you described in the > earlier email. > > Thanks, > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Wednesday, May 6, 2020 10:43 PM > *To:* Michael Himes <[email protected]> > *Cc:* [email protected] <[email protected]>; Edo > Liberty <[email protected]>; [email protected] <[email protected]> > > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Michael, > > One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not > work for another reason and that is for each dimension, choosing whether to > delete the odd or even values in the compactor must be random and > independent of the other dimensions. Otherwise you might get unwanted > correlation effects between the dimensions. > > This is another argument that you should have independent compactors for > each dimension. So you might as well stick with individual sketches for > each dimension. > > Lee. > > On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]> > wrote: > > Michael, > > Allow me to back up for a moment to make sure I understand your problem. > > You have a large number of large vectors of the form *V_n = {x_i}:* *n* > vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th > element, or equivalently, the *i*th dimension. > > Assumptions: > > - All vectors, *V*, are of the same size *m.* > - All elements, *x_i*, are valid numbers of the same type. No missing > values, and if you are using *floats*, this means no *NaN*s. > > In aggregate, the *n* vectors represent *m* *independent* distributions > of values. > > Your task is to be able to obtain *m* quantiles at rank *r* in a single > query. > > **** > To do this, using your idea, would require vectorization of the entire > sketch and not just the compactors. The inputs are vectors, the result of > operations such as getQuantile(r), getQuantileUpperBound(r), > getQuantileLowerBound(r), are also vectors. > > This sketch will be a large data structure, which leads to more questions > ... > > - Do you anticipate having many of these vectorized sketches operating > simultaneously? > - Is there any requirement to store and later retrieve this sketch? > - Or, the nearly equivalent question: Do you require merging of these > sketches (across clusters, for example)? Which also means serialization > and deserialization. > > I am concerned that this vector-quantiles sketch would be limited in the > sense that it may not be as widely applicable as it could be. > > Our experience with real data is that it is ugly with missing values, NaN, > nulls, etc. Which means we would not be able to vectorize the compactor. > Each dimension *i* would need a separate independent compactor because > the compaction times will vary depending on missing values or NaNs in the > data. > > Spacewise, I don't think having separate independent sketches for each > dimension would be much smaller than vectorizing the entire sketch, because > the internals of the existing sketch are already quite space efficient > leveraging compact arrays, etc. > > As a first step I would favor figuring out how to access the NumPy data > structure on the C++ side, having individual sketches for each > dimension, and doing the iterations updating the sketches in C++. It also > has the advantage of leveraging code that exists and it would automatically > be able to leverage any improvements to the sketch code over time. In > addition, it could be a prototype of how to integrate other sketches into > the NumPy ecosystem. > > A fully vectorized sketch would be a separate implementation and would not > be able to take advantage of these points. > > Lee. > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]> > wrote: > > Hi Lee, > > I don't think there is a problem with the DataSketches library, just that > it doesn't support what I am trying to do -- looking in the documentation, > it only supports streams of ints or floats, and those situations work fine > for me. Here's what I did: > - began with the KLL test .py file: > https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C6e6a337d40964f33d20008d7f529c5fc%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637247432083773400&sdata=oVBbHOLxj1c4WzQr%2BZAROfSEZwmvIf4fpPKhVi409cY%3D&reserved=0> > - replaced line 30 with kll.update(np.ones(10) * randn()) to have a Numpy > array of 10 identical values. > - ran the code > > This leads to the following error, as expected: > TypeError: update(): incompatible function arguments. The following > argument types are supported: > 1. (self: datasketches.kll_floats_sketch, item: float) -> None > > Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, > array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, > -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424]) > > It's not coded to support Numpy arrays, therefore it complains. What I > would ideally like to have happen in this scenario is it would treat each > element in the array as a separate stream. Then, later when getting a > given quantile, it would give 10 values, one for each stream. I don't see > an easy approach to implementing this on the Python side besides a very > slow iterative approach, and admittedly my C++ is quite rusty so I haven't > looked into the codebase to see how I might modify things there to support > this functionality. > > Re: the streaming-quantiles code being easily modified, I believe the only > necessary changes would be changing the Compactor class to be a subclass of > numpy.ndarray, rather than list, and implementing methods for the > list-specific methods that are used, like .append(). Then, it isn't > necessary to loop over the streams since we can make use of Numpy's > broadcasting, which will handle the looping in its C++ code, as you > mentioned. I'll work on this and see if it really is as straight-forward > as it seems. > > If you have any advice on how to use DataSketches for my problem, I'm > certainly open to that. > > Thanks, > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Wednesday, May 6, 2020 4:37 PM > *To:* Michael Himes <[email protected]>; [email protected] > <[email protected]> > *Cc:* Edo Liberty <[email protected]>; [email protected] < > [email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Michael, > > Thank you for considering the DataSketches library. I am adding this > thread to our [email protected] so that our whole team can > contribute to finding a solution for you. > > WRT the error you experienced, please help us help you by sharing with us > what the exact error was. > > We are about to release a major upgrade to the DataSketches C++/Python > product in the next few weeks. We have fixed a number of stability issues > and bugs, which may solve the problem. Nonetheless, we want to work with > you to get your problem solved. > > Updating 1e5 sketches in a system is not a problem in Java or C++. We > have real-time systems today that generate and process over 1e9 sketches > every day. Unfortunately our experience tells us that looping in Python > code will be 10 to 100 times slower than Java or C++. This is because the > code would have to switch from Python to C++ for every vector element. > > By comparison, the streaming-quantiles code could be easily modified to > use Numpy arrays and operate on vectors. > > > I would like to understand more about what you have in mind that would be > "easily modified". > > NumPy achieves its speed performance by doing all of the matrix operations > in pre-compiled C++ code. To achieve best performance, we would want to > read and loop through the NumPy data structure on the C++ side leveraging > the C++ DataSketches library directly. I am not sure what would be > involved to actually accomplish that. > > But first we need to get your Python + NumPy code working correctly with > our library so we can find out what its actual performance is. > > Cheers, > > Lee. > > > > > > On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]> > wrote: > > Hi Edo, Lee, > > Thanks for the prompt response. I looked at the datasketches library, and > while it seems to have a lot more features, it looks like it'll be a lot > more difficult to get it to work for my desired use case. > > My problem is that I need quantiles for each element of a vector (length > on the order of 1e4 -- 1e5), for some finite stream of vectors (on the > order of 1e6 -- 1e8). I tried using datasketches's KLL with Numpy arrays, > but it throws an error, so it doesn't seem like datasketches handles this > situation currently. > > To use datasketches, I think I would need to instantiate 1 object per > vector element, and I suspect this will slow things down considerably due > to iterating over the objects when each vector is processed. By > comparison, the streaming-quantiles code could be easily modified to use > Numpy arrays and operate on vectors. I ran a few unit tests on both codes > and found equivalent behavior, as expected. > > Do you have any recommendation(s) for this situation? Are there known > limitations of the streaming-quantiles code that would cause issues for my > use case? Are the other methods offered in datasketches 'better' than the > KLL implemented in streaming-quantiles? I'm quite out of my area of > expertise, so I appreciate any advice you can offer, and I will of course > acknowledge it in the publication. > > Best, > Michael > > ------------------------------ > *From:* Edo Liberty <[email protected]> > *Sent:* Tuesday, May 5, 2020 8:09 PM > *To:* Lee Rhodes <[email protected]>; Michael Himes < > [email protected]> > *Cc:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > +Lee > > Hi Michael, Thanks for reaching out. > While you can certainly do that, I recommend using the python-Binded > datasketches library. It will be more robust, faster, and bug free than my > code :) > > On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]> wrote: > > Hi Edo, > > I'm currently working on a Python package for machine-learning-accelerated > exoplanet modeling. It is free and open source (see here if you're curious > https://github.com/exosports/HOMER > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C6e6a337d40964f33d20008d7f529c5fc%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637247432083783391&sdata=8xK%2F9E7wNDY30KDOfzmb5UPb2XEHeEaUnFAuhyIrX44%3D&reserved=0>), > and it's meant purely for reproducible academic research. > > I'm adding some new features to the software, and one of them requires > computing quantiles for a data set that cannot fit into memory. After > searching around for different methods to do this, your KLL method seemed > to be a good option in terms of speed and space requirements. > > Rather than reinvent the wheel and code my own implementation of the > method from scratch, I was wondering if you'd be willing to allow me to use > your code? I don't see a license, so I wanted to make sure you're okay > with this. I could implement it as a submodule within my repo, or I could > only include the kll.py file and add some additional comments pointing to > your repo and such, whichever you prefer. > > Best, > Michael > > -- > From my cell phone. > >
