I also used k=160, so in this case we matched nicely. And the bunches of 2^5 or 2^7 you were testing is exactly what I meant when referring to batched inputs. So that's good news.
I'll take a more careful look through the code -- there was something with update using arrays of templated type A which was always cast to double, for instance. But this is certainly promising. jon On Tue, May 19, 2020 at 3:32 PM Michael Himes <[email protected]> wrote: > Great tests (especially with the ordering), Jon! > > I did some scaling tests for dimensionality (1, 10, and 100), and this is > where I think the Numpy version shows its benefits. I performed a test > similar to your setup: > - each sketch has k = 160 (unsure what you used for this value, if it > matters) > - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal) > - get_quantiles(0.5) > > d=1 -- 84 s (this is the 123 s case you ran) > d=10 -- 88 s > d=100 -- 294 s > d=1000 -- 2298 s (did this one for fun, but there is a lot of variability > in runtime) > > Note that I did not use a single-value method, just the Numpy version. > Also, I checked the compute cost of the Python loop, and it's about 1 > second, so most of that ~80 seconds is the communication between Python and > C++. The scaling relation looks to be better than linear, but there needs > to be a few more tests here to really determine that. > > But, as Lee pointed out, there is non-negligible overhead from crossing > the bridge between Python and C++. It's small, but when doing it 2^25 > times it adds up. The Numpy implementation allows you to cross that bridge > much less often, albeit at the cost of some extra time programming that > part. If I set up a queue that holds 2^5 values and then updates it, it's > quite a bit better. Here are the results for the same dimensions as before: > > d=1 -- 8 s > d=10 -- 31 s > d=100 -- 257 s > > So, even with a small queue of 32 values, we see that a single sketch > using kll_sketches is faster than a kll_sketch by a factor of 2-3. And > with the batch set to 2^7 values (this is how I use it in my project): > d=1 -- 4.2 s > d=10 -- 27 s > d=100 -- 251 s > > The speed gain doesn't seem to scale with dimensionality, but I think that > has more to do with the compute overhead of generating the data since Numpy > tends to be faster when working in 1D vs multiple dimensions. But we can > see that it's possible to get runtimes much closer to C++ runtimes than > would be expected. > > Michael > ------------------------------ > *From:* Jon Malkin <[email protected]> > *Sent:* Tuesday, May 19, 2020 4:58 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Well, one thought was maybe we could always use the vectorized kll in > python and make it (relatively) easy to have it work with only 1 dimension. > It looks like there's still a non-trivial performance hit from that. But > wow.. I realized I could try something simple like reversing the > declaration order of single-update vs vector-update in the wrapper class. > And that dropped it to 35s! > > With that, it may be worth exploring a unified wrapper that handles single > items or vectors. > > jon > > On Tue, May 19, 2020 at 1:52 PM leerho <[email protected]> wrote: > > We had a similar issue in Java trying to use JNI to access C code. Every > transition across the "boundary" between Java and C took from 10 to 100 > microseconds. This made the JNI option pretty useless from our > standpoint. > > I don't know python that well, but I could well imagine that there may be > a similar issue here in moving data between Python and C++. > > That being said, compared to brute-force computation of these types of > queries in Python vs using even these (what we consider slow performing) > sketches in Python still may be a huge win. > > Lee. > > > > On Tue, May 19, 2020 at 1:28 PM Jon Malkin <[email protected]> wrote: > > I tried comparing the performance of the existing floats sketch vs the new > thing with a single dimension. And then I made a second update method that > handles a single item rather than creating an array of length 1 each time. > Otherwise, the scripts were as identical as possible. I fed in 2^25 > gaussian-distributed values and queried for the mean to force some > computation on the sketch. I think get_quantile(0.5) vs > get_quantiles(0.5)[0][0] was the only difference, > > Existing kll_floats_sketch: 31s > kll_floatarray_sketches: 123s > with single-item update: 80s > > Same test in c++: 1.7s (I can get it to 1.4s but that's using a worse RNG > so this seemed more fair) > > I didn't try anything with trying to batch updates, even though in theory > the new object can support that. This was more a test to see the > performance impact of using it for all kll sketches. > > At some level, if you're already ok taking the speed hit for python vs C++ > then maybe it doesn't matter. But >2x still seems significant to me. > > jon > > On Thu, May 14, 2020 at 6:54 PM Michael Himes <[email protected]> > wrote: > > Great, I'll be submitting the pull request shortly. The codebase I'm > working with doesn't have any of the changes made in the past week or so, > hopefully that isn't too much of a hassle to merge. > > As an aside, my employer encourages us to contribute code to libraries > like this, so I'm happy to work on additional features for the Python > interface as needed. > > Michael > ------------------------------ > *From:* Jon Malkin <[email protected]> > *Sent:* Thursday, May 14, 2020 6:56 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > We've been polishing things up for a release, so that was one of several > things that we fixed over the last several days. Thank you for finding it! > > Anyway, if you're generally happy with the state of things (and are > allowed to under any employment terms), I'd encourage you to create pull > request to merge your changes into the main repo. It doesn't need to be > perfect as we can always make changes as part of the PR review or > post-merge. > > Thanks, > jon > > > On Mon, May 11, 2020 at 2:25 PM Michael Himes <[email protected]> > wrote: > > Thanks for taking a look, Jon. > > I pushed an update that address 2 & 4. > > #3 is actually something I had a question about. I've tested passing > numpy.nan into the update function, and it doesn't appear to break anything > (min, max, etc all still work correctly). However, the reported number of > items per sketch counts the nan entries. Is this the expected behavior, or > should the get_n() method return a number that does not count the nans it > has seen? I expected the latter, so I'm worried that numpy's nan is being > treated differently. > > Michael > ------------------------------ > *From:* Jon Malkin <[email protected]> > *Sent:* Monday, May 11, 2020 4:32 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > I didn't look in super close detail, but the code overall looks pretty > good. Comments are below. > > Note that not all of these necessarily need changes or replies. I'm just > trying to document things we'll want to think about for keeping the library > general-purpose (and we can always make changes after merging, of course). > > 1. I worry the name kll_sketches is confusingly similar to kll_sketch. > Maybe vector_kll_sketches? But if there's a way to extend KLL in the future > to operate on an entire vector at a time (vs treating each dimension > independently) that'd become confusing. I think an inherently vectorized > version would be a very different beast, but I always worry I'm not being > imaginative enough. If merging into the Apache codebase, I'd probably wait > to see what the file looks like with the renaming before a final decision > on moving to its own file. > > 2. What happens if the input to update() has >2 dimensions? If that'd be > invalid, we should explicitly check and complain. If it'll Do The Right > Thing by operating on the first 2 dimensions (meaning correct indices) > that's fine, but otherwise should probably complain. > > 3. Can this handle sparse input vectors? Not sure how important that is in > general, even if your project doesn't require it. kll_sketch will ignore > NaNs, so those appearing would mean the number of items per sketch can > already differ. > > 4. I'd probably eat the very slightly increased space and go with 32 bits > for the number of dimensions (aka number of sketches). If trying to look at > a distribution of values for some machine learning application, it'd be > easy to overflow 65k dimensions for some tasks. > > 5. I imagine you've realized that it's easiest to do unit tests from > python in this case. That's another advantage of having this live in the > wrapper. > > 6. Finally, that assert issue is already obsolete :). Asserts were > converted if/throw exceptions late last week. It'll be flagged as a > conflict in merging, so no worries for now. > > Looking good at this point. And as I said, not all of these need changes > or comments from you. > > jon > > On Mon, May 11, 2020 at 7:09 AM Michael Himes <[email protected]> > wrote: > > Understood, I went ahead and moved the new class to the kll_wrapper.cpp > file -- I'll leave it to you to decide if it's better as its own file. > > Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 > throws errors regarding the assert calls in kll_sketch_impl.hpp. I added > an include of assert.h there and then it compiled without issue. It's > possible that other compilers will also complain about that, so maybe this > is a good update to the main branch. > > Michael > ------------------------------ > *From:* Jon Malkin <[email protected]> > *Sent:* Sunday, May 10, 2020 10:47 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > My only comment without having looked at actual code is that the new class > would be more appropriate in the python wrapper. Maybe even drop it in as > it's own file, as that would decrease recompile time a bit when debugging > (that's pybind's suggestion, anyway). Probably not a huge difference with > how light these wrappers are. > > If this is something that becomes widely used, to where we look at pushing > it into the base library, we'd look at whether we could share any data > across sketches. But we're far from that point currently. It'd be nice to > need to consider that. > > jon > > On Sun, May 10, 2020, 7:33 PM leerho <[email protected]> wrote: > > Michael, this has been a great interchange and certainly will allow us to > move forward more quickly. > > Thank you for working on this on a Mother's Day Sunday! > > I'm sure Alex and Jon may have more questions, when they get a chance to > look at it starting tomorrow. > > Cheers, and be safe and well! > > Lee. > > On Sun, May 10, 2020 at 6:25 PM Michael Himes <[email protected]> > wrote: > > Re: testing, so far I've just done glorified unit tests for uniform and > normal distributions of varying sizes. I plan to do some timing tests vs > the existing single-sketch Python class to see how it compares for 1, 10, > and 100 streams. > > 1. That makes sense. One option to allow full Numpy compatibility but > without requiring a Python user to use Numpy would be to return everything > as lists, rather than Numpy arrays. Numpy users could then convert those > lists into arrays, and non-Numpy users would be unaffected (aside from > needing the pybind11/numpy.h header). Alternatively, some flag could be > set when instantiating the object that would control whether things are > returned as lists or arrays, though this still requires the numpy.h header > file. > > 2. I didn't change the kll_sketch code, I only defined a new (wrapper) > class called kll_sketches, which spawns a user-specified number of > sketches. Each of those sketches are kll_sketch objects and uses all of > the existing code for that. For fast execution in Python, the parallel > sketches must be spawned in C++, but the existing Python object could only > spawn a single sketch since it wraps the kll_sketch class. Perhaps the > kll_sketches class would be better placed in the python/src/kll_wrapper.cpp > file? I suppose you wouldn't need this class if you weren't using Python. > > 3. Yes, SerDe is very straight-forward here. I've marked some stuff as > todo's, and that is one of them -- the plan is to do like you described and > call the relevant kll_sketch method on each of the sketches and return that > to Python in a sensible format. For deserialization, it would just iterate > through them and load them into the kll_sketches object. I don't require > it for my project, so I didn't bother to wrap that yet -- I'll take a look > sometime this week after I finish my work for the day, shouldn't take long > to do. > > 4. That makes sense. Does using Numpy complicate that at all? My thought > is that since under the hood everything is using the existing kll_sketch > class, it would have full compatibility with the rest of the library (once > SerDe is added in). > > Michael > ------------------------------ > *From:* leerho <[email protected]> > *Sent:* Sunday, May 10, 2020 8:42 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Thanks for the link to your code. My colleagues, Jon and Alex, will take > a closer look this next week. They wrote this code so they are much closer > to it than I. > > What you have done so far makes sense for you as you want to get this > working in the NumPy environment as quickly as possible. As soon as we > start thinking about incorporating this into our library other concerns > become important. > > 1. Adding API calls is the recommended way to add functionality (like > NumPy) to a library. We cannot change API calls in a way that is only > useful with NumPy, because it would seriously impact other users of the > library that don't need NumPy. If both sets of calls cannot simultaneously > exist in the same sketch API, then we need to consider other alternatives. > > 2. Based on our previous discussions, I didn't envision that you would > have to change the kll_sketch code itself other than perhaps a "wrapper" > class that enables vectorized input to a vector of sketches and a > vectorized get result that creates a vector result from a vector of > sketches. This would isolate the changes you need for NumPy from the > sketch itself. This is also much easier to support, maintain and debug. > > 3. If you don't change the internals of the sketch then SerDe becomes > pretty straightforward. I don't know if you need a single serialization > that represents a full vector of sketches, but if you do, then I would > just iterate over the individual serdes and figure out how to package it. > I really don't think you want to have to rewrite this low-level stuff. > > 4. Binary compatibility is critically important for us and I think will be > important for you as well. There are two dimensions of binary > compatibility: history and language. This means that a kll sketch > serialized from Java, can be successfully read by C++ and visa versa. > Similarly, a kll sketch serialized today will be able to be read many years > from now. Another aspect of this would mean being able to collect, say, > 100 sketches that were not created using the NumPy version, and being able > to put them together in a NumPy vector; and visa versa. > > I hope all of this make sense to you. > > Cheers, > > Lee. > > > > On Sun, May 10, 2020 at 4:21 PM leerho <[email protected]> wrote: > > Michael, > This is great! What testing have you been able to do so far? > > > On Sun, May 10, 2020 at 3:31 PM Michael Himes <[email protected]> > wrote: > > Lee, > > Thanks for all of that information, it's quite helpful to get a better > understanding of things. > > I've put the code on Github if you'd like to take a look: > https://github.com/mdhimes/incubator-datasketches-cpp > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdhimes%2Fincubator-datasketches-cpp&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168301125&sdata=FJauemLpFxaPhlYPW8hXcSPwR53tR9crhHYWxHOWfsE%3D&reserved=0> > > Changes are > - new class in kll/include/kll_sketch.hpp, w/ associated constructor in > kll/include/kll_sketch_impl.hpp. This class spawns a specified number of > sketches. > - new Python interface functions in python/src/kll_wrapper.cpp > > The only new dependency introduced is the pybind11/numpy.h header file. > The new Numpy-compatible Python classes retain identical functionality to > the existing classes (with minor changes to method names, e.g., > get_min_value --> get_min_values), except that I have not yet implemented > merging or (de)serialization. These would be straight-forward to > implement, if needed. > > Re: characterization tests, I'll take a look at those tests you linked to > and see about running them, time and compute resources permitting. > > Michael > ------------------------------ > *From:* leerho <[email protected]> > *Sent:* Sunday, May 10, 2020 5:32 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Michael, > > Is there a place on GitHub somewhere where I could look at your code so > far? The reason I ask, is before you do a PR, we would like to determine > where a contribution such as this should be placed. > > Our library is split up among different repositories, determined by > language and dependencies. This keeps the user downloads smaller and more > focused. We have two library repos for the core sketch algorithms, one > for Java and one for C++/Python, where the dependencies are very lean, > which simplifies integration into other systems. We have separate repos > for adaptors, which depend on one of the core repos. On the Java side, we > have separate repos for adaptors for Apache Hive and Apache Pig, as the > dependencies for each of these are quite large. For C++, we have a > dedicated repo for the adaptors for PostgreSQL. > > Some of our adaptors are hosted with the target system. For example, our > Druid adaptors were contributed directly into Apache Druid. > > I assume your code has dependencies on Python, NumPy and DataSketches-cpp. > It is not clear to me at the moment whether we should create a separate > repo for this or have a separate group of directories in our cpp repo. > > **** > We have a separate repo for our characterization code, which is not > formally "released" as an Apache release. It exists because we want others > to be able to reproduce (or challenge) our claims of speed performance or > accuracy. It is the one repo where we have all languages and many > different dependencies. The coding style is not as rigorous or as well > documented as our repos that do have formal releases. > > Characterization testing is distinctly different from Unit Tests, which > basically checks all the main code paths and makes sure that the program > works as it should. The key metric is code coverage and Unit Tests should > be fast as it is run on every check-in of new code. Characterization is > also different from Integration Testing, which is testing how well the code > works when integrated into larger systems. > > Characterization tests are unique to our kind of library. Because our > algorithms are probabilistic in nature, in order to verify accuracy or > speed performance we need to run many thousands of trials to eliminate > statistical noise in the results. And when the data is large, this can > take a long time. You can peruse our website for many examples as all the > plots result from various characterization studies. What appears on the > website is but a small fraction of all the testing we have done. > > There are no "standard" tests as every sketch is different so we have to > decide what is important to measure for a particular sketch, but the basic > groups are *speed* and *accuracy*. > > For speed there are many possible measurements, but the basic ones are > update speed, merge speed, Serialization / Deserialization speed, get > estimate or get result speeds. > > For accuracy we want to validate that the sketch is performing within the > bounds of the theoretical error distribution. We want to measure this > accuracy in the context of a stand-alone, purely streaming sketch and also > in the context of merging many sketches together. > > We also try to do these same tests comparing the results against other > alternatives users might have. We have performed these same > characterizations on other publically available sketches as well as against > traditional, brute-force approaches to solving the same problem. > > For the solution you have developed, we would depend on you to decide what > properties would be most important to characterize for users of this > solution. It should be very similar to what you would write in a paper > describing this solution; you want to convince the reader that this is > very useful and why. > > Since the first sketch you have leveraged is the KLL quantiles sketch, I > would think you would want some characterizations similar to what we did > for our studies > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168311123&sdata=fC1FtRi8ex9h0o58K7KYGPL6i9Dv9xq4PiUDZ5yE91o%3D&reserved=0> > comparing our older quantiles sketch and the KLL sketch. > > **** > For the Java characterization tests, we have "standardized" on having > small configuration files which define the key parameters of the test. > These are simple text files > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168321118&sdata=1VDEQds1NCgY%2FRt6%2BEYLy3VcZz5uBtKEfpONGJhZnJg%3D&reserved=0> > of key-value pairs. We don't have any centralized definition of these > pairs, just that they are human readable and intelligible. They are > different for each type of sketch. > > For the C++ tests, we don't have a collection of config files yet (this is > one of our TODOs), but the same kind of parameters are set in the code > itself. > > We will likely want to set up a separate directory for your > characterization tests. > > I hope you find this helpful. > > Cheers, > > Lee. > > On Sun, May 10, 2020 at 10:05 AM Michael Himes <[email protected]> > wrote: > > The code is in a good state now. It can take individual values, lists, or > Numpy arrays as input, and it returns back Numpy arrays. There are some > additional features, like being able to specify which sketches the user > wants to, e.g., get quantiles for. > > But, I have only done minor testing with uniform and normal > distributions. I'd like to put it through more extensive testing (and some > documentation) before releasing it, and it sounds like your > characterization tests are the way to go -- it's not science if it's not > reproducible! Is there a standard set of tests for this purpose? If not, > are there standard tests that have been used for the existing codebase? > > Michael > ------------------------------ > *From:* leerho <[email protected]> > *Sent:* Saturday, May 9, 2020 7:21 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > This is great. The first step is to get your project working! Once you > think you are ready, it would be really useful if you could do some > characterization testing in the NumPy environment. Characterization tests > are what we run to fully understand how a sketch performs over a range of > parameters and using thousands to millions of trials. You can see some of > the accuracy and speed performance plots of various sketches on our > website. Sometimes these can take hours to run. We typically use > synthetic data to drive our characterization tests to make them > reproducible. > > Real data can also be used and one comparison test I would recommend is > comparing how long it takes to get approximate results using sketches > versus how long it would take to get exact results using brute force > methods. The bigger the data set is the better :) > > We don't have much experience with NumPy so this will be a new environment > for us. But before you get too deep into this please get us involved. We > have been characterizing these streaming algorithms for a number of years, > and would like to help you. > > Cheers, > > Lee. > > On Sat, May 9, 2020 at 2:18 PM Michael Himes <[email protected]> > wrote: > > I'm not quite sure what being a committer entails, but yeah I'm happy to > contribute. I can't commit a lot of time to working on it, but with how > things went for KLL I don't think it will take a lot of time for the other > sketches if they are formatted in a similar manner. Getting this library > integrated into numpy/scipy would be awesome, I'm sure I could get some > others in my field to begin using it. > > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Saturday, May 9, 2020 5:06 PM > *To:* Michael Himes <[email protected]>; [email protected] > <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > This is just awesome! Would you be interested in becoming a committer on > our project? It is not automatic, but we could work with you to bring you > up to speed on the other sketches in the library. If you could help us > integrate DataSketches into NumPy and possibly SciPy (not sure if this is > necessary) it would be a very significant contribution and we would > definitely want you to be part of our community! > > Thanks, > > Lee. > > On Sat, May 9, 2020 at 1:41 PM Michael Himes <[email protected]> > wrote: > > Hi Lee, > > Thanks for the notice, I went ahead and subscribed to the list. > > As for Jon's email, this is actually what I have currently implemented! > Once I finish ironing out a couple improvements, I'm going to move some > code around to follow the existing coding style, put it on Github, and > submit a pull request. > > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Saturday, May 9, 2020 4:22 PM > *To:* Michael Himes <[email protected]> > *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Hi Michael, > I don't think you saw this email as I doubt you are subscribed to our > [email protected] email list. > > We would like to have you as part of our larger community, as others might > also have suggestions on how to move your project forward. > You can subscribe by sending an empty email to > [email protected]. > > Lee. > > ---------- Forwarded message --------- > From: *Jon Malkin* <[email protected]> > Date: Thu, May 7, 2020 at 4:11 PM > Subject: Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > To: <[email protected]> > Cc: Lee Rhodes <[email protected]>, Edo Liberty < > [email protected]>, [email protected] <[email protected]> > > > We're using pybind11 to get a C++ interface with python (vs raw C). The > wrappers themselves are quite thin, but they do have examples of calling > functions defined in the wrapper as opposed to only the sketch object. > > I believe the easiest way to do this will be to define a pretty simple C++ > object and create a pybind wrapper for it. That object would contain a > std::vector<kll_sketch>. Then you'd define an update method for your > custom object that iterates through a numpy array and calls update() on the > appropriate sketch. You'd also want to define something similar for > get_quantile() or whatever other methods you need that iterates through > that vector of sketches and returns the result in a numpy array. > > That's a pretty lightweight object. And then you'd use a similar thin > pybind wrapper around it to make it play nicely with python. Since our C++ > library is just templates, you'd end up with a free-standing library, with > no requirement that the base datasketches library be involved. > > jon > > On Thu, May 7, 2020 at 1:08 PM Michael Himes <[email protected]> > wrote: > > I would be happy to share whatever I come up with (if anything). The lack > of a Numpy/Scipy implementation is what led me to the DataSketches library, > it would be very useful to myself and others if it were a part of > Numpy/Scipy. > > For what it's worth, passing in a Numpy array and manipulating it from the > C++ side is quite easy. On the other hand, figuring out how to spawn m > sketches and pass the values along to that looks like it'll be more > challenging, there is a lot of code here and it'll take some time for me to > familiarize myself with it. > > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Thursday, May 7, 2020 12:00 PM > *To:* Michael Himes <[email protected]> > *Cc:* Edo Liberty <[email protected]>; [email protected] < > [email protected]>; [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > If you do figure out how to do this, it would be great if you could share > it with us. We would like to extend it to other sketches and submit it as > an added functionality to NumPy. I have been looking at the NomPy and > SciPy libraries and have not found anything close to what we have. > > Lee. > > > On Thu, May 7, 2020 at 7:08 AM Michael Himes <[email protected]> > wrote: > > Hi Lee, Jon, > > Thanks for the information. I tried to vectorize things this morning and > ran into that exact problem -- since the offsets can differ, it leads to > slices of different lengths, which wouldn't be possible to store as a > single Numpy array. > > Lee, your understanding of my problem is spot on. n vectors of size m, > where all m elements of each vector are a float (no NaNs or missing > values). I am interested in quantiles at rank r for each of the m > streams. Only 1 sketch will operate simultaneously, saving/loading the > sketch is not required (though it would be a nice feature), and sketches > would not need to be merged (no serialization/deserialization). > > Not surprisingly, it looks like your original suggestion of handling this > on the C++ side is the way to go. Once I have time to dive into the code, > my plan is to write something that implements what you described in the > earlier email. > > Thanks, > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Wednesday, May 6, 2020 10:43 PM > *To:* Michael Himes <[email protected]> > *Cc:* [email protected] <[email protected]>; Edo > Liberty <[email protected]>; [email protected] <[email protected]> > > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Michael, > > One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not > work for another reason and that is for each dimension, choosing whether to > delete the odd or even values in the compactor must be random and > independent of the other dimensions. Otherwise you might get unwanted > correlation effects between the dimensions. > > This is another argument that you should have independent compactors for > each dimension. So you might as well stick with individual sketches for > each dimension. > > Lee. > > On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]> > wrote: > > Michael, > > Allow me to back up for a moment to make sure I understand your problem. > > You have a large number of large vectors of the form *V_n = {x_i}:* *n* > vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th > element, or equivalently, the *i*th dimension. > > Assumptions: > > - All vectors, *V*, are of the same size *m.* > - All elements, *x_i*, are valid numbers of the same type. No missing > values, and if you are using *floats*, this means no *NaN*s. > > In aggregate, the *n* vectors represent *m* *independent* distributions > of values. > > Your task is to be able to obtain *m* quantiles at rank *r* in a single > query. > > **** > To do this, using your idea, would require vectorization of the entire > sketch and not just the compactors. The inputs are vectors, the result of > operations such as getQuantile(r), getQuantileUpperBound(r), > getQuantileLowerBound(r), are also vectors. > > This sketch will be a large data structure, which leads to more questions > ... > > - Do you anticipate having many of these vectorized sketches operating > simultaneously? > - Is there any requirement to store and later retrieve this sketch? > - Or, the nearly equivalent question: Do you require merging of these > sketches (across clusters, for example)? Which also means serialization > and deserialization. > > I am concerned that this vector-quantiles sketch would be limited in the > sense that it may not be as widely applicable as it could be. > > Our experience with real data is that it is ugly with missing values, NaN, > nulls, etc. Which means we would not be able to vectorize the compactor. > Each dimension *i* would need a separate independent compactor because > the compaction times will vary depending on missing values or NaNs in the > data. > > Spacewise, I don't think having separate independent sketches for each > dimension would be much smaller than vectorizing the entire sketch, because > the internals of the existing sketch are already quite space efficient > leveraging compact arrays, etc. > > As a first step I would favor figuring out how to access the NumPy data > structure on the C++ side, having individual sketches for each > dimension, and doing the iterations updating the sketches in C++. It also > has the advantage of leveraging code that exists and it would automatically > be able to leverage any improvements to the sketch code over time. In > addition, it could be a prototype of how to integrate other sketches into > the NumPy ecosystem. > > A fully vectorized sketch would be a separate implementation and would not > be able to take advantage of these points. > > Lee. > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]> > wrote: > > Hi Lee, > > I don't think there is a problem with the DataSketches library, just that > it doesn't support what I am trying to do -- looking in the documentation, > it only supports streams of ints or floats, and those situations work fine > for me. Here's what I did: > - began with the KLL test .py file: > https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168321118&sdata=w7zrcQlOomiwHBox4fKCJaF327ob2ZCZN6pynYNuMDU%3D&reserved=0> > - replaced line 30 with kll.update(np.ones(10) * randn()) to have a Numpy > array of 10 identical values. > - ran the code > > This leads to the following error, as expected: > TypeError: update(): incompatible function arguments. The following > argument types are supported: > 1. (self: datasketches.kll_floats_sketch, item: float) -> None > > Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, > array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, > -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424]) > > It's not coded to support Numpy arrays, therefore it complains. What I > would ideally like to have happen in this scenario is it would treat each > element in the array as a separate stream. Then, later when getting a > given quantile, it would give 10 values, one for each stream. I don't see > an easy approach to implementing this on the Python side besides a very > slow iterative approach, and admittedly my C++ is quite rusty so I haven't > looked into the codebase to see how I might modify things there to support > this functionality. > > Re: the streaming-quantiles code being easily modified, I believe the only > necessary changes would be changing the Compactor class to be a subclass of > numpy.ndarray, rather than list, and implementing methods for the > list-specific methods that are used, like .append(). Then, it isn't > necessary to loop over the streams since we can make use of Numpy's > broadcasting, which will handle the looping in its C++ code, as you > mentioned. I'll work on this and see if it really is as straight-forward > as it seems. > > If you have any advice on how to use DataSketches for my problem, I'm > certainly open to that. > > Thanks, > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Wednesday, May 6, 2020 4:37 PM > *To:* Michael Himes <[email protected]>; [email protected] > <[email protected]> > *Cc:* Edo Liberty <[email protected]>; [email protected] < > [email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Michael, > > Thank you for considering the DataSketches library. I am adding this > thread to our [email protected] so that our whole team can > contribute to finding a solution for you. > > WRT the error you experienced, please help us help you by sharing with us > what the exact error was. > > We are about to release a major upgrade to the DataSketches C++/Python > product in the next few weeks. We have fixed a number of stability issues > and bugs, which may solve the problem. Nonetheless, we want to work with > you to get your problem solved. > > Updating 1e5 sketches in a system is not a problem in Java or C++. We > have real-time systems today that generate and process over 1e9 sketches > every day. Unfortunately our experience tells us that looping in Python > code will be 10 to 100 times slower than Java or C++. This is because the > code would have to switch from Python to C++ for every vector element. > > By comparison, the streaming-quantiles code could be easily modified to > use Numpy arrays and operate on vectors. > > > I would like to understand more about what you have in mind that would be > "easily modified". > > NumPy achieves its speed performance by doing all of the matrix operations > in pre-compiled C++ code. To achieve best performance, we would want to > read and loop through the NumPy data structure on the C++ side leveraging > the C++ DataSketches library directly. I am not sure what would be > involved to actually accomplish that. > > But first we need to get your Python + NumPy code working correctly with > our library so we can find out what its actual performance is. > > Cheers, > > Lee. > > > > > > On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]> > wrote: > > Hi Edo, Lee, > > Thanks for the prompt response. I looked at the datasketches library, and > while it seems to have a lot more features, it looks like it'll be a lot > more difficult to get it to work for my desired use case. > > My problem is that I need quantiles for each element of a vector (length > on the order of 1e4 -- 1e5), for some finite stream of vectors (on the > order of 1e6 -- 1e8). I tried using datasketches's KLL with Numpy arrays, > but it throws an error, so it doesn't seem like datasketches handles this > situation currently. > > To use datasketches, I think I would need to instantiate 1 object per > vector element, and I suspect this will slow things down considerably due > to iterating over the objects when each vector is processed. By > comparison, the streaming-quantiles code could be easily modified to use > Numpy arrays and operate on vectors. I ran a few unit tests on both codes > and found equivalent behavior, as expected. > > Do you have any recommendation(s) for this situation? Are there known > limitations of the streaming-quantiles code that would cause issues for my > use case? Are the other methods offered in datasketches 'better' than the > KLL implemented in streaming-quantiles? I'm quite out of my area of > expertise, so I appreciate any advice you can offer, and I will of course > acknowledge it in the publication. > > Best, > Michael > > ------------------------------ > *From:* Edo Liberty <[email protected]> > *Sent:* Tuesday, May 5, 2020 8:09 PM > *To:* Lee Rhodes <[email protected]>; Michael Himes < > [email protected]> > *Cc:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > +Lee > > Hi Michael, Thanks for reaching out. > While you can certainly do that, I recommend using the python-Binded > datasketches library. It will be more robust, faster, and bug free than my > code :) > > On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]> wrote: > > Hi Edo, > > I'm currently working on a Python package for machine-learning-accelerated > exoplanet modeling. It is free and open source (see here if you're curious > https://github.com/exosports/HOMER > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168331111&sdata=3AeG70Y7qlS1qoSTZ7U%2FHDJ3N4yX7%2FUJkK2ospj0nmw%3D&reserved=0>), > and it's meant purely for reproducible academic research. > > I'm adding some new features to the software, and one of them requires > computing quantiles for a data set that cannot fit into memory. After > searching around for different methods to do this, your KLL method seemed > to be a good option in terms of speed and space requirements. > > Rather than reinvent the wheel and code my own implementation of the > method from scratch, I was wondering if you'd be willing to allow me to use > your code? I don't see a license, so I wanted to make sure you're okay > with this. I could implement it as a submodule within my repo, or I could > only include the kll.py file and add some additional comments pointing to > your repo and such, whichever you prefer. > > Best, > Michael > > -- > From my cell phone. > >
