Michael, I've restructured the object to be an actual C++ object with proper methods. And then I've gotten rid of all the casts to buffer in favor of just using the py::array_t<> that's passed in. Tha removes casting everything to double, and allows for range checks. Now an attempt to access sketch 7 in a 5-d array doesn't just segfault :)
Looking at pybind docs a bit more, it seems there are no hard guarantees on data layout in memory with numpy arrays -- if you transpose one, walking through with a pointer will return items In the wrong order. So update() ends up using items.at() instead (more on that in a moment). The whole thing is probably also copying values around more than necessary. Anyway, we can look at ways to optimize such things eventually, but for now I'm working on ensuring correctness and at least somewhat graceful failure. Anyway, item input order. If we have 1-d input, we implicitly assume we want d updates, one for each dimension in the object. It seems like the default for numpy is row-major order, which makes sense given C beneath the hood. But for inputting n points at a time, do you expect the matrix to be (d x n) or (n x d)? jon On Tue, May 19, 2020, 5:20 PM Michael Himes <[email protected]> wrote: > Re: the template type A, I set that for the Python array data type. A > Python float is 64 bits, so that is a C++ double. I thought it was > necessary to set the py::array_t data type since I think it's a template, > but I could be mistaken. > > Michael > > ------------------------------ > *From:* leerho <[email protected]> > *Sent:* Tuesday, May 19, 2020 7:46 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Excellent work! > > On Tue, May 19, 2020 at 4:04 PM Jon Malkin <[email protected]> wrote: > > I also used k=160, so in this case we matched nicely. And the bunches of > 2^5 or 2^7 you were testing is exactly what I meant when referring to > batched inputs. So that's good news. > > I'll take a more careful look through the code -- there was something with > update using arrays of templated type A which was always cast to double, > for instance. But this is certainly promising. > > jon > > On Tue, May 19, 2020 at 3:32 PM Michael Himes <[email protected]> > wrote: > > Great tests (especially with the ordering), Jon! > > I did some scaling tests for dimensionality (1, 10, and 100), and this is > where I think the Numpy version shows its benefits. I performed a test > similar to your setup: > - each sketch has k = 160 (unsure what you used for this value, if it > matters) > - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal) > - get_quantiles(0.5) > > d=1 -- 84 s (this is the 123 s case you ran) > d=10 -- 88 s > d=100 -- 294 s > d=1000 -- 2298 s (did this one for fun, but there is a lot of variability > in runtime) > > Note that I did not use a single-value method, just the Numpy version. > Also, I checked the compute cost of the Python loop, and it's about 1 > second, so most of that ~80 seconds is the communication between Python and > C++. The scaling relation looks to be better than linear, but there needs > to be a few more tests here to really determine that. > > But, as Lee pointed out, there is non-negligible overhead from crossing > the bridge between Python and C++. It's small, but when doing it 2^25 > times it adds up. The Numpy implementation allows you to cross that bridge > much less often, albeit at the cost of some extra time programming that > part. If I set up a queue that holds 2^5 values and then updates it, it's > quite a bit better. Here are the results for the same dimensions as before: > > d=1 -- 8 s > d=10 -- 31 s > d=100 -- 257 s > > So, even with a small queue of 32 values, we see that a single sketch > using kll_sketches is faster than a kll_sketch by a factor of 2-3. And > with the batch set to 2^7 values (this is how I use it in my project): > d=1 -- 4.2 s > d=10 -- 27 s > d=100 -- 251 s > > The speed gain doesn't seem to scale with dimensionality, but I think that > has more to do with the compute overhead of generating the data since Numpy > tends to be faster when working in 1D vs multiple dimensions. But we can > see that it's possible to get runtimes much closer to C++ runtimes than > would be expected. > > Michael > ------------------------------ > *From:* Jon Malkin <[email protected]> > *Sent:* Tuesday, May 19, 2020 4:58 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Well, one thought was maybe we could always use the vectorized kll in > python and make it (relatively) easy to have it work with only 1 dimension. > It looks like there's still a non-trivial performance hit from that. But > wow.. I realized I could try something simple like reversing the > declaration order of single-update vs vector-update in the wrapper class. > And that dropped it to 35s! > > With that, it may be worth exploring a unified wrapper that handles single > items or vectors. > > jon > > On Tue, May 19, 2020 at 1:52 PM leerho <[email protected]> wrote: > > We had a similar issue in Java trying to use JNI to access C code. Every > transition across the "boundary" between Java and C took from 10 to 100 > microseconds. This made the JNI option pretty useless from our > standpoint. > > I don't know python that well, but I could well imagine that there may be > a similar issue here in moving data between Python and C++. > > That being said, compared to brute-force computation of these types of > queries in Python vs using even these (what we consider slow performing) > sketches in Python still may be a huge win. > > Lee. > > > > On Tue, May 19, 2020 at 1:28 PM Jon Malkin <[email protected]> wrote: > > I tried comparing the performance of the existing floats sketch vs the new > thing with a single dimension. And then I made a second update method that > handles a single item rather than creating an array of length 1 each time. > Otherwise, the scripts were as identical as possible. I fed in 2^25 > gaussian-distributed values and queried for the mean to force some > computation on the sketch. I think get_quantile(0.5) vs > get_quantiles(0.5)[0][0] was the only difference, > > Existing kll_floats_sketch: 31s > kll_floatarray_sketches: 123s > with single-item update: 80s > > Same test in c++: 1.7s (I can get it to 1.4s but that's using a worse RNG > so this seemed more fair) > > I didn't try anything with trying to batch updates, even though in theory > the new object can support that. This was more a test to see the > performance impact of using it for all kll sketches. > > At some level, if you're already ok taking the speed hit for python vs C++ > then maybe it doesn't matter. But >2x still seems significant to me. > > jon > > On Thu, May 14, 2020 at 6:54 PM Michael Himes <[email protected]> > wrote: > > Great, I'll be submitting the pull request shortly. The codebase I'm > working with doesn't have any of the changes made in the past week or so, > hopefully that isn't too much of a hassle to merge. > > As an aside, my employer encourages us to contribute code to libraries > like this, so I'm happy to work on additional features for the Python > interface as needed. > > Michael > ------------------------------ > *From:* Jon Malkin <[email protected]> > *Sent:* Thursday, May 14, 2020 6:56 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > We've been polishing things up for a release, so that was one of several > things that we fixed over the last several days. Thank you for finding it! > > Anyway, if you're generally happy with the state of things (and are > allowed to under any employment terms), I'd encourage you to create pull > request to merge your changes into the main repo. It doesn't need to be > perfect as we can always make changes as part of the PR review or > post-merge. > > Thanks, > jon > > > On Mon, May 11, 2020 at 2:25 PM Michael Himes <[email protected]> > wrote: > > Thanks for taking a look, Jon. > > I pushed an update that address 2 & 4. > > #3 is actually something I had a question about. I've tested passing > numpy.nan into the update function, and it doesn't appear to break anything > (min, max, etc all still work correctly). However, the reported number of > items per sketch counts the nan entries. Is this the expected behavior, or > should the get_n() method return a number that does not count the nans it > has seen? I expected the latter, so I'm worried that numpy's nan is being > treated differently. > > Michael > ------------------------------ > *From:* Jon Malkin <[email protected]> > *Sent:* Monday, May 11, 2020 4:32 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > I didn't look in super close detail, but the code overall looks pretty > good. Comments are below. > > Note that not all of these necessarily need changes or replies. I'm just > trying to document things we'll want to think about for keeping the library > general-purpose (and we can always make changes after merging, of course). > > 1. I worry the name kll_sketches is confusingly similar to kll_sketch. > Maybe vector_kll_sketches? But if there's a way to extend KLL in the future > to operate on an entire vector at a time (vs treating each dimension > independently) that'd become confusing. I think an inherently vectorized > version would be a very different beast, but I always worry I'm not being > imaginative enough. If merging into the Apache codebase, I'd probably wait > to see what the file looks like with the renaming before a final decision > on moving to its own file. > > 2. What happens if the input to update() has >2 dimensions? If that'd be > invalid, we should explicitly check and complain. If it'll Do The Right > Thing by operating on the first 2 dimensions (meaning correct indices) > that's fine, but otherwise should probably complain. > > 3. Can this handle sparse input vectors? Not sure how important that is in > general, even if your project doesn't require it. kll_sketch will ignore > NaNs, so those appearing would mean the number of items per sketch can > already differ. > > 4. I'd probably eat the very slightly increased space and go with 32 bits > for the number of dimensions (aka number of sketches). If trying to look at > a distribution of values for some machine learning application, it'd be > easy to overflow 65k dimensions for some tasks. > > 5. I imagine you've realized that it's easiest to do unit tests from > python in this case. That's another advantage of having this live in the > wrapper. > > 6. Finally, that assert issue is already obsolete :). Asserts were > converted if/throw exceptions late last week. It'll be flagged as a > conflict in merging, so no worries for now. > > Looking good at this point. And as I said, not all of these need changes > or comments from you. > > jon > > On Mon, May 11, 2020 at 7:09 AM Michael Himes <[email protected]> > wrote: > > Understood, I went ahead and moved the new class to the kll_wrapper.cpp > file -- I'll leave it to you to decide if it's better as its own file. > > Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 > throws errors regarding the assert calls in kll_sketch_impl.hpp. I added > an include of assert.h there and then it compiled without issue. It's > possible that other compilers will also complain about that, so maybe this > is a good update to the main branch. > > Michael > ------------------------------ > *From:* Jon Malkin <[email protected]> > *Sent:* Sunday, May 10, 2020 10:47 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > My only comment without having looked at actual code is that the new class > would be more appropriate in the python wrapper. Maybe even drop it in as > it's own file, as that would decrease recompile time a bit when debugging > (that's pybind's suggestion, anyway). Probably not a huge difference with > how light these wrappers are. > > If this is something that becomes widely used, to where we look at pushing > it into the base library, we'd look at whether we could share any data > across sketches. But we're far from that point currently. It'd be nice to > need to consider that. > > jon > > On Sun, May 10, 2020, 7:33 PM leerho <[email protected]> wrote: > > Michael, this has been a great interchange and certainly will allow us to > move forward more quickly. > > Thank you for working on this on a Mother's Day Sunday! > > I'm sure Alex and Jon may have more questions, when they get a chance to > look at it starting tomorrow. > > Cheers, and be safe and well! > > Lee. > > On Sun, May 10, 2020 at 6:25 PM Michael Himes <[email protected]> > wrote: > > Re: testing, so far I've just done glorified unit tests for uniform and > normal distributions of varying sizes. I plan to do some timing tests vs > the existing single-sketch Python class to see how it compares for 1, 10, > and 100 streams. > > 1. That makes sense. One option to allow full Numpy compatibility but > without requiring a Python user to use Numpy would be to return everything > as lists, rather than Numpy arrays. Numpy users could then convert those > lists into arrays, and non-Numpy users would be unaffected (aside from > needing the pybind11/numpy.h header). Alternatively, some flag could be > set when instantiating the object that would control whether things are > returned as lists or arrays, though this still requires the numpy.h header > file. > > 2. I didn't change the kll_sketch code, I only defined a new (wrapper) > class called kll_sketches, which spawns a user-specified number of > sketches. Each of those sketches are kll_sketch objects and uses all of > the existing code for that. For fast execution in Python, the parallel > sketches must be spawned in C++, but the existing Python object could only > spawn a single sketch since it wraps the kll_sketch class. Perhaps the > kll_sketches class would be better placed in the python/src/kll_wrapper.cpp > file? I suppose you wouldn't need this class if you weren't using Python. > > 3. Yes, SerDe is very straight-forward here. I've marked some stuff as > todo's, and that is one of them -- the plan is to do like you described and > call the relevant kll_sketch method on each of the sketches and return that > to Python in a sensible format. For deserialization, it would just iterate > through them and load them into the kll_sketches object. I don't require > it for my project, so I didn't bother to wrap that yet -- I'll take a look > sometime this week after I finish my work for the day, shouldn't take long > to do. > > 4. That makes sense. Does using Numpy complicate that at all? My thought > is that since under the hood everything is using the existing kll_sketch > class, it would have full compatibility with the rest of the library (once > SerDe is added in). > > Michael > ------------------------------ > *From:* leerho <[email protected]> > *Sent:* Sunday, May 10, 2020 8:42 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Thanks for the link to your code. My colleagues, Jon and Alex, will take > a closer look this next week. They wrote this code so they are much closer > to it than I. > > What you have done so far makes sense for you as you want to get this > working in the NumPy environment as quickly as possible. As soon as we > start thinking about incorporating this into our library other concerns > become important. > > 1. Adding API calls is the recommended way to add functionality (like > NumPy) to a library. We cannot change API calls in a way that is only > useful with NumPy, because it would seriously impact other users of the > library that don't need NumPy. If both sets of calls cannot simultaneously > exist in the same sketch API, then we need to consider other alternatives. > > 2. Based on our previous discussions, I didn't envision that you would > have to change the kll_sketch code itself other than perhaps a "wrapper" > class that enables vectorized input to a vector of sketches and a > vectorized get result that creates a vector result from a vector of > sketches. This would isolate the changes you need for NumPy from the > sketch itself. This is also much easier to support, maintain and debug. > > 3. If you don't change the internals of the sketch then SerDe becomes > pretty straightforward. I don't know if you need a single serialization > that represents a full vector of sketches, but if you do, then I would > just iterate over the individual serdes and figure out how to package it. > I really don't think you want to have to rewrite this low-level stuff. > > 4. Binary compatibility is critically important for us and I think will be > important for you as well. There are two dimensions of binary > compatibility: history and language. This means that a kll sketch > serialized from Java, can be successfully read by C++ and visa versa. > Similarly, a kll sketch serialized today will be able to be read many years > from now. Another aspect of this would mean being able to collect, say, > 100 sketches that were not created using the NumPy version, and being able > to put them together in a NumPy vector; and visa versa. > > I hope all of this make sense to you. > > Cheers, > > Lee. > > > > On Sun, May 10, 2020 at 4:21 PM leerho <[email protected]> wrote: > > Michael, > This is great! What testing have you been able to do so far? > > > On Sun, May 10, 2020 at 3:31 PM Michael Himes <[email protected]> > wrote: > > Lee, > > Thanks for all of that information, it's quite helpful to get a better > understanding of things. > > I've put the code on Github if you'd like to take a look: > https://github.com/mdhimes/incubator-datasketches-cpp > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdhimes%2Fincubator-datasketches-cpp&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C724003c123964738f5c608d7fc4eec77%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255288231864264&sdata=SJp244g4tj6d%2B1Z0jS1uldEs7VUuKjyNJrO8MpGZtkE%3D&reserved=0> > > Changes are > - new class in kll/include/kll_sketch.hpp, w/ associated constructor in > kll/include/kll_sketch_impl.hpp. This class spawns a specified number of > sketches. > - new Python interface functions in python/src/kll_wrapper.cpp > > The only new dependency introduced is the pybind11/numpy.h header file. > The new Numpy-compatible Python classes retain identical functionality to > the existing classes (with minor changes to method names, e.g., > get_min_value --> get_min_values), except that I have not yet implemented > merging or (de)serialization. These would be straight-forward to > implement, if needed. > > Re: characterization tests, I'll take a look at those tests you linked to > and see about running them, time and compute resources permitting. > > Michael > ------------------------------ > *From:* leerho <[email protected]> > *Sent:* Sunday, May 10, 2020 5:32 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Michael, > > Is there a place on GitHub somewhere where I could look at your code so > far? The reason I ask, is before you do a PR, we would like to determine > where a contribution such as this should be placed. > > Our library is split up among different repositories, determined by > language and dependencies. This keeps the user downloads smaller and more > focused. We have two library repos for the core sketch algorithms, one > for Java and one for C++/Python, where the dependencies are very lean, > which simplifies integration into other systems. We have separate repos > for adaptors, which depend on one of the core repos. On the Java side, we > have separate repos for adaptors for Apache Hive and Apache Pig, as the > dependencies for each of these are quite large. For C++, we have a > dedicated repo for the adaptors for PostgreSQL. > > Some of our adaptors are hosted with the target system. For example, our > Druid adaptors were contributed directly into Apache Druid. > > I assume your code has dependencies on Python, NumPy and DataSketches-cpp. > It is not clear to me at the moment whether we should create a separate > repo for this or have a separate group of directories in our cpp repo. > > **** > We have a separate repo for our characterization code, which is not > formally "released" as an Apache release. It exists because we want others > to be able to reproduce (or challenge) our claims of speed performance or > accuracy. It is the one repo where we have all languages and many > different dependencies. The coding style is not as rigorous or as well > documented as our repos that do have formal releases. > > Characterization testing is distinctly different from Unit Tests, which > basically checks all the main code paths and makes sure that the program > works as it should. The key metric is code coverage and Unit Tests should > be fast as it is run on every check-in of new code. Characterization is > also different from Integration Testing, which is testing how well the code > works when integrated into larger systems. > > Characterization tests are unique to our kind of library. Because our > algorithms are probabilistic in nature, in order to verify accuracy or > speed performance we need to run many thousands of trials to eliminate > statistical noise in the results. And when the data is large, this can > take a long time. You can peruse our website for many examples as all the > plots result from various characterization studies. What appears on the > website is but a small fraction of all the testing we have done. > > There are no "standard" tests as every sketch is different so we have to > decide what is important to measure for a particular sketch, but the basic > groups are *speed* and *accuracy*. > > For speed there are many possible measurements, but the basic ones are > update speed, merge speed, Serialization / Deserialization speed, get > estimate or get result speeds. > > For accuracy we want to validate that the sketch is performing within the > bounds of the theoretical error distribution. We want to measure this > accuracy in the context of a stand-alone, purely streaming sketch and also > in the context of merging many sketches together. > > We also try to do these same tests comparing the results against other > alternatives users might have. We have performed these same > characterizations on other publically available sketches as well as against > traditional, brute-force approaches to solving the same problem. > > For the solution you have developed, we would depend on you to decide what > properties would be most important to characterize for users of this > solution. It should be very similar to what you would write in a paper > describing this solution; you want to convince the reader that this is > very useful and why. > > Since the first sketch you have leveraged is the KLL quantiles sketch, I > would think you would want some characterizations similar to what we did > for our studies > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C724003c123964738f5c608d7fc4eec77%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255288231874261&sdata=eVujcwbqKNxloXDHf5nzcAuMBjCCg3ZRDqfM09SsCRk%3D&reserved=0> > comparing our older quantiles sketch and the KLL sketch. > > **** > For the Java characterization tests, we have "standardized" on having > small configuration files which define the key parameters of the test. > These are simple text files > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C724003c123964738f5c608d7fc4eec77%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255288231874261&sdata=mcLZSrrXzoREYjCQnhSg1GAfPSeJpocrJwBQmVJJzLQ%3D&reserved=0> > of key-value pairs. We don't have any centralized definition of these > pairs, just that they are human readable and intelligible. They are > different for each type of sketch. > > For the C++ tests, we don't have a collection of config files yet (this is > one of our TODOs), but the same kind of parameters are set in the code > itself. > > We will likely want to set up a separate directory for your > characterization tests. > > I hope you find this helpful. > > Cheers, > > Lee. > > On Sun, May 10, 2020 at 10:05 AM Michael Himes <[email protected]> > wrote: > > The code is in a good state now. It can take individual values, lists, or > Numpy arrays as input, and it returns back Numpy arrays. There are some > additional features, like being able to specify which sketches the user > wants to, e.g., get quantiles for. > > But, I have only done minor testing with uniform and normal > distributions. I'd like to put it through more extensive testing (and some > documentation) before releasing it, and it sounds like your > characterization tests are the way to go -- it's not science if it's not > reproducible! Is there a standard set of tests for this purpose? If not, > are there standard tests that have been used for the existing codebase? > > Michael > ------------------------------ > *From:* leerho <[email protected]> > *Sent:* Saturday, May 9, 2020 7:21 PM > *To:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > This is great. The first step is to get your project working! Once you > think you are ready, it would be really useful if you could do some > characterization testing in the NumPy environment. Characterization tests > are what we run to fully understand how a sketch performs over a range of > parameters and using thousands to millions of trials. You can see some of > the accuracy and speed performance plots of various sketches on our > website. Sometimes these can take hours to run. We typically use > synthetic data to drive our characterization tests to make them > reproducible. > > Real data can also be used and one comparison test I would recommend is > comparing how long it takes to get approximate results using sketches > versus how long it would take to get exact results using brute force > methods. The bigger the data set is the better :) > > We don't have much experience with NumPy so this will be a new environment > for us. But before you get too deep into this please get us involved. We > have been characterizing these streaming algorithms for a number of years, > and would like to help you. > > Cheers, > > Lee. > > On Sat, May 9, 2020 at 2:18 PM Michael Himes <[email protected]> > wrote: > > I'm not quite sure what being a committer entails, but yeah I'm happy to > contribute. I can't commit a lot of time to working on it, but with how > things went for KLL I don't think it will take a lot of time for the other > sketches if they are formatted in a similar manner. Getting this library > integrated into numpy/scipy would be awesome, I'm sure I could get some > others in my field to begin using it. > > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Saturday, May 9, 2020 5:06 PM > *To:* Michael Himes <[email protected]>; [email protected] > <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > This is just awesome! Would you be interested in becoming a committer on > our project? It is not automatic, but we could work with you to bring you > up to speed on the other sketches in the library. If you could help us > integrate DataSketches into NumPy and possibly SciPy (not sure if this is > necessary) it would be a very significant contribution and we would > definitely want you to be part of our community! > > Thanks, > > Lee. > > On Sat, May 9, 2020 at 1:41 PM Michael Himes <[email protected]> > wrote: > > Hi Lee, > > Thanks for the notice, I went ahead and subscribed to the list. > > As for Jon's email, this is actually what I have currently implemented! > Once I finish ironing out a couple improvements, I'm going to move some > code around to follow the existing coding style, put it on Github, and > submit a pull request. > > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Saturday, May 9, 2020 4:22 PM > *To:* Michael Himes <[email protected]> > *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Hi Michael, > I don't think you saw this email as I doubt you are subscribed to our > [email protected] email list. > > We would like to have you as part of our larger community, as others might > also have suggestions on how to move your project forward. > You can subscribe by sending an empty email to > [email protected]. > > Lee. > > ---------- Forwarded message --------- > From: *Jon Malkin* <[email protected]> > Date: Thu, May 7, 2020 at 4:11 PM > Subject: Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > To: <[email protected]> > Cc: Lee Rhodes <[email protected]>, Edo Liberty < > [email protected]>, [email protected] <[email protected]> > > > We're using pybind11 to get a C++ interface with python (vs raw C). The > wrappers themselves are quite thin, but they do have examples of calling > functions defined in the wrapper as opposed to only the sketch object. > > I believe the easiest way to do this will be to define a pretty simple C++ > object and create a pybind wrapper for it. That object would contain a > std::vector<kll_sketch>. Then you'd define an update method for your > custom object that iterates through a numpy array and calls update() on the > appropriate sketch. You'd also want to define something similar for > get_quantile() or whatever other methods you need that iterates through > that vector of sketches and returns the result in a numpy array. > > That's a pretty lightweight object. And then you'd use a similar thin > pybind wrapper around it to make it play nicely with python. Since our C++ > library is just templates, you'd end up with a free-standing library, with > no requirement that the base datasketches library be involved. > > jon > > On Thu, May 7, 2020 at 1:08 PM Michael Himes <[email protected]> > wrote: > > I would be happy to share whatever I come up with (if anything). The lack > of a Numpy/Scipy implementation is what led me to the DataSketches library, > it would be very useful to myself and others if it were a part of > Numpy/Scipy. > > For what it's worth, passing in a Numpy array and manipulating it from the > C++ side is quite easy. On the other hand, figuring out how to spawn m > sketches and pass the values along to that looks like it'll be more > challenging, there is a lot of code here and it'll take some time for me to > familiarize myself with it. > > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Thursday, May 7, 2020 12:00 PM > *To:* Michael Himes <[email protected]> > *Cc:* Edo Liberty <[email protected]>; [email protected] < > [email protected]>; [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > If you do figure out how to do this, it would be great if you could share > it with us. We would like to extend it to other sketches and submit it as > an added functionality to NumPy. I have been looking at the NomPy and > SciPy libraries and have not found anything close to what we have. > > Lee. > > > On Thu, May 7, 2020 at 7:08 AM Michael Himes <[email protected]> > wrote: > > Hi Lee, Jon, > > Thanks for the information. I tried to vectorize things this morning and > ran into that exact problem -- since the offsets can differ, it leads to > slices of different lengths, which wouldn't be possible to store as a > single Numpy array. > > Lee, your understanding of my problem is spot on. n vectors of size m, > where all m elements of each vector are a float (no NaNs or missing > values). I am interested in quantiles at rank r for each of the m > streams. Only 1 sketch will operate simultaneously, saving/loading the > sketch is not required (though it would be a nice feature), and sketches > would not need to be merged (no serialization/deserialization). > > Not surprisingly, it looks like your original suggestion of handling this > on the C++ side is the way to go. Once I have time to dive into the code, > my plan is to write something that implements what you described in the > earlier email. > > Thanks, > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Wednesday, May 6, 2020 10:43 PM > *To:* Michael Himes <[email protected]> > *Cc:* [email protected] <[email protected]>; Edo > Liberty <[email protected]>; [email protected] <[email protected]> > > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Michael, > > One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not > work for another reason and that is for each dimension, choosing whether to > delete the odd or even values in the compactor must be random and > independent of the other dimensions. Otherwise you might get unwanted > correlation effects between the dimensions. > > This is another argument that you should have independent compactors for > each dimension. So you might as well stick with individual sketches for > each dimension. > > Lee. > > On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]> > wrote: > > Michael, > > Allow me to back up for a moment to make sure I understand your problem. > > You have a large number of large vectors of the form *V_n = {x_i}:* *n* > vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th > element, or equivalently, the *i*th dimension. > > Assumptions: > > - All vectors, *V*, are of the same size *m.* > - All elements, *x_i*, are valid numbers of the same type. No missing > values, and if you are using *floats*, this means no *NaN*s. > > In aggregate, the *n* vectors represent *m* *independent* distributions > of values. > > Your task is to be able to obtain *m* quantiles at rank *r* in a single > query. > > **** > To do this, using your idea, would require vectorization of the entire > sketch and not just the compactors. The inputs are vectors, the result of > operations such as getQuantile(r), getQuantileUpperBound(r), > getQuantileLowerBound(r), are also vectors. > > This sketch will be a large data structure, which leads to more questions > ... > > - Do you anticipate having many of these vectorized sketches operating > simultaneously? > - Is there any requirement to store and later retrieve this sketch? > - Or, the nearly equivalent question: Do you require merging of these > sketches (across clusters, for example)? Which also means serialization > and deserialization. > > I am concerned that this vector-quantiles sketch would be limited in the > sense that it may not be as widely applicable as it could be. > > Our experience with real data is that it is ugly with missing values, NaN, > nulls, etc. Which means we would not be able to vectorize the compactor. > Each dimension *i* would need a separate independent compactor because > the compaction times will vary depending on missing values or NaNs in the > data. > > Spacewise, I don't think having separate independent sketches for each > dimension would be much smaller than vectorizing the entire sketch, because > the internals of the existing sketch are already quite space efficient > leveraging compact arrays, etc. > > As a first step I would favor figuring out how to access the NumPy data > structure on the C++ side, having individual sketches for each > dimension, and doing the iterations updating the sketches in C++. It also > has the advantage of leveraging code that exists and it would automatically > be able to leverage any improvements to the sketch code over time. In > addition, it could be a prototype of how to integrate other sketches into > the NumPy ecosystem. > > A fully vectorized sketch would be a separate implementation and would not > be able to take advantage of these points. > > Lee. > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]> > wrote: > > Hi Lee, > > I don't think there is a problem with the DataSketches library, just that > it doesn't support what I am trying to do -- looking in the documentation, > it only supports streams of ints or floats, and those situations work fine > for me. Here's what I did: > - began with the KLL test .py file: > https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C724003c123964738f5c608d7fc4eec77%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255288231884256&sdata=Pq5xyLFYHKAVnD1pUFMy9VpCZ4ceDUKv34f8Afea%2B%2F0%3D&reserved=0> > - replaced line 30 with kll.update(np.ones(10) * randn()) to have a Numpy > array of 10 identical values. > - ran the code > > This leads to the following error, as expected: > TypeError: update(): incompatible function arguments. The following > argument types are supported: > 1. (self: datasketches.kll_floats_sketch, item: float) -> None > > Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, > array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, > -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424]) > > It's not coded to support Numpy arrays, therefore it complains. What I > would ideally like to have happen in this scenario is it would treat each > element in the array as a separate stream. Then, later when getting a > given quantile, it would give 10 values, one for each stream. I don't see > an easy approach to implementing this on the Python side besides a very > slow iterative approach, and admittedly my C++ is quite rusty so I haven't > looked into the codebase to see how I might modify things there to support > this functionality. > > Re: the streaming-quantiles code being easily modified, I believe the only > necessary changes would be changing the Compactor class to be a subclass of > numpy.ndarray, rather than list, and implementing methods for the > list-specific methods that are used, like .append(). Then, it isn't > necessary to loop over the streams since we can make use of Numpy's > broadcasting, which will handle the looping in its C++ code, as you > mentioned. I'll work on this and see if it really is as straight-forward > as it seems. > > If you have any advice on how to use DataSketches for my problem, I'm > certainly open to that. > > Thanks, > Michael > ------------------------------ > *From:* Lee Rhodes <[email protected]> > *Sent:* Wednesday, May 6, 2020 4:37 PM > *To:* Michael Himes <[email protected]>; [email protected] > <[email protected]> > *Cc:* Edo Liberty <[email protected]>; [email protected] < > [email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > Michael, > > Thank you for considering the DataSketches library. I am adding this > thread to our [email protected] so that our whole team can > contribute to finding a solution for you. > > WRT the error you experienced, please help us help you by sharing with us > what the exact error was. > > We are about to release a major upgrade to the DataSketches C++/Python > product in the next few weeks. We have fixed a number of stability issues > and bugs, which may solve the problem. Nonetheless, we want to work with > you to get your problem solved. > > Updating 1e5 sketches in a system is not a problem in Java or C++. We > have real-time systems today that generate and process over 1e9 sketches > every day. Unfortunately our experience tells us that looping in Python > code will be 10 to 100 times slower than Java or C++. This is because the > code would have to switch from Python to C++ for every vector element. > > By comparison, the streaming-quantiles code could be easily modified to > use Numpy arrays and operate on vectors. > > > I would like to understand more about what you have in mind that would be > "easily modified". > > NumPy achieves its speed performance by doing all of the matrix operations > in pre-compiled C++ code. To achieve best performance, we would want to > read and loop through the NumPy data structure on the C++ side leveraging > the C++ DataSketches library directly. I am not sure what would be > involved to actually accomplish that. > > But first we need to get your Python + NumPy code working correctly with > our library so we can find out what its actual performance is. > > Cheers, > > Lee. > > > > > > On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]> > wrote: > > Hi Edo, Lee, > > Thanks for the prompt response. I looked at the datasketches library, and > while it seems to have a lot more features, it looks like it'll be a lot > more difficult to get it to work for my desired use case. > > My problem is that I need quantiles for each element of a vector (length > on the order of 1e4 -- 1e5), for some finite stream of vectors (on the > order of 1e6 -- 1e8). I tried using datasketches's KLL with Numpy arrays, > but it throws an error, so it doesn't seem like datasketches handles this > situation currently. > > To use datasketches, I think I would need to instantiate 1 object per > vector element, and I suspect this will slow things down considerably due > to iterating over the objects when each vector is processed. By > comparison, the streaming-quantiles code could be easily modified to use > Numpy arrays and operate on vectors. I ran a few unit tests on both codes > and found equivalent behavior, as expected. > > Do you have any recommendation(s) for this situation? Are there known > limitations of the streaming-quantiles code that would cause issues for my > use case? Are the other methods offered in datasketches 'better' than the > KLL implemented in streaming-quantiles? I'm quite out of my area of > expertise, so I appreciate any advice you can offer, and I will of course > acknowledge it in the publication. > > Best, > Michael > > ------------------------------ > *From:* Edo Liberty <[email protected]> > *Sent:* Tuesday, May 5, 2020 8:09 PM > *To:* Lee Rhodes <[email protected]>; Michael Himes < > [email protected]> > *Cc:* [email protected] <[email protected]> > *Subject:* Re: Permission to use KLL streaming-quantiles code in free > open-source academic software > > +Lee > > Hi Michael, Thanks for reaching out. > While you can certainly do that, I recommend using the python-Binded > datasketches library. It will be more robust, faster, and bug free than my > code :) > > On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]> wrote: > > Hi Edo, > > I'm currently working on a Python package for machine-learning-accelerated > exoplanet modeling. It is free and open source (see here if you're curious > https://github.com/exosports/HOMER > <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C724003c123964738f5c608d7fc4eec77%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255288231884256&sdata=c4zyquWhfPR3ITQXOF7rGefRz4SWgxI3MT3a3d%2Bvyww%3D&reserved=0>), > and it's meant purely for reproducible academic research. > > I'm adding some new features to the software, and one of them requires > computing quantiles for a data set that cannot fit into memory. After > searching around for different methods to do this, your KLL method seemed > to be a good option in terms of speed and space requirements. > > Rather than reinvent the wheel and code my own implementation of the > method from scratch, I was wondering if you'd be willing to allow me to use > your code? I don't see a license, so I wanted to make sure you're okay > with this. I could implement it as a submodule within my repo, or I could > only include the kll.py file and add some additional comments pointing to > your repo and such, whichever you prefer. > > Best, > Michael > > -- > From my cell phone. > >
