Excellent work! On Tue, May 19, 2020 at 4:04 PM Jon Malkin <[email protected]> wrote:
> I also used k=160, so in this case we matched nicely. And the bunches of > 2^5 or 2^7 you were testing is exactly what I meant when referring to > batched inputs. So that's good news. > > I'll take a more careful look through the code -- there was something with > update using arrays of templated type A which was always cast to double, > for instance. But this is certainly promising. > > jon > > On Tue, May 19, 2020 at 3:32 PM Michael Himes <[email protected]> > wrote: > >> Great tests (especially with the ordering), Jon! >> >> I did some scaling tests for dimensionality (1, 10, and 100), and this is >> where I think the Numpy version shows its benefits. I performed a test >> similar to your setup: >> - each sketch has k = 160 (unsure what you used for this value, if it >> matters) >> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal) >> - get_quantiles(0.5) >> >> d=1 -- 84 s (this is the 123 s case you ran) >> d=10 -- 88 s >> d=100 -- 294 s >> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability >> in runtime) >> >> Note that I did not use a single-value method, just the Numpy version. >> Also, I checked the compute cost of the Python loop, and it's about 1 >> second, so most of that ~80 seconds is the communication between Python and >> C++. The scaling relation looks to be better than linear, but there needs >> to be a few more tests here to really determine that. >> >> But, as Lee pointed out, there is non-negligible overhead from crossing >> the bridge between Python and C++. It's small, but when doing it 2^25 >> times it adds up. The Numpy implementation allows you to cross that bridge >> much less often, albeit at the cost of some extra time programming that >> part. If I set up a queue that holds 2^5 values and then updates it, it's >> quite a bit better. Here are the results for the same dimensions as before: >> >> d=1 -- 8 s >> d=10 -- 31 s >> d=100 -- 257 s >> >> So, even with a small queue of 32 values, we see that a single sketch >> using kll_sketches is faster than a kll_sketch by a factor of 2-3. And >> with the batch set to 2^7 values (this is how I use it in my project): >> d=1 -- 4.2 s >> d=10 -- 27 s >> d=100 -- 251 s >> >> The speed gain doesn't seem to scale with dimensionality, but I think >> that has more to do with the compute overhead of generating the data since >> Numpy tends to be faster when working in 1D vs multiple dimensions. But we >> can see that it's possible to get runtimes much closer to C++ runtimes than >> would be expected. >> >> Michael >> ------------------------------ >> *From:* Jon Malkin <[email protected]> >> *Sent:* Tuesday, May 19, 2020 4:58 PM >> *To:* [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> Well, one thought was maybe we could always use the vectorized kll in >> python and make it (relatively) easy to have it work with only 1 dimension. >> It looks like there's still a non-trivial performance hit from that. But >> wow.. I realized I could try something simple like reversing the >> declaration order of single-update vs vector-update in the wrapper class. >> And that dropped it to 35s! >> >> With that, it may be worth exploring a unified wrapper that handles >> single items or vectors. >> >> jon >> >> On Tue, May 19, 2020 at 1:52 PM leerho <[email protected]> wrote: >> >> We had a similar issue in Java trying to use JNI to access C code. Every >> transition across the "boundary" between Java and C took from 10 to 100 >> microseconds. This made the JNI option pretty useless from our >> standpoint. >> >> I don't know python that well, but I could well imagine that there may be >> a similar issue here in moving data between Python and C++. >> >> That being said, compared to brute-force computation of these types of >> queries in Python vs using even these (what we consider slow performing) >> sketches in Python still may be a huge win. >> >> Lee. >> >> >> >> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <[email protected]> wrote: >> >> I tried comparing the performance of the existing floats sketch vs the >> new thing with a single dimension. And then I made a second update method >> that handles a single item rather than creating an array of length 1 each >> time. Otherwise, the scripts were as identical as possible. I fed in 2^25 >> gaussian-distributed values and queried for the mean to force some >> computation on the sketch. I think get_quantile(0.5) vs >> get_quantiles(0.5)[0][0] was the only difference, >> >> Existing kll_floats_sketch: 31s >> kll_floatarray_sketches: 123s >> with single-item update: 80s >> >> Same test in c++: 1.7s (I can get it to 1.4s but that's using a worse >> RNG so this seemed more fair) >> >> I didn't try anything with trying to batch updates, even though in theory >> the new object can support that. This was more a test to see the >> performance impact of using it for all kll sketches. >> >> At some level, if you're already ok taking the speed hit for python vs >> C++ then maybe it doesn't matter. But >2x still seems significant to me. >> >> jon >> >> On Thu, May 14, 2020 at 6:54 PM Michael Himes <[email protected]> >> wrote: >> >> Great, I'll be submitting the pull request shortly. The codebase I'm >> working with doesn't have any of the changes made in the past week or so, >> hopefully that isn't too much of a hassle to merge. >> >> As an aside, my employer encourages us to contribute code to libraries >> like this, so I'm happy to work on additional features for the Python >> interface as needed. >> >> Michael >> ------------------------------ >> *From:* Jon Malkin <[email protected]> >> *Sent:* Thursday, May 14, 2020 6:56 PM >> *To:* [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> We've been polishing things up for a release, so that was one of several >> things that we fixed over the last several days. Thank you for finding it! >> >> Anyway, if you're generally happy with the state of things (and are >> allowed to under any employment terms), I'd encourage you to create pull >> request to merge your changes into the main repo. It doesn't need to be >> perfect as we can always make changes as part of the PR review or >> post-merge. >> >> Thanks, >> jon >> >> >> On Mon, May 11, 2020 at 2:25 PM Michael Himes <[email protected]> >> wrote: >> >> Thanks for taking a look, Jon. >> >> I pushed an update that address 2 & 4. >> >> #3 is actually something I had a question about. I've tested passing >> numpy.nan into the update function, and it doesn't appear to break anything >> (min, max, etc all still work correctly). However, the reported number of >> items per sketch counts the nan entries. Is this the expected behavior, or >> should the get_n() method return a number that does not count the nans it >> has seen? I expected the latter, so I'm worried that numpy's nan is being >> treated differently. >> >> Michael >> ------------------------------ >> *From:* Jon Malkin <[email protected]> >> *Sent:* Monday, May 11, 2020 4:32 PM >> *To:* [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> I didn't look in super close detail, but the code overall looks pretty >> good. Comments are below. >> >> Note that not all of these necessarily need changes or replies. I'm just >> trying to document things we'll want to think about for keeping the library >> general-purpose (and we can always make changes after merging, of course). >> >> 1. I worry the name kll_sketches is confusingly similar to kll_sketch. >> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future >> to operate on an entire vector at a time (vs treating each dimension >> independently) that'd become confusing. I think an inherently vectorized >> version would be a very different beast, but I always worry I'm not being >> imaginative enough. If merging into the Apache codebase, I'd probably wait >> to see what the file looks like with the renaming before a final decision >> on moving to its own file. >> >> 2. What happens if the input to update() has >2 dimensions? If that'd be >> invalid, we should explicitly check and complain. If it'll Do The Right >> Thing by operating on the first 2 dimensions (meaning correct indices) >> that's fine, but otherwise should probably complain. >> >> 3. Can this handle sparse input vectors? Not sure how important that is >> in general, even if your project doesn't require it. kll_sketch will ignore >> NaNs, so those appearing would mean the number of items per sketch can >> already differ. >> >> 4. I'd probably eat the very slightly increased space and go with 32 bits >> for the number of dimensions (aka number of sketches). If trying to look at >> a distribution of values for some machine learning application, it'd be >> easy to overflow 65k dimensions for some tasks. >> >> 5. I imagine you've realized that it's easiest to do unit tests from >> python in this case. That's another advantage of having this live in the >> wrapper. >> >> 6. Finally, that assert issue is already obsolete :). Asserts were >> converted if/throw exceptions late last week. It'll be flagged as a >> conflict in merging, so no worries for now. >> >> Looking good at this point. And as I said, not all of these need changes >> or comments from you. >> >> jon >> >> On Mon, May 11, 2020 at 7:09 AM Michael Himes <[email protected]> >> wrote: >> >> Understood, I went ahead and moved the new class to the kll_wrapper.cpp >> file -- I'll leave it to you to decide if it's better as its own file. >> >> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 >> throws errors regarding the assert calls in kll_sketch_impl.hpp. I added >> an include of assert.h there and then it compiled without issue. It's >> possible that other compilers will also complain about that, so maybe this >> is a good update to the main branch. >> >> Michael >> ------------------------------ >> *From:* Jon Malkin <[email protected]> >> *Sent:* Sunday, May 10, 2020 10:47 PM >> *To:* [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> My only comment without having looked at actual code is that the new >> class would be more appropriate in the python wrapper. Maybe even drop it >> in as it's own file, as that would decrease recompile time a bit when >> debugging (that's pybind's suggestion, anyway). Probably not a huge >> difference with how light these wrappers are. >> >> If this is something that becomes widely used, to where we look at >> pushing it into the base library, we'd look at whether we could share any >> data across sketches. But we're far from that point currently. It'd be nice >> to need to consider that. >> >> jon >> >> On Sun, May 10, 2020, 7:33 PM leerho <[email protected]> wrote: >> >> Michael, this has been a great interchange and certainly will allow us >> to move forward more quickly. >> >> Thank you for working on this on a Mother's Day Sunday! >> >> I'm sure Alex and Jon may have more questions, when they get a chance to >> look at it starting tomorrow. >> >> Cheers, and be safe and well! >> >> Lee. >> >> On Sun, May 10, 2020 at 6:25 PM Michael Himes <[email protected]> >> wrote: >> >> Re: testing, so far I've just done glorified unit tests for uniform and >> normal distributions of varying sizes. I plan to do some timing tests vs >> the existing single-sketch Python class to see how it compares for 1, 10, >> and 100 streams. >> >> 1. That makes sense. One option to allow full Numpy compatibility but >> without requiring a Python user to use Numpy would be to return everything >> as lists, rather than Numpy arrays. Numpy users could then convert those >> lists into arrays, and non-Numpy users would be unaffected (aside from >> needing the pybind11/numpy.h header). Alternatively, some flag could be >> set when instantiating the object that would control whether things are >> returned as lists or arrays, though this still requires the numpy.h header >> file. >> >> 2. I didn't change the kll_sketch code, I only defined a new (wrapper) >> class called kll_sketches, which spawns a user-specified number of >> sketches. Each of those sketches are kll_sketch objects and uses all of >> the existing code for that. For fast execution in Python, the parallel >> sketches must be spawned in C++, but the existing Python object could only >> spawn a single sketch since it wraps the kll_sketch class. Perhaps the >> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp >> file? I suppose you wouldn't need this class if you weren't using Python. >> >> 3. Yes, SerDe is very straight-forward here. I've marked some stuff as >> todo's, and that is one of them -- the plan is to do like you described and >> call the relevant kll_sketch method on each of the sketches and return that >> to Python in a sensible format. For deserialization, it would just iterate >> through them and load them into the kll_sketches object. I don't require >> it for my project, so I didn't bother to wrap that yet -- I'll take a look >> sometime this week after I finish my work for the day, shouldn't take long >> to do. >> >> 4. That makes sense. Does using Numpy complicate that at all? My >> thought is that since under the hood everything is using the existing >> kll_sketch class, it would have full compatibility with the rest of the >> library (once SerDe is added in). >> >> Michael >> ------------------------------ >> *From:* leerho <[email protected]> >> *Sent:* Sunday, May 10, 2020 8:42 PM >> *To:* [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> Thanks for the link to your code. My colleagues, Jon and Alex, will take >> a closer look this next week. They wrote this code so they are much closer >> to it than I. >> >> What you have done so far makes sense for you as you want to get this >> working in the NumPy environment as quickly as possible. As soon as we >> start thinking about incorporating this into our library other concerns >> become important. >> >> 1. Adding API calls is the recommended way to add functionality (like >> NumPy) to a library. We cannot change API calls in a way that is only >> useful with NumPy, because it would seriously impact other users of the >> library that don't need NumPy. If both sets of calls cannot simultaneously >> exist in the same sketch API, then we need to consider other alternatives. >> >> 2. Based on our previous discussions, I didn't envision that you would >> have to change the kll_sketch code itself other than perhaps a "wrapper" >> class that enables vectorized input to a vector of sketches and a >> vectorized get result that creates a vector result from a vector of >> sketches. This would isolate the changes you need for NumPy from the >> sketch itself. This is also much easier to support, maintain and debug. >> >> 3. If you don't change the internals of the sketch then SerDe becomes >> pretty straightforward. I don't know if you need a single serialization >> that represents a full vector of sketches, but if you do, then I would >> just iterate over the individual serdes and figure out how to package it. >> I really don't think you want to have to rewrite this low-level stuff. >> >> 4. Binary compatibility is critically important for us and I think will >> be important for you as well. There are two dimensions of binary >> compatibility: history and language. This means that a kll sketch >> serialized from Java, can be successfully read by C++ and visa versa. >> Similarly, a kll sketch serialized today will be able to be read many years >> from now. Another aspect of this would mean being able to collect, say, >> 100 sketches that were not created using the NumPy version, and being able >> to put them together in a NumPy vector; and visa versa. >> >> I hope all of this make sense to you. >> >> Cheers, >> >> Lee. >> >> >> >> On Sun, May 10, 2020 at 4:21 PM leerho <[email protected]> wrote: >> >> Michael, >> This is great! What testing have you been able to do so far? >> >> >> On Sun, May 10, 2020 at 3:31 PM Michael Himes <[email protected]> >> wrote: >> >> Lee, >> >> Thanks for all of that information, it's quite helpful to get a better >> understanding of things. >> >> I've put the code on Github if you'd like to take a look: >> https://github.com/mdhimes/incubator-datasketches-cpp >> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdhimes%2Fincubator-datasketches-cpp&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168301125&sdata=FJauemLpFxaPhlYPW8hXcSPwR53tR9crhHYWxHOWfsE%3D&reserved=0> >> >> Changes are >> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in >> kll/include/kll_sketch_impl.hpp. This class spawns a specified number of >> sketches. >> - new Python interface functions in python/src/kll_wrapper.cpp >> >> The only new dependency introduced is the pybind11/numpy.h header file. >> The new Numpy-compatible Python classes retain identical functionality to >> the existing classes (with minor changes to method names, e.g., >> get_min_value --> get_min_values), except that I have not yet implemented >> merging or (de)serialization. These would be straight-forward to >> implement, if needed. >> >> Re: characterization tests, I'll take a look at those tests you linked to >> and see about running them, time and compute resources permitting. >> >> Michael >> ------------------------------ >> *From:* leerho <[email protected]> >> *Sent:* Sunday, May 10, 2020 5:32 PM >> *To:* [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> Michael, >> >> Is there a place on GitHub somewhere where I could look at your code so >> far? The reason I ask, is before you do a PR, we would like to determine >> where a contribution such as this should be placed. >> >> Our library is split up among different repositories, determined by >> language and dependencies. This keeps the user downloads smaller and more >> focused. We have two library repos for the core sketch algorithms, one >> for Java and one for C++/Python, where the dependencies are very lean, >> which simplifies integration into other systems. We have separate repos >> for adaptors, which depend on one of the core repos. On the Java side, we >> have separate repos for adaptors for Apache Hive and Apache Pig, as the >> dependencies for each of these are quite large. For C++, we have a >> dedicated repo for the adaptors for PostgreSQL. >> >> Some of our adaptors are hosted with the target system. For example, our >> Druid adaptors were contributed directly into Apache Druid. >> >> I assume your code has dependencies on Python, NumPy and >> DataSketches-cpp. It is not clear to me at the moment whether we should >> create a separate repo for this or have a separate group of directories in >> our cpp repo. >> >> **** >> We have a separate repo for our characterization code, which is not >> formally "released" as an Apache release. It exists because we want others >> to be able to reproduce (or challenge) our claims of speed performance or >> accuracy. It is the one repo where we have all languages and many >> different dependencies. The coding style is not as rigorous or as well >> documented as our repos that do have formal releases. >> >> Characterization testing is distinctly different from Unit Tests, which >> basically checks all the main code paths and makes sure that the program >> works as it should. The key metric is code coverage and Unit Tests should >> be fast as it is run on every check-in of new code. Characterization is >> also different from Integration Testing, which is testing how well the code >> works when integrated into larger systems. >> >> Characterization tests are unique to our kind of library. Because our >> algorithms are probabilistic in nature, in order to verify accuracy or >> speed performance we need to run many thousands of trials to eliminate >> statistical noise in the results. And when the data is large, this can >> take a long time. You can peruse our website for many examples as all the >> plots result from various characterization studies. What appears on the >> website is but a small fraction of all the testing we have done. >> >> There are no "standard" tests as every sketch is different so we have to >> decide what is important to measure for a particular sketch, but the basic >> groups are *speed* and *accuracy*. >> >> For speed there are many possible measurements, but the basic ones are >> update speed, merge speed, Serialization / Deserialization speed, get >> estimate or get result speeds. >> >> For accuracy we want to validate that the sketch is performing within the >> bounds of the theoretical error distribution. We want to measure this >> accuracy in the context of a stand-alone, purely streaming sketch and also >> in the context of merging many sketches together. >> >> We also try to do these same tests comparing the results against other >> alternatives users might have. We have performed these same >> characterizations on other publically available sketches as well as against >> traditional, brute-force approaches to solving the same problem. >> >> For the solution you have developed, we would depend on you to decide >> what properties would be most important to characterize for users of this >> solution. It should be very similar to what you would write in a paper >> describing this solution; you want to convince the reader that this is >> very useful and why. >> >> Since the first sketch you have leveraged is the KLL quantiles sketch, I >> would think you would want some characterizations similar to what we did >> for our studies >> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168311123&sdata=fC1FtRi8ex9h0o58K7KYGPL6i9Dv9xq4PiUDZ5yE91o%3D&reserved=0> >> comparing our older quantiles sketch and the KLL sketch. >> >> **** >> For the Java characterization tests, we have "standardized" on having >> small configuration files which define the key parameters of the test. >> These are simple text files >> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168321118&sdata=1VDEQds1NCgY%2FRt6%2BEYLy3VcZz5uBtKEfpONGJhZnJg%3D&reserved=0> >> of key-value pairs. We don't have any centralized definition of these >> pairs, just that they are human readable and intelligible. They are >> different for each type of sketch. >> >> For the C++ tests, we don't have a collection of config files yet (this >> is one of our TODOs), but the same kind of parameters are set in the code >> itself. >> >> We will likely want to set up a separate directory for your >> characterization tests. >> >> I hope you find this helpful. >> >> Cheers, >> >> Lee. >> >> On Sun, May 10, 2020 at 10:05 AM Michael Himes <[email protected]> >> wrote: >> >> The code is in a good state now. It can take individual values, lists, >> or Numpy arrays as input, and it returns back Numpy arrays. There are some >> additional features, like being able to specify which sketches the user >> wants to, e.g., get quantiles for. >> >> But, I have only done minor testing with uniform and normal >> distributions. I'd like to put it through more extensive testing (and some >> documentation) before releasing it, and it sounds like your >> characterization tests are the way to go -- it's not science if it's not >> reproducible! Is there a standard set of tests for this purpose? If not, >> are there standard tests that have been used for the existing codebase? >> >> Michael >> ------------------------------ >> *From:* leerho <[email protected]> >> *Sent:* Saturday, May 9, 2020 7:21 PM >> *To:* [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> This is great. The first step is to get your project working! Once you >> think you are ready, it would be really useful if you could do some >> characterization testing in the NumPy environment. Characterization tests >> are what we run to fully understand how a sketch performs over a range of >> parameters and using thousands to millions of trials. You can see some of >> the accuracy and speed performance plots of various sketches on our >> website. Sometimes these can take hours to run. We typically use >> synthetic data to drive our characterization tests to make them >> reproducible. >> >> Real data can also be used and one comparison test I would recommend is >> comparing how long it takes to get approximate results using sketches >> versus how long it would take to get exact results using brute force >> methods. The bigger the data set is the better :) >> >> We don't have much experience with NumPy so this will be a new >> environment for us. But before you get too deep into this please get us >> involved. We have been characterizing these streaming algorithms for a >> number of years, and would like to help you. >> >> Cheers, >> >> Lee. >> >> On Sat, May 9, 2020 at 2:18 PM Michael Himes <[email protected]> >> wrote: >> >> I'm not quite sure what being a committer entails, but yeah I'm happy to >> contribute. I can't commit a lot of time to working on it, but with how >> things went for KLL I don't think it will take a lot of time for the other >> sketches if they are formatted in a similar manner. Getting this library >> integrated into numpy/scipy would be awesome, I'm sure I could get some >> others in my field to begin using it. >> >> Michael >> ------------------------------ >> *From:* Lee Rhodes <[email protected]> >> *Sent:* Saturday, May 9, 2020 5:06 PM >> *To:* Michael Himes <[email protected]>; [email protected] >> <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> This is just awesome! Would you be interested in becoming a committer >> on our project? It is not automatic, but we could work with you to bring >> you up to speed on the other sketches in the library. If you could help us >> integrate DataSketches into NumPy and possibly SciPy (not sure if this is >> necessary) it would be a very significant contribution and we would >> definitely want you to be part of our community! >> >> Thanks, >> >> Lee. >> >> On Sat, May 9, 2020 at 1:41 PM Michael Himes <[email protected]> >> wrote: >> >> Hi Lee, >> >> Thanks for the notice, I went ahead and subscribed to the list. >> >> As for Jon's email, this is actually what I have currently implemented! >> Once I finish ironing out a couple improvements, I'm going to move some >> code around to follow the existing coding style, put it on Github, and >> submit a pull request. >> >> Michael >> ------------------------------ >> *From:* Lee Rhodes <[email protected]> >> *Sent:* Saturday, May 9, 2020 4:22 PM >> *To:* Michael Himes <[email protected]> >> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> Hi Michael, >> I don't think you saw this email as I doubt you are subscribed to our >> [email protected] email list. >> >> We would like to have you as part of our larger community, as others >> might also have suggestions on how to move your project forward. >> You can subscribe by sending an empty email to >> [email protected]. >> >> Lee. >> >> ---------- Forwarded message --------- >> From: *Jon Malkin* <[email protected]> >> Date: Thu, May 7, 2020 at 4:11 PM >> Subject: Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> To: <[email protected]> >> Cc: Lee Rhodes <[email protected]>, Edo Liberty < >> [email protected]>, [email protected] <[email protected]> >> >> >> We're using pybind11 to get a C++ interface with python (vs raw C). The >> wrappers themselves are quite thin, but they do have examples of calling >> functions defined in the wrapper as opposed to only the sketch object. >> >> I believe the easiest way to do this will be to define a pretty simple >> C++ object and create a pybind wrapper for it. That object would contain a >> std::vector<kll_sketch>. Then you'd define an update method for your >> custom object that iterates through a numpy array and calls update() on the >> appropriate sketch. You'd also want to define something similar for >> get_quantile() or whatever other methods you need that iterates through >> that vector of sketches and returns the result in a numpy array. >> >> That's a pretty lightweight object. And then you'd use a similar thin >> pybind wrapper around it to make it play nicely with python. Since our C++ >> library is just templates, you'd end up with a free-standing library, with >> no requirement that the base datasketches library be involved. >> >> jon >> >> On Thu, May 7, 2020 at 1:08 PM Michael Himes <[email protected]> >> wrote: >> >> I would be happy to share whatever I come up with (if anything). The >> lack of a Numpy/Scipy implementation is what led me to the DataSketches >> library, it would be very useful to myself and others if it were a part of >> Numpy/Scipy. >> >> For what it's worth, passing in a Numpy array and manipulating it from >> the C++ side is quite easy. On the other hand, figuring out how to spawn m >> sketches and pass the values along to that looks like it'll be more >> challenging, there is a lot of code here and it'll take some time for me to >> familiarize myself with it. >> >> Michael >> ------------------------------ >> *From:* Lee Rhodes <[email protected]> >> *Sent:* Thursday, May 7, 2020 12:00 PM >> *To:* Michael Himes <[email protected]> >> *Cc:* Edo Liberty <[email protected]>; [email protected] < >> [email protected]>; [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> If you do figure out how to do this, it would be great if you could share >> it with us. We would like to extend it to other sketches and submit it as >> an added functionality to NumPy. I have been looking at the NomPy and >> SciPy libraries and have not found anything close to what we have. >> >> Lee. >> >> >> On Thu, May 7, 2020 at 7:08 AM Michael Himes <[email protected]> >> wrote: >> >> Hi Lee, Jon, >> >> Thanks for the information. I tried to vectorize things this morning and >> ran into that exact problem -- since the offsets can differ, it leads to >> slices of different lengths, which wouldn't be possible to store as a >> single Numpy array. >> >> Lee, your understanding of my problem is spot on. n vectors of size m, >> where all m elements of each vector are a float (no NaNs or missing >> values). I am interested in quantiles at rank r for each of the m >> streams. Only 1 sketch will operate simultaneously, saving/loading the >> sketch is not required (though it would be a nice feature), and sketches >> would not need to be merged (no serialization/deserialization). >> >> Not surprisingly, it looks like your original suggestion of handling this >> on the C++ side is the way to go. Once I have time to dive into the code, >> my plan is to write something that implements what you described in the >> earlier email. >> >> Thanks, >> Michael >> ------------------------------ >> *From:* Lee Rhodes <[email protected]> >> *Sent:* Wednesday, May 6, 2020 10:43 PM >> *To:* Michael Himes <[email protected]> >> *Cc:* [email protected] <[email protected]>; Edo >> Liberty <[email protected]>; [email protected] <[email protected]> >> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> Michael, >> >> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will >> not work for another reason and that is for each dimension, choosing >> whether to delete the odd or even values in the compactor must be random >> and independent of the other dimensions. Otherwise you might get unwanted >> correlation effects between the dimensions. >> >> This is another argument that you should have independent compactors for >> each dimension. So you might as well stick with individual sketches for >> each dimension. >> >> Lee. >> >> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]> >> wrote: >> >> Michael, >> >> Allow me to back up for a moment to make sure I understand your problem. >> >> You have a large number of large vectors of the form *V_n = {x_i}:* *n* >> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th >> element, or equivalently, the *i*th dimension. >> >> Assumptions: >> >> - All vectors, *V*, are of the same size *m.* >> - All elements, *x_i*, are valid numbers of the same type. No missing >> values, and if you are using *floats*, this means no *NaN*s. >> >> In aggregate, the *n* vectors represent *m* *independent* distributions >> of values. >> >> Your task is to be able to obtain *m* quantiles at rank *r* in a single >> query. >> >> **** >> To do this, using your idea, would require vectorization of the entire >> sketch and not just the compactors. The inputs are vectors, the result of >> operations such as getQuantile(r), getQuantileUpperBound(r), >> getQuantileLowerBound(r), are also vectors. >> >> This sketch will be a large data structure, which leads to more questions >> ... >> >> - Do you anticipate having many of these vectorized sketches >> operating simultaneously? >> - Is there any requirement to store and later retrieve this sketch? >> - Or, the nearly equivalent question: Do you require merging of these >> sketches (across clusters, for example)? Which also means serialization >> and deserialization. >> >> I am concerned that this vector-quantiles sketch would be limited in the >> sense that it may not be as widely applicable as it could be. >> >> Our experience with real data is that it is ugly with missing values, >> NaN, nulls, etc. Which means we would not be able to vectorize the >> compactor. Each dimension *i* would need a separate independent >> compactor because the compaction times will vary depending on missing >> values or NaNs in the data. >> >> Spacewise, I don't think having separate independent sketches for each >> dimension would be much smaller than vectorizing the entire sketch, because >> the internals of the existing sketch are already quite space efficient >> leveraging compact arrays, etc. >> >> As a first step I would favor figuring out how to access the NumPy data >> structure on the C++ side, having individual sketches for each >> dimension, and doing the iterations updating the sketches in C++. It also >> has the advantage of leveraging code that exists and it would automatically >> be able to leverage any improvements to the sketch code over time. In >> addition, it could be a prototype of how to integrate other sketches into >> the NumPy ecosystem. >> >> A fully vectorized sketch would be a separate implementation and would >> not be able to take advantage of these points. >> >> Lee. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]> >> wrote: >> >> Hi Lee, >> >> I don't think there is a problem with the DataSketches library, just that >> it doesn't support what I am trying to do -- looking in the documentation, >> it only supports streams of ints or floats, and those situations work fine >> for me. Here's what I did: >> - began with the KLL test .py file: >> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py >> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168321118&sdata=w7zrcQlOomiwHBox4fKCJaF327ob2ZCZN6pynYNuMDU%3D&reserved=0> >> - replaced line 30 with kll.update(np.ones(10) * randn()) to have a >> Numpy array of 10 identical values. >> - ran the code >> >> This leads to the following error, as expected: >> TypeError: update(): incompatible function arguments. The following >> argument types are supported: >> 1. (self: datasketches.kll_floats_sketch, item: float) -> None >> >> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, >> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, >> -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424]) >> >> It's not coded to support Numpy arrays, therefore it complains. What I >> would ideally like to have happen in this scenario is it would treat each >> element in the array as a separate stream. Then, later when getting a >> given quantile, it would give 10 values, one for each stream. I don't see >> an easy approach to implementing this on the Python side besides a very >> slow iterative approach, and admittedly my C++ is quite rusty so I haven't >> looked into the codebase to see how I might modify things there to support >> this functionality. >> >> Re: the streaming-quantiles code being easily modified, I believe the >> only necessary changes would be changing the Compactor class to be a >> subclass of numpy.ndarray, rather than list, and implementing methods for >> the list-specific methods that are used, like .append(). Then, it isn't >> necessary to loop over the streams since we can make use of Numpy's >> broadcasting, which will handle the looping in its C++ code, as you >> mentioned. I'll work on this and see if it really is as straight-forward >> as it seems. >> >> If you have any advice on how to use DataSketches for my problem, I'm >> certainly open to that. >> >> Thanks, >> Michael >> ------------------------------ >> *From:* Lee Rhodes <[email protected]> >> *Sent:* Wednesday, May 6, 2020 4:37 PM >> *To:* Michael Himes <[email protected]>; [email protected] >> <[email protected]> >> *Cc:* Edo Liberty <[email protected]>; [email protected] < >> [email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> Michael, >> >> Thank you for considering the DataSketches library. I am adding this >> thread to our [email protected] so that our whole team can >> contribute to finding a solution for you. >> >> WRT the error you experienced, please help us help you by sharing with us >> what the exact error was. >> >> We are about to release a major upgrade to the DataSketches C++/Python >> product in the next few weeks. We have fixed a number of stability issues >> and bugs, which may solve the problem. Nonetheless, we want to work with >> you to get your problem solved. >> >> Updating 1e5 sketches in a system is not a problem in Java or C++. We >> have real-time systems today that generate and process over 1e9 sketches >> every day. Unfortunately our experience tells us that looping in Python >> code will be 10 to 100 times slower than Java or C++. This is because the >> code would have to switch from Python to C++ for every vector element. >> >> By comparison, the streaming-quantiles code could be easily modified to >> use Numpy arrays and operate on vectors. >> >> >> I would like to understand more about what you have in mind that would be >> "easily modified". >> >> NumPy achieves its speed performance by doing all of the matrix >> operations in pre-compiled C++ code. To achieve best performance, we would >> want to read and loop through the NumPy data structure on the C++ side >> leveraging the C++ DataSketches library directly. I am not sure what would >> be involved to actually accomplish that. >> >> But first we need to get your Python + NumPy code working correctly with >> our library so we can find out what its actual performance is. >> >> Cheers, >> >> Lee. >> >> >> >> >> >> On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]> >> wrote: >> >> Hi Edo, Lee, >> >> Thanks for the prompt response. I looked at the datasketches library, >> and while it seems to have a lot more features, it looks like it'll be a >> lot more difficult to get it to work for my desired use case. >> >> My problem is that I need quantiles for each element of a vector (length >> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the >> order of 1e6 -- 1e8). I tried using datasketches's KLL with Numpy arrays, >> but it throws an error, so it doesn't seem like datasketches handles this >> situation currently. >> >> To use datasketches, I think I would need to instantiate 1 object per >> vector element, and I suspect this will slow things down considerably due >> to iterating over the objects when each vector is processed. By >> comparison, the streaming-quantiles code could be easily modified to use >> Numpy arrays and operate on vectors. I ran a few unit tests on both codes >> and found equivalent behavior, as expected. >> >> Do you have any recommendation(s) for this situation? Are there known >> limitations of the streaming-quantiles code that would cause issues for my >> use case? Are the other methods offered in datasketches 'better' than the >> KLL implemented in streaming-quantiles? I'm quite out of my area of >> expertise, so I appreciate any advice you can offer, and I will of course >> acknowledge it in the publication. >> >> Best, >> Michael >> >> ------------------------------ >> *From:* Edo Liberty <[email protected]> >> *Sent:* Tuesday, May 5, 2020 8:09 PM >> *To:* Lee Rhodes <[email protected]>; Michael Himes < >> [email protected]> >> *Cc:* [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> +Lee >> >> Hi Michael, Thanks for reaching out. >> While you can certainly do that, I recommend using the python-Binded >> datasketches library. It will be more robust, faster, and bug free than my >> code :) >> >> On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]> >> wrote: >> >> Hi Edo, >> >> I'm currently working on a Python package for >> machine-learning-accelerated exoplanet modeling. It is free and open >> source (see here if you're curious https://github.com/exosports/HOMER >> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168331111&sdata=3AeG70Y7qlS1qoSTZ7U%2FHDJ3N4yX7%2FUJkK2ospj0nmw%3D&reserved=0>), >> and it's meant purely for reproducible academic research. >> >> I'm adding some new features to the software, and one of them requires >> computing quantiles for a data set that cannot fit into memory. After >> searching around for different methods to do this, your KLL method seemed >> to be a good option in terms of speed and space requirements. >> >> Rather than reinvent the wheel and code my own implementation of the >> method from scratch, I was wondering if you'd be willing to allow me to use >> your code? I don't see a license, so I wanted to make sure you're okay >> with this. I could implement it as a submodule within my repo, or I could >> only include the kll.py file and add some additional comments pointing to >> your repo and such, whichever you prefer. >> >> Best, >> Michael >> >> -- >> From my cell phone. >> >>
