Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael Himes Tue, 19 May 2020 15:33:19 -0700

Great tests (especially with the ordering), Jon!

I did some scaling tests for dimensionality (1, 10, and 100), and this is where 
I think the Numpy version shows its benefits.  I performed a test similar to 
your setup:
- each sketch has k = 160 (unsure what you used for this value, if it matters)
- 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
- get_quantiles(0.5)


d=1    -- 84 s (this is the 123 s case you ran)
d=10   -- 88 s
d=100  --  294 s
d=1000 -- 2298 s (did this one for fun, but there is a lot of variability in 
runtime)

Note that I did not use a single-value method, just the Numpy version.  Also, I 
checked the compute cost of the Python loop, and it's about 1 second, so most 
of that ~80 seconds is the communication between Python and C++.  The scaling 
relation looks to be better than linear, but there needs to be a few more tests 
here to really determine that.

But, as Lee pointed out, there is non-negligible overhead from crossing the 
bridge between Python and C++.  It's small, but when doing it 2^25 times it 
adds up.  The Numpy implementation allows you to cross that bridge much less 
often, albeit at the cost of some extra time programming that part.  If I set 
up a queue that holds 2^5 values and then updates it, it's quite a bit better.  
Here are the results for the same dimensions as before:

d=1   -- 8 s
d=10  -- 31 s
d=100 -- 257 s

So, even with a small queue of 32 values, we see that a single sketch using 
kll_sketches is faster than a kll_sketch by a factor of 2-3.  And with the 
batch set to 2^7 values (this is how I use it in my project):
d=1   -- 4.2 s
d=10  -- 27 s
d=100 -- 251 s

The speed gain doesn't seem to scale with dimensionality, but I think that has 
more to do with the compute overhead of generating the data since Numpy tends 
to be faster when working in 1D vs multiple dimensions.  But we can see that 
it's possible to get runtimes much closer to C++ runtimes than would be 
expected.

Michael
________________________________
From: Jon Malkin <[email protected]>
Sent: Tuesday, May 19, 2020 4:58 PM
To: [email protected] <[email protected]>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

Well, one thought was maybe we could always use the vectorized kll in python 
and make it (relatively) easy to have it work with only 1 dimension. It looks 
like there's still a non-trivial performance hit from that. But wow.. I 
realized I could try something simple like reversing the declaration order of 
single-update vs vector-update in the wrapper class. And that dropped it to 35s!

With that, it may be worth exploring a unified wrapper that handles single 
items or vectors.

  jon

On Tue, May 19, 2020 at 1:52 PM leerho 
<[email protected]<mailto:[email protected]>> wrote:
We had a similar issue in Java trying to use JNI to access C code.  Every 
transition across the "boundary" between Java and C took from 10 to 100 
microseconds.  This made the JNI option pretty useless from our standpoint.

I don't know python that well, but I could well imagine that there may be a 
similar issue here in moving data between Python and C++.

That being said, compared to brute-force computation of these types of queries 
in Python vs using even these (what we consider slow performing) sketches in 
Python still may be a huge win.

Lee.



On Tue, May 19, 2020 at 1:28 PM Jon Malkin 
<[email protected]<mailto:[email protected]>> wrote:
I tried comparing the performance of the existing floats sketch vs the new 
thing with a single dimension. And then I made a second update method that 
handles a single item rather than creating an array of length 1 each time. 
Otherwise, the scripts were as identical as possible. I fed in 2^25 
gaussian-distributed values and queried for the mean to force some computation 
on the sketch. I think get_quantile(0.5) vs get_quantiles(0.5)[0][0] was the 
only difference,

Existing kll_floats_sketch: 31s
kll_floatarray_sketches: 123s
with single-item update: 80s

Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse RNG so 
this seemed more fair)

I didn't try anything with trying to batch updates, even though in theory the 
new object can support that. This was more a test to see the performance impact 
of using it for all kll sketches.

At some level, if you're already ok taking the speed hit for python vs C++ then 
maybe it doesn't matter. But >2x still seems significant to me.

  jon

On Thu, May 14, 2020 at 6:54 PM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Great, I'll be submitting the pull request shortly.  The codebase I'm working 
with doesn't have any of the changes made in the past week or so, hopefully 
that isn't too much of a hassle to merge.

As an aside, my employer encourages us to contribute code to libraries like 
this, so I'm happy to work on additional features for the Python interface as 
needed.

Michael
________________________________
From: Jon Malkin <[email protected]<mailto:[email protected]>>
Sent: Thursday, May 14, 2020 6:56 PM
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

We've been polishing things up for a release, so that was one of several things 
that we fixed over the last several days. Thank you for finding it!

Anyway, if you're generally happy with the state of things (and are allowed to 
under any employment terms), I'd encourage you to create pull request to merge 
your changes into the main repo. It doesn't need to be perfect as we can always 
make changes as part of the PR review or post-merge.

Thanks,
  jon


On Mon, May 11, 2020 at 2:25 PM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Thanks for taking a look, Jon.

I pushed an update that address 2 & 4.

#3 is actually something I had a question about. I've tested passing numpy.nan 
into the update function, and it doesn't appear to break anything (min, max, 
etc all still work correctly).  However, the reported number of items per 
sketch counts the nan entries.  Is this the expected behavior, or should the 
get_n() method return a number that does not count the nans it has seen?  I 
expected the latter, so I'm worried that numpy's nan is being treated 
differently.

Michael
________________________________
From: Jon Malkin <[email protected]<mailto:[email protected]>>
Sent: Monday, May 11, 2020 4:32 PM
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

I didn't look in super close detail, but the code overall looks pretty good. 
Comments are below.

Note that not all of these necessarily need changes or replies. I'm just trying 
to document things we'll want to think about for keeping the library 
general-purpose (and we can always make changes after merging, of course).

1. I worry the name kll_sketches is confusingly similar to kll_sketch. Maybe 
vector_kll_sketches? But if there's a way to extend KLL in the future to 
operate on an entire vector at a time (vs treating each dimension 
independently) that'd become confusing. I think an inherently vectorized 
version would be a very different beast, but I always worry I'm not being 
imaginative enough. If merging into the Apache codebase, I'd probably wait to 
see what the file looks like with the renaming before a final decision on 
moving to its own file.

2. What happens if the input to update() has >2 dimensions? If that'd be 
invalid, we should explicitly check and complain. If it'll Do The Right Thing 
by operating on the first 2 dimensions (meaning correct indices) that's fine, 
but otherwise should probably complain.

3. Can this handle sparse input vectors? Not sure how important that is in 
general, even if your project doesn't require it. kll_sketch will ignore NaNs, 
so those appearing would mean the number of items per sketch can already differ.

4. I'd probably eat the very slightly increased space and go with 32 bits for 
the number of dimensions (aka number of sketches). If trying to look at a 
distribution of values for some machine learning application, it'd be easy to 
overflow 65k dimensions for some tasks.

5. I imagine you've realized that it's easiest to do unit tests from python in 
this case. That's another advantage of having this live in the wrapper.

6. Finally, that assert issue is already obsolete :). Asserts were converted 
if/throw exceptions late last week. It'll be flagged as a conflict in merging, 
so no worries for now.

Looking good at this point. And as I said, not all of these need changes or 
comments from you.

  jon

On Mon, May 11, 2020 at 7:09 AM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Understood, I went ahead and moved the new class to the kll_wrapper.cpp file -- 
I'll leave it to you to decide if it's better as its own file.

Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0 throws 
errors regarding the assert calls in kll_sketch_impl.hpp.  I added an include 
of assert.h there and then it compiled without issue.  It's possible that other 
compilers will also complain about that, so maybe this is a good update to the 
main branch.

Michael
________________________________
From: Jon Malkin <[email protected]<mailto:[email protected]>>
Sent: Sunday, May 10, 2020 10:47 PM
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

My only comment without having looked at actual code is that the new class 
would be more appropriate in the python wrapper. Maybe even drop it in as it's 
own file, as that would decrease recompile time a bit when debugging (that's 
pybind's suggestion, anyway). Probably not a huge difference with how light 
these wrappers are.

If this is something that becomes widely used, to where we look at pushing it 
into the base library, we'd look at whether we could share any data across 
sketches. But we're far from that point currently. It'd be nice to need to 
consider that.

  jon

On Sun, May 10, 2020, 7:33 PM leerho 
<[email protected]<mailto:[email protected]>> wrote:
Michael,  this has been a great interchange and certainly will allow us to move 
forward more quickly.

Thank you for working on this on a Mother's Day Sunday!

I'm sure Alex and Jon may have more questions, when they get a chance to look 
at it starting tomorrow.

Cheers, and be safe and well!

Lee.

On Sun, May 10, 2020 at 6:25 PM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Re: testing, so far I've just done glorified unit tests for uniform and normal 
distributions of varying sizes.  I plan to do some timing tests vs the existing 
single-sketch Python class to see how it compares for 1, 10, and 100 streams.

1. That makes sense.  One option to allow full Numpy compatibility but without 
requiring a Python user to use Numpy would be to return everything as lists, 
rather than Numpy arrays.  Numpy users could then convert those lists into 
arrays, and non-Numpy users would be unaffected (aside from needing the 
pybind11/numpy.h header).  Alternatively, some flag could be set when 
instantiating the object that would control whether things are returned as 
lists or arrays, though this still requires the numpy.h header file.

2. I didn't change the kll_sketch code, I only defined a new (wrapper) class 
called kll_sketches, which spawns a user-specified number of sketches.  Each of 
those sketches are kll_sketch objects and uses all of the existing code for 
that.  For fast execution in Python, the parallel sketches must be spawned in 
C++, but the existing Python object could only spawn a single sketch since it 
wraps the kll_sketch class.  Perhaps the kll_sketches class would be better 
placed in the python/src/kll_wrapper.cpp file?  I suppose you wouldn't need 
this class if you weren't using Python.

3. Yes, SerDe is very straight-forward here.  I've marked some stuff as todo's, 
and that is one of them -- the plan is to do like you described and call the 
relevant kll_sketch method on each of the sketches and return that to Python in 
a sensible format.  For deserialization, it would just iterate through them and 
load them into the kll_sketches object.  I don't require it for my project, so 
I didn't bother to wrap that yet -- I'll take a look sometime this week after I 
finish my work for the day, shouldn't take long to do.

4. That makes sense.  Does using Numpy complicate that at all?  My thought is 
that since under the hood everything is using the existing kll_sketch class, it 
would have full compatibility with the rest of the library (once SerDe is added 
in).

Michael
________________________________
From: leerho <[email protected]<mailto:[email protected]>>
Sent: Sunday, May 10, 2020 8:42 PM
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

Thanks for the link to your code.  My colleagues, Jon and Alex, will take a 
closer look this next week.  They wrote this code so they are much closer to it 
than I.

What you have done so far makes sense for you as you want to get this working 
in the NumPy environment as quickly as possible.  As soon as we start thinking 
about incorporating this into our library other concerns become important.

1. Adding API calls is the recommended way to add functionality (like NumPy) to 
a library.  We cannot change API calls in a way that is only useful with NumPy, 
because it would seriously impact other users of the library that don't need 
NumPy.  If both sets of calls cannot simultaneously exist in the same sketch 
API, then we need to consider other alternatives.

2.  Based on our previous discussions, I didn't envision that you would have to 
change the kll_sketch code itself other than perhaps a "wrapper" class that 
enables vectorized input to a vector of sketches and a vectorized get result 
that creates a vector result from a vector of sketches.  This would isolate the 
changes you need for NumPy from the sketch itself.  This is also much easier to 
support, maintain and debug.

3. If you don't change the internals of the sketch then SerDe becomes pretty 
straightforward. I don't know if you need a single serialization that 
represents a full vector of sketches,  but if you do, then I would just iterate 
over the individual serdes and figure out how to package it.  I really don't 
think you want to have to rewrite this low-level stuff.

4. Binary compatibility is critically important for us and I think will be 
important for you as well.  There are two dimensions of binary compatibility: 
history and language.  This means that a kll sketch serialized from Java, can 
be successfully read by C++ and visa versa.  Similarly, a kll sketch serialized 
today will be able to be read many years from now.     Another aspect of this 
would mean being able to collect, say, 100 sketches that were not created using 
the NumPy version, and being able to put them together in a NumPy vector; and 
visa versa.

I hope all of this make sense to you.

Cheers,

Lee.



On Sun, May 10, 2020 at 4:21 PM leerho 
<[email protected]<mailto:[email protected]>> wrote:
Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Lee,

Thanks for all of that information, it's quite helpful to get a better 
understanding of things.

I've put the code on Github if you'd like to take a look: 
https://github.com/mdhimes/incubator-datasketches-cpp<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdhimes%2Fincubator-datasketches-cpp&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168301125&sdata=FJauemLpFxaPhlYPW8hXcSPwR53tR9crhHYWxHOWfsE%3D&reserved=0>

Changes are
- new class in kll/include/kll_sketch.hpp, w/ associated constructor in 
kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of 
sketches.
- new Python interface functions in python/src/kll_wrapper.cpp

The only new dependency introduced is the pybind11/numpy.h header file.  The 
new Numpy-compatible Python classes retain identical functionality to the 
existing classes (with minor changes to method names, e.g., get_min_value --> 
get_min_values), except that I have not yet implemented merging or 
(de)serialization.  These would be straight-forward to implement, if needed.

Re: characterization tests, I'll take a look at those tests you linked to and 
see about running them, time and compute resources permitting.

Michael
________________________________
From: leerho <[email protected]<mailto:[email protected]>>
Sent: Sunday, May 10, 2020 5:32 PM
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

Michael,

Is there a place on GitHub somewhere where I could look at your code so far?  
The reason I ask, is before you do a PR, we would like to determine where a 
contribution such as this should be placed.

Our library is split up among different repositories, determined by language 
and dependencies.  This keeps the user downloads smaller and more focused.   We 
have two library repos for the core sketch algorithms, one for Java and one for 
C++/Python, where the dependencies are very lean, which simplifies integration 
into other systems.  We have separate repos for adaptors, which depend on one 
of the core repos. On the Java side, we have separate repos for adaptors for 
Apache Hive and Apache Pig, as the dependencies for each of these are quite 
large.  For C++, we have a dedicated repo for the adaptors for PostgreSQL.

Some of our adaptors are hosted with the target system.  For example, our Druid 
adaptors were contributed directly into Apache Druid.

I assume your code has dependencies on Python, NumPy and DataSketches-cpp. It 
is not clear to me at the moment whether we should create a separate repo for 
this or have a separate group of directories in our cpp repo.

****
We have a separate repo for our characterization code, which is not formally 
"released" as an Apache release.  It exists because we want others to be able 
to reproduce (or challenge) our claims of speed performance or accuracy.  It is 
the one repo where we have all languages and many different dependencies.  The 
coding style is not as rigorous or as well documented as our repos that do have 
formal releases.

Characterization testing is distinctly different from Unit Tests, which 
basically checks all the main code paths and makes sure that the program works 
as it should.  The key metric is code coverage and Unit Tests should be fast as 
it is run on every check-in of new code.  Characterization is also different 
from Integration Testing, which is testing how well the code works when 
integrated into larger systems.

Characterization tests are unique to our kind of library. Because our 
algorithms are probabilistic in nature, in order to verify accuracy or speed 
performance we need to run many thousands of trials to eliminate statistical 
noise in the results.  And when the data is large, this can take a long time.  
You can peruse our website for many examples as all the plots result from 
various characterization studies.  What appears on the website is but a small 
fraction of all the testing we have done.

There are no "standard" tests as every sketch is different so we have to decide 
what is important to measure for a particular sketch, but the basic groups are 
speed and accuracy.

For speed there are many possible measurements, but the basic ones are update 
speed, merge speed, Serialization / Deserialization speed, get estimate or get 
result speeds.

For accuracy we want to validate that the sketch is performing within the 
bounds of the theoretical error distribution.  We want to measure this accuracy 
in the context of a stand-alone, purely streaming sketch and also in the 
context of merging many sketches together.

We also try to do these same tests comparing the results against other 
alternatives users might have.  We have performed these same characterizations 
on other publically available sketches as well as against traditional, 
brute-force approaches to solving the same problem.

For the solution you have developed, we would depend on you to decide what 
properties would be most important to characterize for users of this solution.  
It should be very similar to what you would write in a paper describing this 
solution;  you want to convince the reader that this is very useful and why.

Since the first sketch you have leveraged is the KLL quantiles sketch, I would 
think you would want some characterizations similar to what we did for our 
studies<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168311123&sdata=fC1FtRi8ex9h0o58K7KYGPL6i9Dv9xq4PiUDZ5yE91o%3D&reserved=0>
 comparing our older quantiles sketch and the KLL sketch.

****
For the Java characterization tests, we have "standardized" on having small 
configuration files which define the key parameters of the test.  These are 
simple text 
files<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168321118&sdata=1VDEQds1NCgY%2FRt6%2BEYLy3VcZz5uBtKEfpONGJhZnJg%3D&reserved=0>
 of key-value pairs.  We don't have any centralized definition of these pairs, 
just that they are human readable and intelligible.  They are different for 
each type of sketch.

For the C++ tests, we don't have a collection of config files yet (this is one 
of our TODOs), but the same kind of parameters are set in the code itself.

We will likely want to set up a separate directory for your characterization 
tests.

I hope you find this helpful.

Cheers,

Lee.

On Sun, May 10, 2020 at 10:05 AM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
The code is in a good state now.  It can take individual values, lists, or 
Numpy arrays as input, and it returns back Numpy arrays.  There are some 
additional features, like being able to specify which sketches the user wants 
to, e.g., get quantiles for.

But, I have only done minor testing with uniform and normal distributions.  I'd 
like to put it through more extensive testing (and some documentation) before 
releasing it, and it sounds like your characterization tests are the way to go 
-- it's not science if it's not reproducible!  Is there a standard set of tests 
for this purpose?  If not, are there standard tests that have been used for the 
existing codebase?

Michael
________________________________
From: leerho <[email protected]<mailto:[email protected]>>
Sent: Saturday, May 9, 2020 7:21 PM
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

This is great.  The first step is to get your project working!  Once you think 
you are ready, it would be really useful if you could do some characterization 
testing in the NumPy environment. Characterization tests are what we run to 
fully understand how a sketch performs over a range of parameters and using 
thousands to millions of trials.  You can see some of the accuracy and speed 
performance plots of various sketches on our website.  Sometimes these can take 
hours to run.  We typically use synthetic data to drive our characterization 
tests to make them reproducible.

Real data can also be used and one comparison test I would recommend is 
comparing how long it takes to get approximate results using sketches versus 
how long it would take to get exact results using brute force methods.  The 
bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment for 
us.  But before you get too deep into this please get us involved.  We have 
been characterizing these streaming algorithms for a number of years, and would 
like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
I'm not quite sure what being a committer entails, but yeah I'm happy to 
contribute.  I can't commit a lot of time to working on it, but with how things 
went for KLL I don't think it will take a lot of time for the other sketches if 
they are formatted in a similar manner.  Getting this library integrated into 
numpy/scipy would be awesome, I'm sure I could get some others in my field to 
begin using it.

Michael
________________________________
From: Lee Rhodes <[email protected]<mailto:[email protected]>>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

This is just awesome!   Would you be interested in becoming a committer on our 
project?  It is not automatic, but we could work with you to bring you up to 
speed on the other sketches in the library.  If you could help us integrate 
DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it 
would be a very significant contribution and we would definitely want you to be 
part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I 
finish ironing out a couple improvements, I'm going to move some code around to 
follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <[email protected]<mailto:[email protected]>>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <[email protected]<mailto:[email protected]>>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free 
open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our 
[email protected]<mailto:[email protected]> email list.

We would like to have you as part of our larger community, as others might also 
have suggestions on how to move your project forward.
You can subscribe by sending an empty email to 
[email protected]<mailto:[email protected]>.

Lee.

---------- Forwarded message ---------
From: Jon Malkin <[email protected]<mailto:[email protected]>>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software
To: <[email protected]<mailto:[email protected]>>
Cc: Lee Rhodes <[email protected]<mailto:[email protected]>>, Edo 
Liberty <[email protected]<mailto:[email protected]>>, 
[email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>


We're using pybind11 to get a C++ interface with python (vs raw C). The 
wrappers themselves are quite thin, but they do have examples of calling 
functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ 
object and create a pybind wrapper for it.  That object would contain a 
std::vector<kll_sketch>.  Then you'd define an update method for your custom 
object that iterates through a numpy array and calls update() on the 
appropriate sketch. You'd also want to define something similar for 
get_quantile() or whatever other methods you need that iterates through that 
vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind 
wrapper around it to make it play nicely with python. Since our C++ library is 
just templates, you'd end up with a free-standing library, with no requirement 
that the base datasketches library be involved.

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a 
Numpy/Scipy implementation is what led me to the DataSketches library, it would 
be very useful to myself and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ 
side is quite easy.  On the other hand, figuring out how to spawn m sketches 
and pass the values along to that looks like it'll be more challenging, there 
is a lot of code here and it'll take some time for me to familiarize myself 
with it.

Michael
________________________________
From: Lee Rhodes <[email protected]<mailto:[email protected]>>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <[email protected]<mailto:[email protected]>>
Cc: Edo Liberty <[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

If you do figure out how to do this, it would be great if you could share it 
with us.  We would like to extend  it to other sketches and submit it as an 
added functionality to NumPy.  I have been looking at the NomPy and SciPy 
libraries and have not found anything close to what we have.

Lee.


On Thu, May 7, 2020 at 7:08 AM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran 
into that exact problem -- since the offsets can differ, it leads to slices of 
different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where 
all m elements of each vector are a float (no NaNs or missing values).  I am 
interested in quantiles at rank r for each of the m streams.  Only 1 sketch 
will operate simultaneously, saving/loading the sketch is not required (though 
it would be a nice feature), and sketches would not need to be merged (no 
serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on 
the C++ side is the way to go.  Once I have time to dive into the code, my plan 
is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <[email protected]<mailto:[email protected]>>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>; Edo Liberty 
<[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>

Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work 
for another reason and that is for each dimension, choosing whether to delete 
the odd or even values in the compactor must be random and independent of the 
other dimensions.  Otherwise you might get unwanted correlation effects between 
the dimensions.

This is another argument that you should have independent compactors for each 
dimension.  So you might as well stick with individual sketches for each 
dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes 
<[email protected]<mailto:[email protected]>> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of 
size m, where x is a number and x_i is the ith element, or equivalently, the 
ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, 
and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.

****
To do this, using your idea, would require vectorization of the entire sketch 
and not just the compactors.  The inputs are vectors, the result of operations 
such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are 
also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating 
simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these 
sketches (across clusters, for example)?  Which also means serialization and 
deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense 
that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly with missing values, NaN, 
nulls, etc.  Which means we would not be able to vectorize the compactor.  Each 
dimension i would need a separate independent compactor because the compaction 
times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each 
dimension would be much smaller than vectorizing the entire sketch, because the 
internals of the existing sketch are already quite space efficient leveraging 
compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data 
structure on the C++ side, having individual sketches for each dimension, and 
doing the iterations updating the sketches in C++.   It also has the advantage 
of leveraging code that exists and it would automatically be able to leverage 
any improvements to the sketch code over time.  In addition, it could be a 
prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be 
able to take advantage of these points.

Lee.

























On Wed, May 6, 2020 at 2:47 PM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it 
doesn't support what I am trying to do -- looking in the documentation, it only 
supports streams of ints or floats, and those situations work fine for me.  
Here's what I did:
- began with the KLL test .py file: 
https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168321118&sdata=w7zrcQlOomiwHBox4fKCJaF327ob2ZCZN6pynYNuMDU%3D&reserved=0>
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy 
array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument 
types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, 
array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, therefore it complains.  What I would 
ideally like to have happen in this scenario is it would treat each element in 
the array as a separate stream.  Then, later when getting a given quantile, it 
would give 10 values, one for each stream.  I don't see an easy approach to 
implementing this on the Python side besides a very slow iterative approach, 
and admittedly my C++ is quite rusty so I haven't looked into the codebase to 
see how I might modify things there to support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only 
necessary changes would be changing the Compactor class to be a subclass of 
numpy.ndarray, rather than list, and implementing methods for the list-specific 
methods that are used, like .append().  Then, it isn't necessary to loop over 
the streams since we can make use of Numpy's broadcasting, which will handle 
the looping in its C++ code, as you mentioned.  I'll work on this and see if it 
really is as straight-forward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly 
open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <[email protected]<mailto:[email protected]>>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Cc: Edo Liberty <[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread 
to our [email protected]<mailto:[email protected]> so that 
our whole team can contribute to finding a solution for you.

WRT the error you experienced, please help us help you by sharing with us what 
the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product 
in the next few weeks.  We have fixed a number of stability issues and bugs, 
which may solve the problem.  Nonetheless, we want to work with you to get your 
problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have 
real-time systems today that generate and process over 1e9 sketches every day.  
Unfortunately our experience tells us that looping in Python code will be 10 to 
100 times slower than Java or C++.  This is because the code would have to 
switch from Python to C++ for every vector element.

By comparison, the streaming-quantiles code could be easily modified to use 
Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be 
"easily modified".

NumPy achieves its speed performance by doing all of the matrix operations in 
pre-compiled C++ code.  To achieve best performance, we would want to read and 
loop through the NumPy data structure on the C++ side leveraging the C++ 
DataSketches library directly.  I am not sure what would be involved to 
actually accomplish that.

But first we need to get your Python + NumPy code working correctly with our 
library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and 
while it seems to have a lot more features, it looks like it'll be a lot more 
difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the 
order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 
1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an 
error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector 
element, and I suspect this will slow things down considerably due to iterating 
over the objects when each vector is processed.  By comparison, the 
streaming-quantiles code could be easily modified to use Numpy arrays and 
operate on vectors.  I ran a few unit tests on both codes and found equivalent 
behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known 
limitations of the streaming-quantiles code that would cause issues for my use 
case?  Are the other methods offered in datasketches 'better' than the KLL 
implemented in streaming-quantiles?  I'm quite out of my area of expertise, so 
I appreciate any advice you can offer, and I will of course acknowledge it in 
the publication.

Best,
Michael

________________________________
From: Edo Liberty <[email protected]<mailto:[email protected]>>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <[email protected]<mailto:[email protected]>>; 
Michael Himes <[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the python-Binded 
datasketches library. It will be more robust, faster, and bug free than my code 
:)

On Tue, May 5, 2020 at 14:11 Michael Himes 
<[email protected]<mailto:[email protected]>> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated 
exoplanet modeling.  It is free and open source (see here if you're curious 
https://github.com/exosports/HOMER<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168331111&sdata=3AeG70Y7qlS1qoSTZ7U%2FHDJ3N4yX7%2FUJkK2ospj0nmw%3D&reserved=0>),
 and it's meant purely for reproducible academic research.

I'm adding some new features to the software, and one of them requires 
computing quantiles for a data set that cannot fit into memory.  After 
searching around for different methods to do this, your KLL method seemed to be 
a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method 
from scratch, I was wondering if you'd be willing to allow me to use your code? 
 I don't see a license, so I wanted to make sure you're okay with this.  I 
could implement it as a submodule within my repo, or I could only include the 
kll.py file and add some additional comments pointing to your repo and such, 
whichever you prefer.

Best,
Michael
--
>From my cell phone.

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Reply via email to