Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Michael Himes Sat, 09 May 2020 14:18:17 -0700

I'm not quite sure what being a committer entails, but yeah I'm happy to 
contribute.  I can't commit a lot of time to working on it, but with how things 
went for KLL I don't think it will take a lot of time for the other sketches if 
they are formatted in a similar manner.  Getting this library integrated into 
numpy/scipy would be awesome; I'm sure I could get some others in my field to 
begin using it.

Michael
________________________________
From: Lee Rhodes <[email protected]>
Sent: Saturday, May 9, 2020 5:06 PM
To: Michael Himes <[email protected]>; [email protected] 
<[email protected]>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

This is just awesome!   Would you be interested in becoming a committer on our 
project?  It is not automatic, but we could work with you to bring you up to 
speed on the other sketches in the library.  If you could help us integrate 
DataSketches into NumPy and possibly SciPy (not sure if this is necessary) it 
would be a very significant contribution and we would definitely want you to be 
part of our community!

Thanks,

Lee.

On Sat, May 9, 2020 at 1:41 PM Michael Himes <[email protected]> wrote:
Hi Lee,

Thanks for the notice, I went ahead and subscribed to the list.

As for Jon's email, this is actually what I have currently implemented!  Once I 
finish ironing out a couple improvements, I'm going to move some code around to 
follow the existing coding style, put it on Github, and submit a pull request.

Michael
________________________________
From: Lee Rhodes <[email protected]>
Sent: Saturday, May 9, 2020 4:22 PM
To: Michael Himes <[email protected]>
Subject: Fwd: Permission to use KLL streaming-quantiles code in free 
open-source academic software

Hi Michael,
I don't think you saw this email as I doubt you are subscribed to our 
[email protected] email list.

We would like to have you as part of our larger community, as others might also 
have suggestions on how to move your project forward.
You can subscribe by sending an empty email to 
[email protected].

Lee.

---------- Forwarded message ---------
From: Jon Malkin <[email protected]>
Date: Thu, May 7, 2020 at 4:11 PM
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software
To: <[email protected]>
Cc: Lee Rhodes <[email protected]>, Edo Liberty <[email protected]>, 
[email protected] <[email protected]>

We're using pybind11 to get a C++ interface with python (vs raw C). The 
wrappers themselves are quite thin, but they do have examples of calling 
functions defined in the wrapper as opposed to only the sketch object.

I believe the easiest way to do this will be to define a pretty simple C++ 
object and create a pybind wrapper for it.  That object would contain a 
std::vector<kll_sketch>.  Then you'd define an update method for your custom 
object that iterates through a numpy array and calls update() on the 
appropriate sketch. You'd also want to define something similar for 
get_quantile() or whatever other methods you need that iterates through that 
vector of sketches and returns the result in a numpy array.

That's a pretty lightweight object. And then you'd use a similar thin pybind 
wrapper around it to make it play nicely with python. Since our C++ library is 
just templates, you'd end up with a free-standing library, with no requirement 
that the base datasketches library be involved.
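The shape of the object described above can be sketched in Python. This is only an illustration of the interface (one sketch per dimension, an update that fans a vector out, a query that gathers one result per dimension), not the pybind11/C++ code discussed in the thread. `ExactSketch` is a hypothetical stand-in that stores every value and is not a KLL sketch, `VectorOfSketches` and `get_quantiles` are made-up names, the rank-to-index rule is an assumption, and plain lists stand in for NumPy arrays.

```python
import bisect


class ExactSketch:
    """Hypothetical stand-in for a quantiles sketch: keeps every value, sorted."""

    def __init__(self):
        self._values = []

    def update(self, x):
        bisect.insort(self._values, x)

    def get_quantile(self, rank):
        if not self._values:
            raise ValueError("sketch is empty")
        # Assumed convention: rank in [0, 1] maps to an index in the sorted values.
        idx = min(int(rank * len(self._values)), len(self._values) - 1)
        return self._values[idx]


class VectorOfSketches:
    """Holds one independent sketch per vector element (dimension)."""

    def __init__(self, num_dims, sketch_factory=ExactSketch):
        self._sketches = [sketch_factory() for _ in range(num_dims)]

    def update(self, vector):
        if len(vector) != len(self._sketches):
            raise ValueError("vector length must match the number of sketches")
        # Element i of the vector goes to sketch i.
        for sketch, x in zip(self._sketches, vector):
            sketch.update(x)

    def get_quantiles(self, rank):
        # One result per dimension, in dimension order.
        return [sketch.get_quantile(rank) for sketch in self._sketches]


vs = VectorOfSketches(3)
for row in ([1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]):
    vs.update(row)
medians = vs.get_quantiles(0.5)
```

Written in Python, the loop in `update` is exactly the per-element iteration the thread expects to be slow; the point of doing the same thing in a C++ object is that the loop runs in compiled code.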

  jon

On Thu, May 7, 2020 at 1:08 PM Michael Himes <[email protected]> wrote:
I would be happy to share whatever I come up with (if anything).  The lack of a 
Numpy/Scipy implementation is what led me to the DataSketches library; it would 
be very useful to me and others if it were a part of Numpy/Scipy.

For what it's worth, passing in a Numpy array and manipulating it from the C++ 
side is quite easy.  On the other hand, figuring out how to spawn m sketches 
and pass the values along to them looks like it'll be more challenging.  There 
is a lot of code here and it'll take some time for me to familiarize myself 
with it.

Michael
________________________________
From: Lee Rhodes <[email protected]>
Sent: Thursday, May 7, 2020 12:00 PM
To: Michael Himes <[email protected]>
Cc: Edo Liberty <[email protected]>; [email protected] <[email protected]>; 
[email protected] <[email protected]>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

If you do figure out how to do this, it would be great if you could share it 
with us.  We would like to extend it to other sketches and submit it as an 
added functionality to NumPy.  I have been looking at the NumPy and SciPy 
libraries and have not found anything close to what we have.

Lee.

On Thu, May 7, 2020 at 7:08 AM Michael Himes <[email protected]> wrote:
Hi Lee, Jon,

Thanks for the information.  I tried to vectorize things this morning and ran 
into that exact problem -- since the offsets can differ, it leads to slices of 
different lengths, which wouldn't be possible to store as a single Numpy array.

Lee, your understanding of my problem is spot on.  n vectors of size m, where 
all m elements of each vector are floats (no NaNs or missing values).  I am 
interested in quantiles at rank r for each of the m streams.  Only 1 sketch 
will operate at a time, saving/loading the sketch is not required (though 
it would be a nice feature), and sketches would not need to be merged (no 
serialization/deserialization).

Not surprisingly, it looks like your original suggestion of handling this on 
the C++ side is the way to go.  Once I have time to dive into the code, my plan 
is to write something that implements what you described in the earlier email.

Thanks,
Michael
________________________________
From: Lee Rhodes <[email protected]>
Sent: Wednesday, May 6, 2020 10:43 PM
To: Michael Himes <[email protected]>
Cc: [email protected] <[email protected]>; Edo Liberty <[email protected]>; 
[email protected] <[email protected]>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not work 
for another reason: for each dimension, the choice of whether to delete the odd 
or even values in the compactor must be random and independent of the other 
dimensions.  Otherwise you might get unwanted correlation effects between the 
dimensions.

This is another argument that you should have independent compactors for each 
dimension.  So you might as well stick with individual sketches for each 
dimension.
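As a rough illustration of the point about independent randomness, the sketch below compacts one buffer per dimension and draws a separate coin flip for each. It is a toy, not the DataSketches or streaming-quantiles implementation: `compact` and `compact_each_dimension` are hypothetical helper names, and the buffers are small made-up lists.

```python
import random


def compact(buffer, offset):
    """Sort the buffer and keep every other item, starting at `offset` (0 or 1)."""
    return sorted(buffer)[offset::2]


def compact_each_dimension(buffers, rng):
    """Compact one buffer per dimension, with a fresh coin flip for each."""
    survivors, offsets = [], []
    for buf in buffers:
        offset = rng.randrange(2)  # independent draw for this dimension only
        offsets.append(offset)
        survivors.append(compact(buf, offset))
    return survivors, offsets


rng = random.Random()
buffers = [[5.0, 1.0, 4.0, 2.0, 3.0, 0.0] for _ in range(4)]  # 4 dimensions
survivors, offsets = compact_each_dimension(buffers, rng)
```

A vectorized compactor that applied a single shared offset to every dimension would discard the same positions everywhere, which is the correlation effect the message warns about.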

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]> wrote:
Michael,

Allow me to back up for a moment to make sure I understand your problem.

You have a large number of large vectors of the form V_n = {x_i}:  n vectors of 
size m, where x is a number and x_i is the ith element, or equivalently, the 
ith dimension.

Assumptions:

  *   All vectors, V, are of the same size m.
  *   All elements, x_i, are valid numbers of the same type. No missing values, 
and if you are using floats, this means no NaNs.

In aggregate, the n vectors represent m independent distributions of values.

Your task is to be able to obtain m quantiles at rank r in a single query.
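To make the task concrete, here is an exact (non-streaming) reference computation on a tiny made-up data set. It holds all n vectors in memory, which is precisely what a sketch is meant to avoid, so it is only useful as a definition of the expected answer. The function name `exact_quantiles` and the rank-to-index rule are assumptions for illustration.

```python
def exact_quantiles(vectors, rank):
    """Return one quantile per dimension for n equal-length vectors.

    Assumed convention: rank in [0, 1] selects index int(rank * n) in the
    sorted values of each dimension, clamped to the last element.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    result = []
    # zip(*vectors) regroups the data so each tuple is one dimension's stream.
    for column in zip(*vectors):
        ordered = sorted(column)
        idx = min(int(rank * len(ordered)), len(ordered) - 1)
        result.append(ordered[idx])
    return result


# n = 4 vectors of size m = 2
vectors = [[3.0, 0.5], [1.0, 0.7], [2.0, 0.6], [4.0, 0.8]]
q = exact_quantiles(vectors, 0.5)
```

Here `q` has m entries, one per dimension, which is the shape a single vector query would return.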

****
To do this, using your idea, would require vectorization of the entire sketch 
and not just the compactors.  The inputs are vectors, the result of operations 
such as getQuantile(r), getQuantileUpperBound(r), getQuantileLowerBound(r), are 
also vectors.

This sketch will be a large data structure, which leads to more questions ...

  *   Do you anticipate having many of these vectorized sketches operating 
simultaneously?
  *   Is there any requirement to store and later retrieve this sketch?
  *   Or, the nearly equivalent question: Do you require merging of these 
sketches (across clusters, for example)?  Which also means serialization and 
deserialization.

I am concerned that this vector-quantiles sketch would be limited in the sense 
that it may not be as widely applicable as it could be.

Our experience with real data is that it is ugly, with missing values, NaNs, 
nulls, etc., which means we would not be able to vectorize the compactor.  Each 
dimension i would need a separate independent compactor because the compaction 
times will vary depending on missing values or NaNs in the data.

Spacewise, I don't think having separate independent sketches for each 
dimension would be much smaller than vectorizing the entire sketch, because the 
internals of the existing sketch are already quite space efficient leveraging 
compact arrays, etc.

As a first step I would favor figuring out how to access the NumPy data 
structure on the C++ side, having individual sketches for each dimension, and 
doing the iterations updating the sketches in C++.   It also has the advantage 
of leveraging code that exists and it would automatically be able to leverage 
any improvements to the sketch code over time.  In addition, it could be a 
prototype of how to integrate other sketches into the NumPy ecosystem.

A fully vectorized sketch would be a separate implementation and would not be 
able to take advantage of these points.

Lee.

On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]> wrote:
Hi Lee,

I don't think there is a problem with the DataSketches library, just that it 
doesn't support what I am trying to do -- looking in the documentation, it only 
supports streams of ints or floats, and those situations work fine for me.  
Here's what I did:
- began with the KLL test .py file: 
https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
- replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy 
array of 10 identical values.
- ran the code

This leads to the following error, as expected:
TypeError: update(): incompatible function arguments. The following argument 
types are supported:
    1. (self: datasketches.kll_floats_sketch, item: float) -> None

Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, 
array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
       -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])

It's not coded to support Numpy arrays, so it complains.  Ideally, in this 
scenario it would treat each element in the array as a separate stream.  Then, 
later when getting a given quantile, it would give 10 values, one for each 
stream.  I don't see an easy approach to implementing this on the Python side 
besides a very slow iterative approach, and admittedly my C++ is quite rusty so 
I haven't looked into the codebase to see how I might modify things there to 
support this functionality.

Re: the streaming-quantiles code being easily modified, I believe the only 
necessary changes would be changing the Compactor class to be a subclass of 
numpy.ndarray, rather than list, and implementing the list-specific methods 
that are used, like .append().  Then, it isn't necessary to loop over the 
streams since we can make use of Numpy's broadcasting, which will handle the 
looping in its compiled code, as you mentioned.  I'll work on this and see if 
it really is as straightforward as it seems.

If you have any advice on how to use DataSketches for my problem, I'm certainly 
open to that.

Thanks,
Michael
________________________________
From: Lee Rhodes <[email protected]>
Sent: Wednesday, May 6, 2020 4:37 PM
To: Michael Himes <[email protected]>; [email protected] <[email protected]>
Cc: Edo Liberty <[email protected]>; [email protected] <[email protected]>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

Michael,

Thank you for considering the DataSketches library.   I am adding this thread 
to our [email protected] list so that our whole team can contribute to finding a 
solution for you.

WRT the error you experienced, please help us help you by sharing with us what 
the exact error was.

We are about to release a major upgrade to the DataSketches C++/Python product 
in the next few weeks.  We have fixed a number of stability issues and bugs, 
which may solve the problem.  Nonetheless, we want to work with you to get your 
problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We have 
real-time systems today that generate and process over 1e9 sketches every day.  
Unfortunately our experience tells us that looping in Python code will be 10 to 
100 times slower than Java or C++.  This is because the code would have to 
switch from Python to C++ for every vector element.

> By comparison, the streaming-quantiles code could be easily modified to use 
> Numpy arrays and operate on vectors.

I would like to understand more about what you have in mind that would be 
"easily modified".

NumPy achieves its speed by doing all of the matrix operations in pre-compiled 
C code.  To achieve best performance, we would want to read and loop through 
the NumPy data structure on the C++ side, leveraging the C++ DataSketches 
library directly.  I am not sure what would be involved to actually accomplish 
that.

But first we need to get your Python + NumPy code working correctly with our 
library so we can find out what its actual performance is.

Cheers,

Lee.

On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]> wrote:
Hi Edo, Lee,

Thanks for the prompt response.  I looked at the datasketches library, and 
while it seems to have a lot more features, it looks like it'll be a lot more 
difficult to get it to work for my desired use case.

My problem is that I need quantiles for each element of a vector (length on the 
order of 1e4 -- 1e5), for some finite stream of vectors (on the order of 1e6 -- 
1e8).  I tried using datasketches's KLL with Numpy arrays, but it throws an 
error, so it doesn't seem like datasketches handles this situation currently.

To use datasketches, I think I would need to instantiate 1 object per vector 
element, and I suspect this will slow things down considerably due to iterating 
over the objects when each vector is processed.  By comparison, the 
streaming-quantiles code could be easily modified to use Numpy arrays and 
operate on vectors.  I ran a few unit tests on both codes and found equivalent 
behavior, as expected.

Do you have any recommendation(s) for this situation?  Are there known 
limitations of the streaming-quantiles code that would cause issues for my use 
case?  Are the other methods offered in datasketches 'better' than the KLL 
implemented in streaming-quantiles?  I'm quite out of my area of expertise, so 
I appreciate any advice you can offer, and I will of course acknowledge it in 
the publication.

Best,
Michael

________________________________
From: Edo Liberty <[email protected]>
Sent: Tuesday, May 5, 2020 8:09 PM
To: Lee Rhodes <[email protected]>; Michael Himes <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: Permission to use KLL streaming-quantiles code in free open-source 
academic software

+Lee

Hi Michael, Thanks for reaching out.
While you can certainly do that, I recommend using the datasketches library 
through its Python bindings. It will be more robust, faster, and less buggy 
than my code :)

On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]> wrote:
Hi Edo,

I'm currently working on a Python package for machine-learning-accelerated 
exoplanet modeling.  It is free and open source (see here if you're curious 
https://github.com/exosports/HOMER), and it's meant purely for reproducible 
academic research.

I'm adding some new features to the software, and one of them requires 
computing quantiles for a data set that cannot fit into memory.  After 
searching around for different methods to do this, your KLL method seemed to be 
a good option in terms of speed and space requirements.

Rather than reinvent the wheel and code my own implementation of the method 
from scratch, I was wondering if you'd be willing to allow me to use your code?  
I don't see a license, so I wanted to make sure you're okay with this.  I 
could implement it as a submodule within my repo, or I could only include the 
kll.py file and add some additional comments pointing to your repo and such, 
whichever you prefer.

Best,
Michael
--
From my cell phone.
