Re: Permission to use KLL streaming-quantiles code in free open-source academic software

leerho Sat, 09 May 2020 16:22:17 -0700

This is great.  The first step is to get your project working!  Once you
think you are ready, it would be really useful if you could do some
characterization testing in the NumPy environment. Characterization tests
are what we run to fully understand how a sketch performs over a range of
parameters and using thousands to millions of trials.  You can see some of
the accuracy and speed performance plots of various sketches on our
website.  Sometimes these can take hours to run.  We typically use
synthetic data to drive our characterization tests to make them
reproducible.


Real data can also be used and one comparison test I would recommend is
comparing how long it takes to get approximate results using sketches
versus how long it would take to get exact results using brute force
methods.  The bigger the data set is the better :)

We don't have much experience with NumPy so this will be a new environment
for us.  But before you get too deep into this please get us involved.  We
have been characterizing these streaming algorithms for a number of years,
and would like to help you.

Cheers,

Lee.

On Sat, May 9, 2020 at 2:18 PM Michael Himes <[email protected]> wrote:

> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <[email protected]>; [email protected]
> <[email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <[email protected]>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <[email protected]>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> [email protected] email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> [email protected].
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <[email protected]>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <[email protected]>
> Cc: Lee Rhodes <[email protected]>, Edo Liberty <
> [email protected]>, [email protected] <[email protected]>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <[email protected]>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <[email protected]>
> *Cc:* Edo Liberty <[email protected]>; [email protected] <
> [email protected]>; [email protected] <[email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NomPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <[email protected]>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <[email protected]>
> *Cc:* [email protected] <[email protected]>; Edo
> Liberty <[email protected]>; [email protected] <[email protected]>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C396e269486d0461afa3808d7f45cdf99%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637246552046519742&sdata=3lw%2BqdC8lUTjPK1fvpsR%2BJvq4GRVd8PfXixmSQcgT90%3D&reserved=0>
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <[email protected]>; [email protected]
> <[email protected]>
> *Cc:* Edo Liberty <[email protected]>; [email protected] <
> [email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our [email protected] so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <[email protected]>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <[email protected]>; Michael Himes <
> [email protected]>
> *Cc:* [email protected] <[email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER
> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C396e269486d0461afa3808d7f45cdf99%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637246552046529736&sdata=2i6eaLbL5ctkgYSIZ2LUTh8S1DQ4cJKS0jDp63i1sA8%3D&reserved=0>),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Reply via email to