Re: Permission to use KLL streaming-quantiles code in free open-source academic software

leerho Sun, 10 May 2020 16:22:17 -0700

Michael,
This is great!  What testing have you been able to do so far?


On Sun, May 10, 2020 at 3:31 PM Michael Himes <[email protected]>
wrote:

> Lee,
>
> Thanks for all of that information, it's quite helpful to get a better
> understanding of things.
>
> I've put the code on Github if you'd like to take a look:
> https://github.com/mdhimes/incubator-datasketches-cpp
>
> Changes are
> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
> sketches.
> - new Python interface functions in python/src/kll_wrapper.cpp
>
> The only new dependency introduced is the pybind11/numpy.h header file.
> The new Numpy-compatible Python classes retain identical functionality to
> the existing classes (with minor changes to method names, e.g.,
> get_min_value --> get_min_values), except that I have not yet implemented
> merging or (de)serialization.  These would be straight-forward to
> implement, if needed.
>
> Re: characterization tests, I'll take a look at those tests you linked to
> and see about running them, time and compute resources permitting.
>
> Michael
> ------------------------------
> *From:* leerho <[email protected]>
> *Sent:* Sunday, May 10, 2020 5:32 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Is there a place on GitHub somewhere where I could look at your code so
> far?  The reason I ask, is before you do a PR, we would like to determine
> where a contribution such as this should be placed.
>
> Our library is split up among different repositories, determined by
> language and dependencies.  This keeps the user downloads smaller and more
> focused.   We have two library repos for the core sketch algorithms, one
> for Java and one for C++/Python, where the dependencies are very lean,
> which simplifies integration into other systems.  We have separate repos
> for adaptors, which depend on one of the core repos. On the Java side, we
> have separate repos for adaptors for Apache Hive and Apache Pig, as the
> dependencies for each of these are quite large.  For C++, we have a
> dedicated repo for the adaptors for PostgreSQL.
>
> Some of our adaptors are hosted with the target system.  For example, our
> Druid adaptors were contributed directly into Apache Druid.
>
> I assume your code has dependencies on Python, NumPy and DataSketches-cpp.
> It is not clear to me at the moment whether we should create a separate
> repo for this or have a separate group of directories in our cpp repo.
>
> ****
> We have a separate repo for our characterization code, which is not
> formally "released" as an Apache release.  It exists because we want others
> to be able to reproduce (or challenge) our claims of speed performance or
> accuracy.  It is the one repo where we have all languages and many
> different dependencies.  The coding style is not as rigorous or as well
> documented as our repos that do have formal releases.
>
> Characterization testing is distinctly different from Unit Tests, which
> basically checks all the main code paths and makes sure that the program
> works as it should.  The key metric is code coverage and Unit Tests should
> be fast as it is run on every check-in of new code.  Characterization is
> also different from Integration Testing, which is testing how well the code
> works when integrated into larger systems.
>
> Characterization tests are unique to our kind of library. Because our
> algorithms are probabilistic in nature, in order to verify accuracy or
> speed performance we need to run many thousands of trials to eliminate
> statistical noise in the results.  And when the data is large, this can
> take a long time.  You can peruse our website for many examples as all the
> plots result from various characterization studies.  What appears on the
> website is but a small fraction of all the testing we have done.
>
> There are no "standard" tests as every sketch is different so we have to
> decide what is important to measure for a particular sketch, but the basic
> groups are *speed* and *accuracy*.
>
> For speed there are many possible measurements, but the basic ones are
> update speed, merge speed, Serialization / Deserialization speed, get
> estimate or get result speeds.
>
> For accuracy we want to validate that the sketch is performing within the
> bounds of the theoretical error distribution.  We want to measure this
> accuracy in the context of a stand-alone, purely streaming sketch and also
> in the context of merging many sketches together.
>
> We also try to do these same tests comparing the results against other
> alternatives users might have.  We have performed these same
> characterizations on other publically available sketches as well as against
> traditional, brute-force approaches to solving the same problem.
>
> For the solution you have developed, we would depend on you to decide what
> properties would be most important to characterize for users of this
> solution.  It should be very similar to what you would write in a paper
> describing this solution;  you want to convince the reader that this is
> very useful and why.
>
> Since the first sketch you have leveraged is the KLL quantiles sketch, I
> would think you would want some characterizations similar to what we did
> for our studies
> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C6e6a337d40964f33d20008d7f529c5fc%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637247432083763402&sdata=8hoD1LWVL857LCe0nJtGFeF%2FHNk4yVKm%2BYhEuWAtaTc%3D&reserved=0>
> comparing our older quantiles sketch and the KLL sketch.
>
> ****
> For the Java characterization tests, we have "standardized" on having
> small configuration files which define the key parameters of the test.
> These are simple text files
> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C6e6a337d40964f33d20008d7f529c5fc%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637247432083773400&sdata=%2FbROUq%2B1hItSCprW8BziTzQbM3%2BCHAWInQnXHSDFWeM%3D&reserved=0>
> of key-value pairs.  We don't have any centralized definition of these
> pairs, just that they are human readable and intelligible.  They are
> different for each type of sketch.
>
> For the C++ tests, we don't have a collection of config files yet (this is
> one of our TODOs), but the same kind of parameters are set in the code
> itself.
>
> We will likely want to set up a separate directory for your
> characterization tests.
>
> I hope you find this helpful.
>
> Cheers,
>
> Lee.
>
> On Sun, May 10, 2020 at 10:05 AM Michael Himes <[email protected]>
> wrote:
>
> The code is in a good state now.  It can take individual values, lists, or
> Numpy arrays as input, and it returns back Numpy arrays.  There are some
> additional features, like being able to specify which sketches the user
> wants to, e.g., get quantiles for.
>
> But, I have only done minor testing with uniform and normal
> distributions.  I'd like to put it through more extensive testing (and some
> documentation) before releasing it, and it sounds like your
> characterization tests are the way to go -- it's not science if it's not
> reproducible!  Is there a standard set of tests for this purpose?  If not,
> are there standard tests that have been used for the existing codebase?
>
> Michael
> ------------------------------
> *From:* leerho <[email protected]>
> *Sent:* Saturday, May 9, 2020 7:21 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is great.  The first step is to get your project working!  Once you
> think you are ready, it would be really useful if you could do some
> characterization testing in the NumPy environment. Characterization tests
> are what we run to fully understand how a sketch performs over a range of
> parameters and using thousands to millions of trials.  You can see some of
> the accuracy and speed performance plots of various sketches on our
> website.  Sometimes these can take hours to run.  We typically use
> synthetic data to drive our characterization tests to make them
> reproducible.
>
> Real data can also be used and one comparison test I would recommend is
> comparing how long it takes to get approximate results using sketches
> versus how long it would take to get exact results using brute force
> methods.  The bigger the data set is the better :)
>
> We don't have much experience with NumPy so this will be a new environment
> for us.  But before you get too deep into this please get us involved.  We
> have been characterizing these streaming algorithms for a number of years,
> and would like to help you.
>
> Cheers,
>
> Lee.
>
> On Sat, May 9, 2020 at 2:18 PM Michael Himes <[email protected]>
> wrote:
>
> I'm not quite sure what being a committer entails, but yeah I'm happy to
> contribute.  I can't commit a lot of time to working on it, but with how
> things went for KLL I don't think it will take a lot of time for the other
> sketches if they are formatted in a similar manner.  Getting this library
> integrated into numpy/scipy would be awesome, I'm sure I could get some
> others in my field to begin using it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Saturday, May 9, 2020 5:06 PM
> *To:* Michael Himes <[email protected]>; [email protected]
> <[email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> This is just awesome!   Would you be interested in becoming a committer on
> our project?  It is not automatic, but we could work with you to bring you
> up to speed on the other sketches in the library.  If you could help us
> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
> necessary) it would be a very significant contribution and we would
> definitely want you to be part of our community!
>
> Thanks,
>
> Lee.
>
> On Sat, May 9, 2020 at 1:41 PM Michael Himes <[email protected]>
> wrote:
>
> Hi Lee,
>
> Thanks for the notice, I went ahead and subscribed to the list.
>
> As for Jon's email, this is actually what I have currently implemented!
> Once I finish ironing out a couple improvements, I'm going to move some
> code around to follow the existing coding style, put it on Github, and
> submit a pull request.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Saturday, May 9, 2020 4:22 PM
> *To:* Michael Himes <[email protected]>
> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Hi Michael,
> I don't think you saw this email as I doubt you are subscribed to our
> [email protected] email list.
>
> We would like to have you as part of our larger community, as others might
> also have suggestions on how to move your project forward.
> You can subscribe by sending an empty email to
> [email protected].
>
> Lee.
>
> ---------- Forwarded message ---------
> From: *Jon Malkin* <[email protected]>
> Date: Thu, May 7, 2020 at 4:11 PM
> Subject: Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
> To: <[email protected]>
> Cc: Lee Rhodes <[email protected]>, Edo Liberty <
> [email protected]>, [email protected] <[email protected]>
>
>
> We're using pybind11 to get a C++ interface with python (vs raw C). The
> wrappers themselves are quite thin, but they do have examples of calling
> functions defined in the wrapper as opposed to only the sketch object.
>
> I believe the easiest way to do this will be to define a pretty simple C++
> object and create a pybind wrapper for it.  That object would contain a
> std::vector<kll_sketch>.  Then you'd define an update method for your
> custom object that iterates through a numpy array and calls update() on the
> appropriate sketch. You'd also want to define something similar for
> get_quantile() or whatever other methods you need that iterates through
> that vector of sketches and returns the result in a numpy array.
>
> That's a pretty lightweight object. And then you'd use a similar thin
> pybind wrapper around it to make it play nicely with python. Since our C++
> library is just templates, you'd end up with a free-standing library, with
> no requirement that the base datasketches library be involved.
>
>   jon
>
> On Thu, May 7, 2020 at 1:08 PM Michael Himes <[email protected]>
> wrote:
>
> I would be happy to share whatever I come up with (if anything).  The lack
> of a Numpy/Scipy implementation is what led me to the DataSketches library,
> it would be very useful to myself and others if it were a part of
> Numpy/Scipy.
>
> For what it's worth, passing in a Numpy array and manipulating it from the
> C++ side is quite easy.  On the other hand, figuring out how to spawn m
> sketches and pass the values along to that looks like it'll be more
> challenging, there is a lot of code here and it'll take some time for me to
> familiarize myself with it.
>
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Thursday, May 7, 2020 12:00 PM
> *To:* Michael Himes <[email protected]>
> *Cc:* Edo Liberty <[email protected]>; [email protected] <
> [email protected]>; [email protected] <[email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> If you do figure out how to do this, it would be great if you could share
> it with us.  We would like to extend  it to other sketches and submit it as
> an added functionality to NumPy.  I have been looking at the NomPy and
> SciPy libraries and have not found anything close to what we have.
>
> Lee.
>
>
> On Thu, May 7, 2020 at 7:08 AM Michael Himes <[email protected]>
> wrote:
>
> Hi Lee, Jon,
>
> Thanks for the information.  I tried to vectorize things this morning and
> ran into that exact problem -- since the offsets can differ, it leads to
> slices of different lengths, which wouldn't be possible to store as a
> single Numpy array.
>
> Lee, your understanding of my problem is spot on.  n vectors of size m,
> where all m elements of each vector are a float (no NaNs or missing
> values).  I am interested in quantiles at rank r for each of the m
> streams.  Only 1 sketch will operate simultaneously, saving/loading the
> sketch is not required (though it would be a nice feature), and sketches
> would not need to be merged (no serialization/deserialization).
>
> Not surprisingly, it looks like your original suggestion of handling this
> on the C++ side is the way to go.  Once I have time to dive into the code,
> my plan is to write something that implements what you described in the
> earlier email.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Wednesday, May 6, 2020 10:43 PM
> *To:* Michael Himes <[email protected]>
> *Cc:* [email protected] <[email protected]>; Edo
> Liberty <[email protected]>; [email protected] <[email protected]>
>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
> work for another reason and that is for each dimension, choosing whether to
> delete the odd or even values in the compactor must be random and
> independent of the other dimensions.  Otherwise you might get unwanted
> correlation effects between the dimensions.
>
> This is another argument that you should have independent compactors for
> each dimension.  So you might as well stick with individual sketches for
> each dimension.
>
> Lee.
>
> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]>
> wrote:
>
> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]>
> wrote:
>
> Hi Lee,
>
> I don't think there is a problem with the DataSketches library, just that
> it doesn't support what I am trying to do -- looking in the documentation,
> it only supports streams of ints or floats, and those situations work fine
> for me.  Here's what I did:
> - began with the KLL test .py file:
> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C6e6a337d40964f33d20008d7f529c5fc%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637247432083773400&sdata=oVBbHOLxj1c4WzQr%2BZAROfSEZwmvIf4fpPKhVi409cY%3D&reserved=0>
> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a Numpy
> array of 10 identical values.
> - ran the code
>
> This leads to the following error, as expected:
> TypeError: update(): incompatible function arguments. The following
> argument types are supported:
>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>
> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>
> It's not coded to support Numpy arrays, therefore it complains.  What I
> would ideally like to have happen in this scenario is it would treat each
> element in the array as a separate stream.  Then, later when getting a
> given quantile, it would give 10 values, one for each stream.  I don't see
> an easy approach to implementing this on the Python side besides a very
> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
> looked into the codebase to see how I might modify things there to support
> this functionality.
>
> Re: the streaming-quantiles code being easily modified, I believe the only
> necessary changes would be changing the Compactor class to be a subclass of
> numpy.ndarray, rather than list, and implementing methods for the
> list-specific methods that are used, like .append().  Then, it isn't
> necessary to loop over the streams since we can make use of Numpy's
> broadcasting, which will handle the looping in its C++ code, as you
> mentioned.  I'll work on this and see if it really is as straight-forward
> as it seems.
>
> If you have any advice on how to use DataSketches for my problem, I'm
> certainly open to that.
>
> Thanks,
> Michael
> ------------------------------
> *From:* Lee Rhodes <[email protected]>
> *Sent:* Wednesday, May 6, 2020 4:37 PM
> *To:* Michael Himes <[email protected]>; [email protected]
> <[email protected]>
> *Cc:* Edo Liberty <[email protected]>; [email protected] <
> [email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> Michael,
>
> Thank you for considering the DataSketches library.   I am adding this
> thread to our [email protected] so that our whole team can
> contribute to finding a solution for you.
>
> WRT the error you experienced, please help us help you by sharing with us
> what the exact error was.
>
> We are about to release a major upgrade to the DataSketches C++/Python
> product in the next few weeks.  We have fixed a number of stability issues
> and bugs, which may solve the problem.  Nonetheless, we want to work with
> you to get your problem solved.
>
> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
> have real-time systems today that generate and process over 1e9 sketches
> every day.  Unfortunately our experience tells us that looping in Python
> code will be 10 to 100 times slower than Java or C++.  This is because the
> code would have to switch from Python to C++ for every vector element.
>
> By comparison, the streaming-quantiles code could be easily modified to
> use Numpy arrays and operate on vectors.
>
>
> I would like to understand more about what you have in mind that would be
> "easily modified".
>
> NumPy achieves its speed performance by doing all of the matrix operations
> in pre-compiled C++ code.  To achieve best performance, we would want to
> read and loop through the NumPy data structure on the C++ side leveraging
> the C++ DataSketches library directly.  I am not sure what would be
> involved to actually accomplish that.
>
> But first we need to get your Python + NumPy code working correctly with
> our library so we can find out what its actual performance is.
>
> Cheers,
>
> Lee.
>
>
>
>
>
> On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]>
> wrote:
>
> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <[email protected]>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <[email protected]>; Michael Himes <
> [email protected]>
> *Cc:* [email protected] <[email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the python-Binded
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER
> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C6e6a337d40964f33d20008d7f529c5fc%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637247432083783391&sdata=8xK%2F9E7wNDY30KDOfzmb5UPb2XEHeEaUnFAuhyIrX44%3D&reserved=0>),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
> --
> From my cell phone.
>
>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Reply via email to