Re: Permission to use KLL streaming-quantiles code in free open-source academic software

leerho Tue, 19 May 2020 16:47:32 -0700

Excellent work!

On Tue, May 19, 2020 at 4:04 PM Jon Malkin <[email protected]> wrote:


> I also used k=160, so in this case we matched nicely. And the bunches of
> 2^5 or 2^7 you were testing is exactly what I meant when referring to
> batched inputs. So that's good news.
>
> I'll take a more careful look through the code -- there was something with
> update using arrays of templated type A which was always cast to double,
> for instance. But this is certainly promising.
>
>   jon
>
> On Tue, May 19, 2020 at 3:32 PM Michael Himes <[email protected]>
> wrote:
>
>> Great tests (especially with the ordering), Jon!
>>
>> I did some scaling tests for dimensionality (1, 10, and 100), and this is
>> where I think the Numpy version shows its benefits.  I performed a test
>> similar to your setup:
>> - each sketch has k = 160 (unsure what you used for this value, if it
>> matters)
>> - 2^25 draws from a normalized Gaussian distribution (numpy.random.normal)
>> - get_quantiles(0.5)
>>
>> d=1    -- 84 s (this is the 123 s case you ran)
>> d=10   -- 88 s
>> d=100  --  294 s
>> d=1000 -- 2298 s (did this one for fun, but there is a lot of variability
>> in runtime)
>>
>> Note that I did not use a single-value method, just the Numpy version.
>> Also, I checked the compute cost of the Python loop, and it's about 1
>> second, so most of that ~80 seconds is the communication between Python and
>> C++.  The scaling relation looks to be better than linear, but there needs
>> to be a few more tests here to really determine that.
>>
>> But, as Lee pointed out, there is non-negligible overhead from crossing
>> the bridge between Python and C++.  It's small, but when doing it 2^25
>> times it adds up.  The Numpy implementation allows you to cross that bridge
>> much less often, albeit at the cost of some extra time programming that
>> part.  If I set up a queue that holds 2^5 values and then updates it, it's
>> quite a bit better.  Here are the results for the same dimensions as before:
>>
>> d=1   -- 8 s
>> d=10  -- 31 s
>> d=100 -- 257 s
>>
>> So, even with a small queue of 32 values, we see that a single sketch
>> using kll_sketches is faster than a kll_sketch by a factor of 2-3.  And
>> with the batch set to 2^7 values (this is how I use it in my project):
>> d=1   -- 4.2 s
>> d=10  -- 27 s
>> d=100 -- 251 s
>>
>> The speed gain doesn't seem to scale with dimensionality, but I think
>> that has more to do with the compute overhead of generating the data since
>> Numpy tends to be faster when working in 1D vs multiple dimensions.  But we
>> can see that it's possible to get runtimes much closer to C++ runtimes than
>> would be expected.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <[email protected]>
>> *Sent:* Tuesday, May 19, 2020 4:58 PM
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Well, one thought was maybe we could always use the vectorized kll in
>> python and make it (relatively) easy to have it work with only 1 dimension.
>> It looks like there's still a non-trivial performance hit from that. But
>> wow.. I realized I could try something simple like reversing the
>> declaration order of single-update vs vector-update in the wrapper class.
>> And that dropped it to 35s!
>>
>> With that, it may be worth exploring a unified wrapper that handles
>> single items or vectors.
>>
>>   jon
>>
>> On Tue, May 19, 2020 at 1:52 PM leerho <[email protected]> wrote:
>>
>> We had a similar issue in Java trying to use JNI to access C code.  Every
>> transition across the "boundary" between Java and C took from 10 to 100
>> microseconds.  This made the JNI option pretty useless from our
>> standpoint.
>>
>> I don't know python that well, but I could well imagine that there may be
>> a similar issue here in moving data between Python and C++.
>>
>> That being said, compared to brute-force computation of these types of
>> queries in Python vs using even these (what we consider slow performing)
>> sketches in Python still may be a huge win.
>>
>> Lee.
>>
>>
>>
>> On Tue, May 19, 2020 at 1:28 PM Jon Malkin <[email protected]> wrote:
>>
>> I tried comparing the performance of the existing floats sketch vs the
>> new thing with a single dimension. And then I made a second update method
>> that handles a single item rather than creating an array of length 1 each
>> time. Otherwise, the scripts were as identical as possible. I fed in 2^25
>> gaussian-distributed values and queried for the mean to force some
>> computation on the sketch. I think get_quantile(0.5) vs
>> get_quantiles(0.5)[0][0] was the only difference,
>>
>> Existing kll_floats_sketch: 31s
>> kll_floatarray_sketches: 123s
>> with single-item update: 80s
>>
>> Same test in c++: 1.7s  (I can get it to 1.4s but that's using a worse
>> RNG so this seemed more fair)
>>
>> I didn't try anything with trying to batch updates, even though in theory
>> the new object can support that. This was more a test to see the
>> performance impact of using it for all kll sketches.
>>
>> At some level, if you're already ok taking the speed hit for python vs
>> C++ then maybe it doesn't matter. But >2x still seems significant to me.
>>
>>   jon
>>
>> On Thu, May 14, 2020 at 6:54 PM Michael Himes <[email protected]>
>> wrote:
>>
>> Great, I'll be submitting the pull request shortly.  The codebase I'm
>> working with doesn't have any of the changes made in the past week or so,
>> hopefully that isn't too much of a hassle to merge.
>>
>> As an aside, my employer encourages us to contribute code to libraries
>> like this, so I'm happy to work on additional features for the Python
>> interface as needed.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <[email protected]>
>> *Sent:* Thursday, May 14, 2020 6:56 PM
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> We've been polishing things up for a release, so that was one of several
>> things that we fixed over the last several days. Thank you for finding it!
>>
>> Anyway, if you're generally happy with the state of things (and are
>> allowed to under any employment terms), I'd encourage you to create pull
>> request to merge your changes into the main repo. It doesn't need to be
>> perfect as we can always make changes as part of the PR review or
>> post-merge.
>>
>> Thanks,
>>   jon
>>
>>
>> On Mon, May 11, 2020 at 2:25 PM Michael Himes <[email protected]>
>> wrote:
>>
>> Thanks for taking a look, Jon.
>>
>> I pushed an update that address 2 & 4.
>>
>> #3 is actually something I had a question about. I've tested passing
>> numpy.nan into the update function, and it doesn't appear to break anything
>> (min, max, etc all still work correctly).  However, the reported number of
>> items per sketch counts the nan entries.  Is this the expected behavior, or
>> should the get_n() method return a number that does not count the nans it
>> has seen?  I expected the latter, so I'm worried that numpy's nan is being
>> treated differently.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <[email protected]>
>> *Sent:* Monday, May 11, 2020 4:32 PM
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> I didn't look in super close detail, but the code overall looks pretty
>> good. Comments are below.
>>
>> Note that not all of these necessarily need changes or replies. I'm just
>> trying to document things we'll want to think about for keeping the library
>> general-purpose (and we can always make changes after merging, of course).
>>
>> 1. I worry the name kll_sketches is confusingly similar to kll_sketch.
>> Maybe vector_kll_sketches? But if there's a way to extend KLL in the future
>> to operate on an entire vector at a time (vs treating each dimension
>> independently) that'd become confusing. I think an inherently vectorized
>> version would be a very different beast, but I always worry I'm not being
>> imaginative enough. If merging into the Apache codebase, I'd probably wait
>> to see what the file looks like with the renaming before a final decision
>> on moving to its own file.
>>
>> 2. What happens if the input to update() has >2 dimensions? If that'd be
>> invalid, we should explicitly check and complain. If it'll Do The Right
>> Thing by operating on the first 2 dimensions (meaning correct indices)
>> that's fine, but otherwise should probably complain.
>>
>> 3. Can this handle sparse input vectors? Not sure how important that is
>> in general, even if your project doesn't require it. kll_sketch will ignore
>> NaNs, so those appearing would mean the number of items per sketch can
>> already differ.
>>
>> 4. I'd probably eat the very slightly increased space and go with 32 bits
>> for the number of dimensions (aka number of sketches). If trying to look at
>> a distribution of values for some machine learning application, it'd be
>> easy to overflow 65k dimensions for some tasks.
>>
>> 5. I imagine you've realized that it's easiest to do unit tests from
>> python in this case. That's another advantage of having this live in the
>> wrapper.
>>
>> 6. Finally, that assert issue is already obsolete :). Asserts were
>> converted if/throw exceptions late last week. It'll be flagged as a
>> conflict in merging, so no worries for now.
>>
>> Looking good at this point. And as I said, not all of these need changes
>> or comments from you.
>>
>>   jon
>>
>> On Mon, May 11, 2020 at 7:09 AM Michael Himes <[email protected]>
>> wrote:
>>
>> Understood, I went ahead and moved the new class to the kll_wrapper.cpp
>> file -- I'll leave it to you to decide if it's better as its own file.
>>
>> Also, while gcc 7.4.0 compiles the code without issue, using gcc 7.5.0
>> throws errors regarding the assert calls in kll_sketch_impl.hpp.  I added
>> an include of assert.h there and then it compiled without issue.  It's
>> possible that other compilers will also complain about that, so maybe this
>> is a good update to the main branch.
>>
>> Michael
>> ------------------------------
>> *From:* Jon Malkin <[email protected]>
>> *Sent:* Sunday, May 10, 2020 10:47 PM
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> My only comment without having looked at actual code is that the new
>> class would be more appropriate in the python wrapper. Maybe even drop it
>> in as it's own file, as that would decrease recompile time a bit when
>> debugging (that's pybind's suggestion, anyway). Probably not a huge
>> difference with how light these wrappers are.
>>
>> If this is something that becomes widely used, to where we look at
>> pushing it into the base library, we'd look at whether we could share any
>> data across sketches. But we're far from that point currently. It'd be nice
>> to need to consider that.
>>
>>   jon
>>
>> On Sun, May 10, 2020, 7:33 PM leerho <[email protected]> wrote:
>>
>> Michael,  this has been a great interchange and certainly will allow us
>> to move forward more quickly.
>>
>> Thank you for working on this on a Mother's Day Sunday!
>>
>> I'm sure Alex and Jon may have more questions, when they get a chance to
>> look at it starting tomorrow.
>>
>> Cheers, and be safe and well!
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 6:25 PM Michael Himes <[email protected]>
>> wrote:
>>
>> Re: testing, so far I've just done glorified unit tests for uniform and
>> normal distributions of varying sizes.  I plan to do some timing tests vs
>> the existing single-sketch Python class to see how it compares for 1, 10,
>> and 100 streams.
>>
>> 1. That makes sense.  One option to allow full Numpy compatibility but
>> without requiring a Python user to use Numpy would be to return everything
>> as lists, rather than Numpy arrays.  Numpy users could then convert those
>> lists into arrays, and non-Numpy users would be unaffected (aside from
>> needing the pybind11/numpy.h header).  Alternatively, some flag could be
>> set when instantiating the object that would control whether things are
>> returned as lists or arrays, though this still requires the numpy.h header
>> file.
>>
>> 2. I didn't change the kll_sketch code, I only defined a new (wrapper)
>> class called kll_sketches, which spawns a user-specified number of
>> sketches.  Each of those sketches are kll_sketch objects and uses all of
>> the existing code for that.  For fast execution in Python, the parallel
>> sketches must be spawned in C++, but the existing Python object could only
>> spawn a single sketch since it wraps the kll_sketch class.  Perhaps the
>> kll_sketches class would be better placed in the python/src/kll_wrapper.cpp
>> file?  I suppose you wouldn't need this class if you weren't using Python.
>>
>> 3. Yes, SerDe is very straight-forward here.  I've marked some stuff as
>> todo's, and that is one of them -- the plan is to do like you described and
>> call the relevant kll_sketch method on each of the sketches and return that
>> to Python in a sensible format.  For deserialization, it would just iterate
>> through them and load them into the kll_sketches object.  I don't require
>> it for my project, so I didn't bother to wrap that yet -- I'll take a look
>> sometime this week after I finish my work for the day, shouldn't take long
>> to do.
>>
>> 4. That makes sense.  Does using Numpy complicate that at all?  My
>> thought is that since under the hood everything is using the existing
>> kll_sketch class, it would have full compatibility with the rest of the
>> library (once SerDe is added in).
>>
>> Michael
>> ------------------------------
>> *From:* leerho <[email protected]>
>> *Sent:* Sunday, May 10, 2020 8:42 PM
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Thanks for the link to your code.  My colleagues, Jon and Alex, will take
>> a closer look this next week.  They wrote this code so they are much closer
>> to it than I.
>>
>> What you have done so far makes sense for you as you want to get this
>> working in the NumPy environment as quickly as possible.  As soon as we
>> start thinking about incorporating this into our library other concerns
>> become important.
>>
>> 1. Adding API calls is the recommended way to add functionality (like
>> NumPy) to a library.  We cannot change API calls in a way that is only
>> useful with NumPy, because it would seriously impact other users of the
>> library that don't need NumPy.  If both sets of calls cannot simultaneously
>> exist in the same sketch API, then we need to consider other alternatives.
>>
>> 2.  Based on our previous discussions, I didn't envision that you would
>> have to change the kll_sketch code itself other than perhaps a "wrapper"
>> class that enables vectorized input to a vector of sketches and a
>> vectorized get result that creates a vector result from a vector of
>> sketches.  This would isolate the changes you need for NumPy from the
>> sketch itself.  This is also much easier to support, maintain and debug.
>>
>> 3. If you don't change the internals of the sketch then SerDe becomes
>> pretty straightforward. I don't know if you need a single serialization
>> that represents a full vector of sketches,  but if you do, then I would
>> just iterate over the individual serdes and figure out how to package it.
>> I really don't think you want to have to rewrite this low-level stuff.
>>
>> 4. Binary compatibility is critically important for us and I think will
>> be important for you as well.  There are two dimensions of binary
>> compatibility: history and language.  This means that a kll sketch
>> serialized from Java, can be successfully read by C++ and visa versa.
>> Similarly, a kll sketch serialized today will be able to be read many years
>> from now.     Another aspect of this would mean being able to collect, say,
>> 100 sketches that were not created using the NumPy version, and being able
>> to put them together in a NumPy vector; and visa versa.
>>
>> I hope all of this make sense to you.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>> On Sun, May 10, 2020 at 4:21 PM leerho <[email protected]> wrote:
>>
>> Michael,
>> This is great!  What testing have you been able to do so far?
>>
>>
>> On Sun, May 10, 2020 at 3:31 PM Michael Himes <[email protected]>
>> wrote:
>>
>> Lee,
>>
>> Thanks for all of that information, it's quite helpful to get a better
>> understanding of things.
>>
>> I've put the code on Github if you'd like to take a look:
>> https://github.com/mdhimes/incubator-datasketches-cpp
>> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdhimes%2Fincubator-datasketches-cpp&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168301125&sdata=FJauemLpFxaPhlYPW8hXcSPwR53tR9crhHYWxHOWfsE%3D&reserved=0>
>>
>> Changes are
>> - new class in kll/include/kll_sketch.hpp, w/ associated constructor in
>> kll/include/kll_sketch_impl.hpp.  This class spawns a specified number of
>> sketches.
>> - new Python interface functions in python/src/kll_wrapper.cpp
>>
>> The only new dependency introduced is the pybind11/numpy.h header file.
>> The new Numpy-compatible Python classes retain identical functionality to
>> the existing classes (with minor changes to method names, e.g.,
>> get_min_value --> get_min_values), except that I have not yet implemented
>> merging or (de)serialization.  These would be straight-forward to
>> implement, if needed.
>>
>> Re: characterization tests, I'll take a look at those tests you linked to
>> and see about running them, time and compute resources permitting.
>>
>> Michael
>> ------------------------------
>> *From:* leerho <[email protected]>
>> *Sent:* Sunday, May 10, 2020 5:32 PM
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Is there a place on GitHub somewhere where I could look at your code so
>> far?  The reason I ask, is before you do a PR, we would like to determine
>> where a contribution such as this should be placed.
>>
>> Our library is split up among different repositories, determined by
>> language and dependencies.  This keeps the user downloads smaller and more
>> focused.   We have two library repos for the core sketch algorithms, one
>> for Java and one for C++/Python, where the dependencies are very lean,
>> which simplifies integration into other systems.  We have separate repos
>> for adaptors, which depend on one of the core repos. On the Java side, we
>> have separate repos for adaptors for Apache Hive and Apache Pig, as the
>> dependencies for each of these are quite large.  For C++, we have a
>> dedicated repo for the adaptors for PostgreSQL.
>>
>> Some of our adaptors are hosted with the target system.  For example, our
>> Druid adaptors were contributed directly into Apache Druid.
>>
>> I assume your code has dependencies on Python, NumPy and
>> DataSketches-cpp. It is not clear to me at the moment whether we should
>> create a separate repo for this or have a separate group of directories in
>> our cpp repo.
>>
>> ****
>> We have a separate repo for our characterization code, which is not
>> formally "released" as an Apache release.  It exists because we want others
>> to be able to reproduce (or challenge) our claims of speed performance or
>> accuracy.  It is the one repo where we have all languages and many
>> different dependencies.  The coding style is not as rigorous or as well
>> documented as our repos that do have formal releases.
>>
>> Characterization testing is distinctly different from Unit Tests, which
>> basically checks all the main code paths and makes sure that the program
>> works as it should.  The key metric is code coverage and Unit Tests should
>> be fast as it is run on every check-in of new code.  Characterization is
>> also different from Integration Testing, which is testing how well the code
>> works when integrated into larger systems.
>>
>> Characterization tests are unique to our kind of library. Because our
>> algorithms are probabilistic in nature, in order to verify accuracy or
>> speed performance we need to run many thousands of trials to eliminate
>> statistical noise in the results.  And when the data is large, this can
>> take a long time.  You can peruse our website for many examples as all the
>> plots result from various characterization studies.  What appears on the
>> website is but a small fraction of all the testing we have done.
>>
>> There are no "standard" tests as every sketch is different so we have to
>> decide what is important to measure for a particular sketch, but the basic
>> groups are *speed* and *accuracy*.
>>
>> For speed there are many possible measurements, but the basic ones are
>> update speed, merge speed, Serialization / Deserialization speed, get
>> estimate or get result speeds.
>>
>> For accuracy we want to validate that the sketch is performing within the
>> bounds of the theoretical error distribution.  We want to measure this
>> accuracy in the context of a stand-alone, purely streaming sketch and also
>> in the context of merging many sketches together.
>>
>> We also try to do these same tests comparing the results against other
>> alternatives users might have.  We have performed these same
>> characterizations on other publically available sketches as well as against
>> traditional, brute-force approaches to solving the same problem.
>>
>> For the solution you have developed, we would depend on you to decide
>> what properties would be most important to characterize for users of this
>> solution.  It should be very similar to what you would write in a paper
>> describing this solution;  you want to convince the reader that this is
>> very useful and why.
>>
>> Since the first sketch you have leveraged is the KLL quantiles sketch, I
>> would think you would want some characterizations similar to what we did
>> for our studies
>> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdatasketches.apache.org%2Fdocs%2FQuantiles%2FKLLSketch.html&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168311123&sdata=fC1FtRi8ex9h0o58K7KYGPL6i9Dv9xq4PiUDZ5yE91o%3D&reserved=0>
>> comparing our older quantiles sketch and the KLL sketch.
>>
>> ****
>> For the Java characterization tests, we have "standardized" on having
>> small configuration files which define the key parameters of the test.
>> These are simple text files
>> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-characterization%2Ftree%2Fmaster%2Fsrc%2Fmain%2Fresources&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168321118&sdata=1VDEQds1NCgY%2FRt6%2BEYLy3VcZz5uBtKEfpONGJhZnJg%3D&reserved=0>
>> of key-value pairs.  We don't have any centralized definition of these
>> pairs, just that they are human readable and intelligible.  They are
>> different for each type of sketch.
>>
>> For the C++ tests, we don't have a collection of config files yet (this
>> is one of our TODOs), but the same kind of parameters are set in the code
>> itself.
>>
>> We will likely want to set up a separate directory for your
>> characterization tests.
>>
>> I hope you find this helpful.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sun, May 10, 2020 at 10:05 AM Michael Himes <[email protected]>
>> wrote:
>>
>> The code is in a good state now.  It can take individual values, lists,
>> or Numpy arrays as input, and it returns back Numpy arrays.  There are some
>> additional features, like being able to specify which sketches the user
>> wants to, e.g., get quantiles for.
>>
>> But, I have only done minor testing with uniform and normal
>> distributions.  I'd like to put it through more extensive testing (and some
>> documentation) before releasing it, and it sounds like your
>> characterization tests are the way to go -- it's not science if it's not
>> reproducible!  Is there a standard set of tests for this purpose?  If not,
>> are there standard tests that have been used for the existing codebase?
>>
>> Michael
>> ------------------------------
>> *From:* leerho <[email protected]>
>> *Sent:* Saturday, May 9, 2020 7:21 PM
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is great.  The first step is to get your project working!  Once you
>> think you are ready, it would be really useful if you could do some
>> characterization testing in the NumPy environment. Characterization tests
>> are what we run to fully understand how a sketch performs over a range of
>> parameters and using thousands to millions of trials.  You can see some of
>> the accuracy and speed performance plots of various sketches on our
>> website.  Sometimes these can take hours to run.  We typically use
>> synthetic data to drive our characterization tests to make them
>> reproducible.
>>
>> Real data can also be used and one comparison test I would recommend is
>> comparing how long it takes to get approximate results using sketches
>> versus how long it would take to get exact results using brute force
>> methods.  The bigger the data set is the better :)
>>
>> We don't have much experience with NumPy so this will be a new
>> environment for us.  But before you get too deep into this please get us
>> involved.  We have been characterizing these streaming algorithms for a
>> number of years, and would like to help you.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 2:18 PM Michael Himes <[email protected]>
>> wrote:
>>
>> I'm not quite sure what being a committer entails, but yeah I'm happy to
>> contribute.  I can't commit a lot of time to working on it, but with how
>> things went for KLL I don't think it will take a lot of time for the other
>> sketches if they are formatted in a similar manner.  Getting this library
>> integrated into numpy/scipy would be awesome, I'm sure I could get some
>> others in my field to begin using it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <[email protected]>
>> *Sent:* Saturday, May 9, 2020 5:06 PM
>> *To:* Michael Himes <[email protected]>; [email protected]
>> <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> This is just awesome!   Would you be interested in becoming a committer
>> on our project?  It is not automatic, but we could work with you to bring
>> you up to speed on the other sketches in the library.  If you could help us
>> integrate DataSketches into NumPy and possibly SciPy (not sure if this is
>> necessary) it would be a very significant contribution and we would
>> definitely want you to be part of our community!
>>
>> Thanks,
>>
>> Lee.
>>
>> On Sat, May 9, 2020 at 1:41 PM Michael Himes <[email protected]>
>> wrote:
>>
>> Hi Lee,
>>
>> Thanks for the notice, I went ahead and subscribed to the list.
>>
>> As for Jon's email, this is actually what I have currently implemented!
>> Once I finish ironing out a couple improvements, I'm going to move some
>> code around to follow the existing coding style, put it on Github, and
>> submit a pull request.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <[email protected]>
>> *Sent:* Saturday, May 9, 2020 4:22 PM
>> *To:* Michael Himes <[email protected]>
>> *Subject:* Fwd: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Hi Michael,
>> I don't think you saw this email as I doubt you are subscribed to our
>> [email protected] email list.
>>
>> We would like to have you as part of our larger community, as others
>> might also have suggestions on how to move your project forward.
>> You can subscribe by sending an empty email to
>> [email protected].
>>
>> Lee.
>>
>> ---------- Forwarded message ---------
>> From: *Jon Malkin* <[email protected]>
>> Date: Thu, May 7, 2020 at 4:11 PM
>> Subject: Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>> To: <[email protected]>
>> Cc: Lee Rhodes <[email protected]>, Edo Liberty <
>> [email protected]>, [email protected] <[email protected]>
>>
>>
>> We're using pybind11 to get a C++ interface with python (vs raw C). The
>> wrappers themselves are quite thin, but they do have examples of calling
>> functions defined in the wrapper as opposed to only the sketch object.
>>
>> I believe the easiest way to do this will be to define a pretty simple
>> C++ object and create a pybind wrapper for it.  That object would contain a
>> std::vector<kll_sketch>.  Then you'd define an update method for your
>> custom object that iterates through a numpy array and calls update() on the
>> appropriate sketch. You'd also want to define something similar for
>> get_quantile() or whatever other methods you need that iterates through
>> that vector of sketches and returns the result in a numpy array.
>>
>> That's a pretty lightweight object. And then you'd use a similar thin
>> pybind wrapper around it to make it play nicely with python. Since our C++
>> library is just templates, you'd end up with a free-standing library, with
>> no requirement that the base datasketches library be involved.
>>
>>   jon
>>
>> On Thu, May 7, 2020 at 1:08 PM Michael Himes <[email protected]>
>> wrote:
>>
>> I would be happy to share whatever I come up with (if anything).  The
>> lack of a Numpy/Scipy implementation is what led me to the DataSketches
>> library, it would be very useful to myself and others if it were a part of
>> Numpy/Scipy.
>>
>> For what it's worth, passing in a Numpy array and manipulating it from
>> the C++ side is quite easy.  On the other hand, figuring out how to spawn m
>> sketches and pass the values along to that looks like it'll be more
>> challenging, there is a lot of code here and it'll take some time for me to
>> familiarize myself with it.
>>
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <[email protected]>
>> *Sent:* Thursday, May 7, 2020 12:00 PM
>> *To:* Michael Himes <[email protected]>
>> *Cc:* Edo Liberty <[email protected]>; [email protected] <
>> [email protected]>; [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> If you do figure out how to do this, it would be great if you could share
>> it with us.  We would like to extend  it to other sketches and submit it as
>> an added functionality to NumPy.  I have been looking at the NomPy and
>> SciPy libraries and have not found anything close to what we have.
>>
>> Lee.
>>
>>
>> On Thu, May 7, 2020 at 7:08 AM Michael Himes <[email protected]>
>> wrote:
>>
>> Hi Lee, Jon,
>>
>> Thanks for the information.  I tried to vectorize things this morning and
>> ran into that exact problem -- since the offsets can differ, it leads to
>> slices of different lengths, which wouldn't be possible to store as a
>> single Numpy array.
>>
>> Lee, your understanding of my problem is spot on.  n vectors of size m,
>> where all m elements of each vector are a float (no NaNs or missing
>> values).  I am interested in quantiles at rank r for each of the m
>> streams.  Only 1 sketch will operate simultaneously, saving/loading the
>> sketch is not required (though it would be a nice feature), and sketches
>> would not need to be merged (no serialization/deserialization).
>>
>> Not surprisingly, it looks like your original suggestion of handling this
>> on the C++ side is the way to go.  Once I have time to dive into the code,
>> my plan is to write something that implements what you described in the
>> earlier email.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <[email protected]>
>> *Sent:* Wednesday, May 6, 2020 10:43 PM
>> *To:* Michael Himes <[email protected]>
>> *Cc:* [email protected] <[email protected]>; Edo
>> Liberty <[email protected]>; [email protected] <[email protected]>
>>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> One of my colleagues, Jon Malkin, pointed out that the vector-KLL will
>> not work for another reason and that is for each dimension, choosing
>> whether to delete the odd or even values in the compactor must be random
>> and independent of the other dimensions.  Otherwise you might get unwanted
>> correlation effects between the dimensions.
>>
>> This is another argument that you should have independent compactors for
>> each dimension.  So you might as well stick with individual sketches for
>> each dimension.
>>
>> Lee.
>>
>> On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]>
>> wrote:
>>
>> Michael,
>>
>> Allow me to back up for a moment to make sure I understand your problem.
>>
>> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
>> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
>> element, or equivalently, the *i*th dimension.
>>
>> Assumptions:
>>
>>    - All vectors, *V*, are of the same size *m.*
>>    - All elements, *x_i*, are valid numbers of the same type. No missing
>>    values, and if you are using *floats*, this means no *NaN*s.
>>
>> In aggregate, the *n* vectors represent *m* *independent* distributions
>> of values.
>>
>> Your task is to be able to obtain *m* quantiles at rank *r* in a single
>> query.
>>
>> ****
>> To do this, using your idea, would require vectorization of the entire
>> sketch and not just the compactors.  The inputs are vectors, the result of
>> operations such as getQuantile(r), getQuantileUpperBound(r),
>> getQuantileLowerBound(r), are also vectors.
>>
>> This sketch will be a large data structure, which leads to more questions
>> ...
>>
>>    - Do you anticipate having many of these vectorized sketches
>>    operating simultaneously?
>>    - Is there any requirement to store and later retrieve this sketch?
>>    - Or, the nearly equivalent question: Do you require merging of these
>>    sketches (across clusters, for example)?  Which also means serialization
>>    and deserialization.
>>
>> I am concerned that this vector-quantiles sketch would be limited in the
>> sense that it may not be as widely applicable as it could be.
>>
>> Our experience with real data is that it is ugly with missing values,
>> NaN, nulls, etc.  Which means we would not be able to vectorize the
>> compactor.  Each dimension *i* would need a separate independent
>> compactor because the compaction times will vary depending on missing
>> values or NaNs in the data.
>>
>> Spacewise, I don't think having separate independent sketches for each
>> dimension would be much smaller than vectorizing the entire sketch, because
>> the internals of the existing sketch are already quite space efficient
>> leveraging compact arrays, etc.
>>
>> As a first step I would favor figuring out how to access the NumPy data
>> structure on the C++ side, having individual sketches for each
>> dimension, and doing the iterations updating the sketches in C++.   It also
>> has the advantage of leveraging code that exists and it would automatically
>> be able to leverage any improvements to the sketch code over time.  In
>> addition, it could be a prototype of how to integrate other sketches into
>> the NumPy ecosystem.
>>
>> A fully vectorized sketch would be a separate implementation and would
>> not be able to take advantage of these points.
>>
>> Lee.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]>
>> wrote:
>>
>> Hi Lee,
>>
>> I don't think there is a problem with the DataSketches library, just that
>> it doesn't support what I am trying to do -- looking in the documentation,
>> it only supports streams of ints or floats, and those situations work fine
>> for me.  Here's what I did:
>> - began with the KLL test .py file:
>> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
>> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-datasketches-cpp%2Fblob%2Fmaster%2Fpython%2Ftests%2Fkll_test.py&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168321118&sdata=w7zrcQlOomiwHBox4fKCJaF327ob2ZCZN6pynYNuMDU%3D&reserved=0>
>> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a
>> Numpy array of 10 identical values.
>> - ran the code
>>
>> This leads to the following error, as expected:
>> TypeError: update(): incompatible function arguments. The following
>> argument types are supported:
>>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>>
>> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
>> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>>
>> It's not coded to support Numpy arrays, therefore it complains.  What I
>> would ideally like to have happen in this scenario is it would treat each
>> element in the array as a separate stream.  Then, later when getting a
>> given quantile, it would give 10 values, one for each stream.  I don't see
>> an easy approach to implementing this on the Python side besides a very
>> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
>> looked into the codebase to see how I might modify things there to support
>> this functionality.
>>
>> Re: the streaming-quantiles code being easily modified, I believe the
>> only necessary changes would be changing the Compactor class to be a
>> subclass of numpy.ndarray, rather than list, and implementing methods for
>> the list-specific methods that are used, like .append().  Then, it isn't
>> necessary to loop over the streams since we can make use of Numpy's
>> broadcasting, which will handle the looping in its C++ code, as you
>> mentioned.  I'll work on this and see if it really is as straight-forward
>> as it seems.
>>
>> If you have any advice on how to use DataSketches for my problem, I'm
>> certainly open to that.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <[email protected]>
>> *Sent:* Wednesday, May 6, 2020 4:37 PM
>> *To:* Michael Himes <[email protected]>; [email protected]
>> <[email protected]>
>> *Cc:* Edo Liberty <[email protected]>; [email protected] <
>> [email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Thank you for considering the DataSketches library.   I am adding this
>> thread to our [email protected] so that our whole team can
>> contribute to finding a solution for you.
>>
>> WRT the error you experienced, please help us help you by sharing with us
>> what the exact error was.
>>
>> We are about to release a major upgrade to the DataSketches C++/Python
>> product in the next few weeks.  We have fixed a number of stability issues
>> and bugs, which may solve the problem.  Nonetheless, we want to work with
>> you to get your problem solved.
>>
>> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
>> have real-time systems today that generate and process over 1e9 sketches
>> every day.  Unfortunately our experience tells us that looping in Python
>> code will be 10 to 100 times slower than Java or C++.  This is because the
>> code would have to switch from Python to C++ for every vector element.
>>
>> By comparison, the streaming-quantiles code could be easily modified to
>> use Numpy arrays and operate on vectors.
>>
>>
>> I would like to understand more about what you have in mind that would be
>> "easily modified".
>>
>> NumPy achieves its speed performance by doing all of the matrix
>> operations in pre-compiled C++ code.  To achieve best performance, we would
>> want to read and loop through the NumPy data structure on the C++ side
>> leveraging the C++ DataSketches library directly.  I am not sure what would
>> be involved to actually accomplish that.
>>
>> But first we need to get your Python + NumPy code working correctly with
>> our library so we can find out what its actual performance is.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]>
>> wrote:
>>
>> Hi Edo, Lee,
>>
>> Thanks for the prompt response.  I looked at the datasketches library,
>> and while it seems to have a lot more features, it looks like it'll be a
>> lot more difficult to get it to work for my desired use case.
>>
>> My problem is that I need quantiles for each element of a vector (length
>> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
>> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
>> but it throws an error, so it doesn't seem like datasketches handles this
>> situation currently.
>>
>> To use datasketches, I think I would need to instantiate 1 object per
>> vector element, and I suspect this will slow things down considerably due
>> to iterating over the objects when each vector is processed.  By
>> comparison, the streaming-quantiles code could be easily modified to use
>> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
>> and found equivalent behavior, as expected.
>>
>> Do you have any recommendation(s) for this situation?  Are there known
>> limitations of the streaming-quantiles code that would cause issues for my
>> use case?  Are the other methods offered in datasketches 'better' than the
>> KLL implemented in streaming-quantiles?  I'm quite out of my area of
>> expertise, so I appreciate any advice you can offer, and I will of course
>> acknowledge it in the publication.
>>
>> Best,
>> Michael
>>
>> ------------------------------
>> *From:* Edo Liberty <[email protected]>
>> *Sent:* Tuesday, May 5, 2020 8:09 PM
>> *To:* Lee Rhodes <[email protected]>; Michael Himes <
>> [email protected]>
>> *Cc:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> +Lee
>>
>> Hi Michael, Thanks for reaching out.
>> While you can certainly do that, I recommend using the python-Binded
>> datasketches library. It will be more robust, faster, and bug free than my
>> code :)
>>
>> On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]>
>> wrote:
>>
>> Hi Edo,
>>
>> I'm currently working on a Python package for
>> machine-learning-accelerated exoplanet modeling.  It is free and open
>> source (see here if you're curious https://github.com/exosports/HOMER
>> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7Cfa5203b3c7234ec967f908d7fc3763db%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637255187168331111&sdata=3AeG70Y7qlS1qoSTZ7U%2FHDJ3N4yX7%2FUJkK2ospj0nmw%3D&reserved=0>),
>> and it's meant purely for reproducible academic research.
>>
>> I'm adding some new features to the software, and one of them requires
>> computing quantiles for a data set that cannot fit into memory.  After
>> searching around for different methods to do this, your KLL method seemed
>> to be a good option in terms of speed and space requirements.
>>
>> Rather than reinvent the wheel and code my own implementation of the
>> method from scratch, I was wondering if you'd be willing to allow me to use
>> your code?  I don't see a license, so I wanted to make sure you're okay
>> with this.  I could implement it as a submodule within my repo, or I could
>> only include the kll.py file and add some additional comments pointing to
>> your repo and such, whichever you prefer.
>>
>> Best,
>> Michael
>>
>> --
>> From my cell phone.
>>
>>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Reply via email to