Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Lee Rhodes Wed, 06 May 2020 20:00:24 -0700

Michael,

One of my colleagues, Jon Malkin, pointed out that the vector-KLL will not
work for another reason and that is for each dimension, choosing whether to
delete the odd or even values in the compactor must be random and
independent of the other dimensions.  Otherwise you might get unwanted
correlation effects between the dimensions.


This is another argument that you should have independent compactors for
each dimension.  So you might as well stick with individual sketches for
each dimension.

Lee.

On Wed, May 6, 2020 at 4:39 PM Lee Rhodes <[email protected]> wrote:

> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]>
> wrote:
>
>> Hi Lee,
>>
>> I don't think there is a problem with the DataSketches library, just that
>> it doesn't support what I am trying to do -- looking in the documentation,
>> it only supports streams of ints or floats, and those situations work fine
>> for me.  Here's what I did:
>> - began with the KLL test .py file:
>> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
>> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a
>> Numpy array of 10 identical values.
>> - ran the code
>>
>> This leads to the following error, as expected:
>> TypeError: update(): incompatible function arguments. The following
>> argument types are supported:
>>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>>
>> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
>> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>>
>> It's not coded to support Numpy arrays, therefore it complains.  What I
>> would ideally like to have happen in this scenario is it would treat each
>> element in the array as a separate stream.  Then, later when getting a
>> given quantile, it would give 10 values, one for each stream.  I don't see
>> an easy approach to implementing this on the Python side besides a very
>> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
>> looked into the codebase to see how I might modify things there to support
>> this functionality.
>>
>> Re: the streaming-quantiles code being easily modified, I believe the
>> only necessary changes would be changing the Compactor class to be a
>> subclass of numpy.ndarray, rather than list, and implementing methods for
>> the list-specific methods that are used, like .append().  Then, it isn't
>> necessary to loop over the streams since we can make use of Numpy's
>> broadcasting, which will handle the looping in its C++ code, as you
>> mentioned.  I'll work on this and see if it really is as straight-forward
>> as it seems.
>>
>> If you have any advice on how to use DataSketches for my problem, I'm
>> certainly open to that.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <[email protected]>
>> *Sent:* Wednesday, May 6, 2020 4:37 PM
>> *To:* Michael Himes <[email protected]>; [email protected]
>> <[email protected]>
>> *Cc:* Edo Liberty <[email protected]>; [email protected] <
>> [email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Thank you for considering the DataSketches library.   I am adding this
>> thread to our [email protected] so that our whole team can
>> contribute to finding a solution for you.
>>
>> WRT the error you experienced, please help us help you by sharing with us
>> what the exact error was.
>>
>> We are about to release a major upgrade to the DataSketches C++/Python
>> product in the next few weeks.  We have fixed a number of stability issues
>> and bugs, which may solve the problem.  Nonetheless, we want to work with
>> you to get your problem solved.
>>
>> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
>> have real-time systems today that generate and process over 1e9 sketches
>> every day.  Unfortunately our experience tells us that looping in Python
>> code will be 10 to 100 times slower than Java or C++.  This is because the
>> code would have to switch from Python to C++ for every vector element.
>>
>> By comparison, the streaming-quantiles code could be easily modified to
>> use Numpy arrays and operate on vectors.
>>
>>
>> I would like to understand more about what you have in mind that would be
>> "easily modified".
>>
>> NumPy achieves its speed performance by doing all of the matrix
>> operations in pre-compiled C++ code.  To achieve best performance, we would
>> want to read and loop through the NumPy data structure on the C++ side
>> leveraging the C++ DataSketches library directly.  I am not sure what would
>> be involved to actually accomplish that.
>>
>> But first we need to get your Python + NumPy code working correctly with
>> our library so we can find out what its actual performance is.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]>
>> wrote:
>>
>> Hi Edo, Lee,
>>
>> Thanks for the prompt response.  I looked at the datasketches library,
>> and while it seems to have a lot more features, it looks like it'll be a
>> lot more difficult to get it to work for my desired use case.
>>
>> My problem is that I need quantiles for each element of a vector (length
>> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
>> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
>> but it throws an error, so it doesn't seem like datasketches handles this
>> situation currently.
>>
>> To use datasketches, I think I would need to instantiate 1 object per
>> vector element, and I suspect this will slow things down considerably due
>> to iterating over the objects when each vector is processed.  By
>> comparison, the streaming-quantiles code could be easily modified to use
>> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
>> and found equivalent behavior, as expected.
>>
>> Do you have any recommendation(s) for this situation?  Are there known
>> limitations of the streaming-quantiles code that would cause issues for my
>> use case?  Are the other methods offered in datasketches 'better' than the
>> KLL implemented in streaming-quantiles?  I'm quite out of my area of
>> expertise, so I appreciate any advice you can offer, and I will of course
>> acknowledge it in the publication.
>>
>> Best,
>> Michael
>>
>> ------------------------------
>> *From:* Edo Liberty <[email protected]>
>> *Sent:* Tuesday, May 5, 2020 8:09 PM
>> *To:* Lee Rhodes <[email protected]>; Michael Himes <
>> [email protected]>
>> *Cc:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> +Lee
>>
>> Hi Michael, Thanks for reaching out.
>> While you can certainly do that, I recommend using the python-Binded
>> datasketches library. It will be more robust, faster, and bug free than my
>> code :)
>>
>> On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]>
>> wrote:
>>
>> Hi Edo,
>>
>> I'm currently working on a Python package for
>> machine-learning-accelerated exoplanet modeling.  It is free and open
>> source (see here if you're curious https://github.com/exosports/HOMER
>> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C20300349fe264123e49908d7f1fd4cbe%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637243942528081567&sdata=pl4piN5odQiUO2SkMq%2FLRL0UqWOrqkimd0c12RpdpY4%3D&reserved=0>),
>> and it's meant purely for reproducible academic research.
>>
>> I'm adding some new features to the software, and one of them requires
>> computing quantiles for a data set that cannot fit into memory.  After
>> searching around for different methods to do this, your KLL method seemed
>> to be a good option in terms of speed and space requirements.
>>
>> Rather than reinvent the wheel and code my own implementation of the
>> method from scratch, I was wondering if you'd be willing to allow me to use
>> your code?  I don't see a license, so I wanted to make sure you're okay
>> with this.  I could implement it as a submodule within my repo, or I could
>> only include the kll.py file and add some additional comments pointing to
>> your repo and such, whichever you prefer.
>>
>> Best,
>> Michael
>>
>>

Re: Permission to use KLL streaming-quantiles code in free open-source academic software

Reply via email to