I think the key here is that for an m dimensional vector you'd need to
ensure that you have m independent offsets when doing each compaction. You
can't share bits (beyond the fact that roughly half of them will match
since there are only 2 options, of course) between dimensions for that.

  jon

On Wed, May 6, 2020 at 5:16 PM Lee Rhodes <[email protected]>
wrote:

> Michael,
>
> Allow me to back up for a moment to make sure I understand your problem.
>
> You have a large number of large vectors of the form *V_n = {x_i}:*  *n*
> vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th
> element, or equivalently, the *i*th dimension.
>
> Assumptions:
>
>    - All vectors, *V*, are of the same size *m.*
>    - All elements, *x_i*, are valid numbers of the same type. No missing
>    values, and if you are using *floats*, this means no *NaN*s.
>
> In aggregate, the *n* vectors represent *m* *independent* distributions
> of values.
>
> Your task is to be able to obtain *m* quantiles at rank *r* in a single
> query.
>
> ****
> To do this, using your idea, would require vectorization of the entire
> sketch and not just the compactors.  The inputs are vectors, the result of
> operations such as getQuantile(r), getQuantileUpperBound(r),
> getQuantileLowerBound(r), are also vectors.
>
> This sketch will be a large data structure, which leads to more questions
> ...
>
>    - Do you anticipate having many of these vectorized sketches operating
>    simultaneously?
>    - Is there any requirement to store and later retrieve this sketch?
>    - Or, the nearly equivalent question: Do you require merging of these
>    sketches (across clusters, for example)?  Which also means serialization
>    and deserialization.
>
> I am concerned that this vector-quantiles sketch would be limited in the
> sense that it may not be as widely applicable as it could be.
>
> Our experience with real data is that it is ugly with missing values, NaN,
> nulls, etc.  Which means we would not be able to vectorize the compactor.
> Each dimension *i* would need a separate independent compactor because
> the compaction times will vary depending on missing values or NaNs in the
> data.
>
> Spacewise, I don't think having separate independent sketches for each
> dimension would be much smaller than vectorizing the entire sketch, because
> the internals of the existing sketch are already quite space efficient
> leveraging compact arrays, etc.
>
> As a first step I would favor figuring out how to access the NumPy data
> structure on the C++ side, having individual sketches for each
> dimension, and doing the iterations updating the sketches in C++.   It also
> has the advantage of leveraging code that exists and it would automatically
> be able to leverage any improvements to the sketch code over time.  In
> addition, it could be a prototype of how to integrate other sketches into
> the NumPy ecosystem.
>
> A fully vectorized sketch would be a separate implementation and would not
> be able to take advantage of these points.
>
> Lee.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]>
> wrote:
>
>> Hi Lee,
>>
>> I don't think there is a problem with the DataSketches library, just that
>> it doesn't support what I am trying to do -- looking in the documentation,
>> it only supports streams of ints or floats, and those situations work fine
>> for me.  Here's what I did:
>> - began with the KLL test .py file:
>> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py
>> - replaced line 30 with kll.update(np.ones(10) * randn())  to have a
>> Numpy array of 10 identical values.
>> - ran the code
>>
>> This leads to the following error, as expected:
>> TypeError: update(): incompatible function arguments. The following
>> argument types are supported:
>>     1. (self: datasketches.kll_floats_sketch, item: float) -> None
>>
>> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>,
>> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424,
>>        -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424])
>>
>> It's not coded to support Numpy arrays, therefore it complains.  What I
>> would ideally like to have happen in this scenario is it would treat each
>> element in the array as a separate stream.  Then, later when getting a
>> given quantile, it would give 10 values, one for each stream.  I don't see
>> an easy approach to implementing this on the Python side besides a very
>> slow iterative approach, and admittedly my C++ is quite rusty so I haven't
>> looked into the codebase to see how I might modify things there to support
>> this functionality.
>>
>> Re: the streaming-quantiles code being easily modified, I believe the
>> only necessary changes would be changing the Compactor class to be a
>> subclass of numpy.ndarray, rather than list, and implementing methods for
>> the list-specific methods that are used, like .append().  Then, it isn't
>> necessary to loop over the streams since we can make use of Numpy's
>> broadcasting, which will handle the looping in its C++ code, as you
>> mentioned.  I'll work on this and see if it really is as straight-forward
>> as it seems.
>>
>> If you have any advice on how to use DataSketches for my problem, I'm
>> certainly open to that.
>>
>> Thanks,
>> Michael
>> ------------------------------
>> *From:* Lee Rhodes <[email protected]>
>> *Sent:* Wednesday, May 6, 2020 4:37 PM
>> *To:* Michael Himes <[email protected]>; [email protected]
>> <[email protected]>
>> *Cc:* Edo Liberty <[email protected]>; [email protected] <
>> [email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> Michael,
>>
>> Thank you for considering the DataSketches library.   I am adding this
>> thread to our [email protected] so that our whole team can
>> contribute to finding a solution for you.
>>
>> WRT the error you experienced, please help us help you by sharing with us
>> what the exact error was.
>>
>> We are about to release a major upgrade to the DataSketches C++/Python
>> product in the next few weeks.  We have fixed a number of stability issues
>> and bugs, which may solve the problem.  Nonetheless, we want to work with
>> you to get your problem solved.
>>
>> Updating 1e5 sketches in a system is not a problem in Java or C++.   We
>> have real-time systems today that generate and process over 1e9 sketches
>> every day.  Unfortunately our experience tells us that looping in Python
>> code will be 10 to 100 times slower than Java or C++.  This is because the
>> code would have to switch from Python to C++ for every vector element.
>>
>> By comparison, the streaming-quantiles code could be easily modified to
>> use Numpy arrays and operate on vectors.
>>
>>
>> I would like to understand more about what you have in mind that would be
>> "easily modified".
>>
>> NumPy achieves its speed performance by doing all of the matrix
>> operations in pre-compiled C++ code.  To achieve best performance, we would
>> want to read and loop through the NumPy data structure on the C++ side
>> leveraging the C++ DataSketches library directly.  I am not sure what would
>> be involved to actually accomplish that.
>>
>> But first we need to get your Python + NumPy code working correctly with
>> our library so we can find out what its actual performance is.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>
>>
>>
>> On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]>
>> wrote:
>>
>> Hi Edo, Lee,
>>
>> Thanks for the prompt response.  I looked at the datasketches library,
>> and while it seems to have a lot more features, it looks like it'll be a
>> lot more difficult to get it to work for my desired use case.
>>
>> My problem is that I need quantiles for each element of a vector (length
>> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
>> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
>> but it throws an error, so it doesn't seem like datasketches handles this
>> situation currently.
>>
>> To use datasketches, I think I would need to instantiate 1 object per
>> vector element, and I suspect this will slow things down considerably due
>> to iterating over the objects when each vector is processed.  By
>> comparison, the streaming-quantiles code could be easily modified to use
>> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
>> and found equivalent behavior, as expected.
>>
>> Do you have any recommendation(s) for this situation?  Are there known
>> limitations of the streaming-quantiles code that would cause issues for my
>> use case?  Are the other methods offered in datasketches 'better' than the
>> KLL implemented in streaming-quantiles?  I'm quite out of my area of
>> expertise, so I appreciate any advice you can offer, and I will of course
>> acknowledge it in the publication.
>>
>> Best,
>> Michael
>>
>> ------------------------------
>> *From:* Edo Liberty <[email protected]>
>> *Sent:* Tuesday, May 5, 2020 8:09 PM
>> *To:* Lee Rhodes <[email protected]>; Michael Himes <
>> [email protected]>
>> *Cc:* [email protected] <[email protected]>
>> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
>> open-source academic software
>>
>> +Lee
>>
>> Hi Michael, Thanks for reaching out.
>> While you can certainly do that, I recommend using the python-Binded
>> datasketches library. It will be more robust, faster, and bug free than my
>> code :)
>>
>> On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]>
>> wrote:
>>
>> Hi Edo,
>>
>> I'm currently working on a Python package for
>> machine-learning-accelerated exoplanet modeling.  It is free and open
>> source (see here if you're curious https://github.com/exosports/HOMER
>> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C20300349fe264123e49908d7f1fd4cbe%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637243942528081567&sdata=pl4piN5odQiUO2SkMq%2FLRL0UqWOrqkimd0c12RpdpY4%3D&reserved=0>),
>> and it's meant purely for reproducible academic research.
>>
>> I'm adding some new features to the software, and one of them requires
>> computing quantiles for a data set that cannot fit into memory.  After
>> searching around for different methods to do this, your KLL method seemed
>> to be a good option in terms of speed and space requirements.
>>
>> Rather than reinvent the wheel and code my own implementation of the
>> method from scratch, I was wondering if you'd be willing to allow me to use
>> your code?  I don't see a license, so I wanted to make sure you're okay
>> with this.  I could implement it as a submodule within my repo, or I could
>> only include the kll.py file and add some additional comments pointing to
>> your repo and such, whichever you prefer.
>>
>> Best,
>> Michael
>>
>>

Reply via email to