I think the key here is that for an m dimensional vector you'd need to ensure that you have m independent offsets when doing each compaction. You can't share bits (beyond the fact that roughly half of them will match since there are only 2 options, of course) between dimensions for that.
jon On Wed, May 6, 2020 at 5:16 PM Lee Rhodes <[email protected]> wrote: > Michael, > > Allow me to back up for a moment to make sure I understand your problem. > > You have a large number of large vectors of the form *V_n = {x_i}:* *n* > vectors of size *m, *where *x* is a *number* and *x_i* is the *i*th > element, or equivalently, the *i*th dimension. > > Assumptions: > > - All vectors, *V*, are of the same size *m.* > - All elements, *x_i*, are valid numbers of the same type. No missing > values, and if you are using *floats*, this means no *NaN*s. > > In aggregate, the *n* vectors represent *m* *independent* distributions > of values. > > Your task is to be able to obtain *m* quantiles at rank *r* in a single > query. > > **** > To do this, using your idea, would require vectorization of the entire > sketch and not just the compactors. The inputs are vectors, the result of > operations such as getQuantile(r), getQuantileUpperBound(r), > getQuantileLowerBound(r), are also vectors. > > This sketch will be a large data structure, which leads to more questions > ... > > - Do you anticipate having many of these vectorized sketches operating > simultaneously? > - Is there any requirement to store and later retrieve this sketch? > - Or, the nearly equivalent question: Do you require merging of these > sketches (across clusters, for example)? Which also means serialization > and deserialization. > > I am concerned that this vector-quantiles sketch would be limited in the > sense that it may not be as widely applicable as it could be. > > Our experience with real data is that it is ugly with missing values, NaN, > nulls, etc. Which means we would not be able to vectorize the compactor. > Each dimension *i* would need a separate independent compactor because > the compaction times will vary depending on missing values or NaNs in the > data. > > Spacewise, I don't think having separate independent sketches for each > dimension would be much smaller than vectorizing the entire sketch, because > the internals of the existing sketch are already quite space efficient > leveraging compact arrays, etc. > > As a first step I would favor figuring out how to access the NumPy data > structure on the C++ side, having individual sketches for each > dimension, and doing the iterations updating the sketches in C++. It also > has the advantage of leveraging code that exists and it would automatically > be able to leverage any improvements to the sketch code over time. In > addition, it could be a prototype of how to integrate other sketches into > the NumPy ecosystem. > > A fully vectorized sketch would be a separate implementation and would not > be able to take advantage of these points. > > Lee. > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 6, 2020 at 2:47 PM Michael Himes <[email protected]> > wrote: > >> Hi Lee, >> >> I don't think there is a problem with the DataSketches library, just that >> it doesn't support what I am trying to do -- looking in the documentation, >> it only supports streams of ints or floats, and those situations work fine >> for me. Here's what I did: >> - began with the KLL test .py file: >> https://github.com/apache/incubator-datasketches-cpp/blob/master/python/tests/kll_test.py >> - replaced line 30 with kll.update(np.ones(10) * randn()) to have a >> Numpy array of 10 identical values. >> - ran the code >> >> This leads to the following error, as expected: >> TypeError: update(): incompatible function arguments. The following >> argument types are supported: >> 1. (self: datasketches.kll_floats_sketch, item: float) -> None >> >> Invoked with: <datasketches.kll_floats_sketch object at 0x7f1e128989d0>, >> array([-1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424, >> -1.17528424, -1.17528424, -1.17528424, -1.17528424, -1.17528424]) >> >> It's not coded to support Numpy arrays, therefore it complains. What I >> would ideally like to have happen in this scenario is it would treat each >> element in the array as a separate stream. Then, later when getting a >> given quantile, it would give 10 values, one for each stream. I don't see >> an easy approach to implementing this on the Python side besides a very >> slow iterative approach, and admittedly my C++ is quite rusty so I haven't >> looked into the codebase to see how I might modify things there to support >> this functionality. >> >> Re: the streaming-quantiles code being easily modified, I believe the >> only necessary changes would be changing the Compactor class to be a >> subclass of numpy.ndarray, rather than list, and implementing methods for >> the list-specific methods that are used, like .append(). Then, it isn't >> necessary to loop over the streams since we can make use of Numpy's >> broadcasting, which will handle the looping in its C++ code, as you >> mentioned. I'll work on this and see if it really is as straight-forward >> as it seems. >> >> If you have any advice on how to use DataSketches for my problem, I'm >> certainly open to that. >> >> Thanks, >> Michael >> ------------------------------ >> *From:* Lee Rhodes <[email protected]> >> *Sent:* Wednesday, May 6, 2020 4:37 PM >> *To:* Michael Himes <[email protected]>; [email protected] >> <[email protected]> >> *Cc:* Edo Liberty <[email protected]>; [email protected] < >> [email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> Michael, >> >> Thank you for considering the DataSketches library. I am adding this >> thread to our [email protected] so that our whole team can >> contribute to finding a solution for you. >> >> WRT the error you experienced, please help us help you by sharing with us >> what the exact error was. >> >> We are about to release a major upgrade to the DataSketches C++/Python >> product in the next few weeks. We have fixed a number of stability issues >> and bugs, which may solve the problem. Nonetheless, we want to work with >> you to get your problem solved. >> >> Updating 1e5 sketches in a system is not a problem in Java or C++. We >> have real-time systems today that generate and process over 1e9 sketches >> every day. Unfortunately our experience tells us that looping in Python >> code will be 10 to 100 times slower than Java or C++. This is because the >> code would have to switch from Python to C++ for every vector element. >> >> By comparison, the streaming-quantiles code could be easily modified to >> use Numpy arrays and operate on vectors. >> >> >> I would like to understand more about what you have in mind that would be >> "easily modified". >> >> NumPy achieves its speed performance by doing all of the matrix >> operations in pre-compiled C++ code. To achieve best performance, we would >> want to read and loop through the NumPy data structure on the C++ side >> leveraging the C++ DataSketches library directly. I am not sure what would >> be involved to actually accomplish that. >> >> But first we need to get your Python + NumPy code working correctly with >> our library so we can find out what its actual performance is. >> >> Cheers, >> >> Lee. >> >> >> >> >> >> On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]> >> wrote: >> >> Hi Edo, Lee, >> >> Thanks for the prompt response. I looked at the datasketches library, >> and while it seems to have a lot more features, it looks like it'll be a >> lot more difficult to get it to work for my desired use case. >> >> My problem is that I need quantiles for each element of a vector (length >> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the >> order of 1e6 -- 1e8). I tried using datasketches's KLL with Numpy arrays, >> but it throws an error, so it doesn't seem like datasketches handles this >> situation currently. >> >> To use datasketches, I think I would need to instantiate 1 object per >> vector element, and I suspect this will slow things down considerably due >> to iterating over the objects when each vector is processed. By >> comparison, the streaming-quantiles code could be easily modified to use >> Numpy arrays and operate on vectors. I ran a few unit tests on both codes >> and found equivalent behavior, as expected. >> >> Do you have any recommendation(s) for this situation? Are there known >> limitations of the streaming-quantiles code that would cause issues for my >> use case? Are the other methods offered in datasketches 'better' than the >> KLL implemented in streaming-quantiles? I'm quite out of my area of >> expertise, so I appreciate any advice you can offer, and I will of course >> acknowledge it in the publication. >> >> Best, >> Michael >> >> ------------------------------ >> *From:* Edo Liberty <[email protected]> >> *Sent:* Tuesday, May 5, 2020 8:09 PM >> *To:* Lee Rhodes <[email protected]>; Michael Himes < >> [email protected]> >> *Cc:* [email protected] <[email protected]> >> *Subject:* Re: Permission to use KLL streaming-quantiles code in free >> open-source academic software >> >> +Lee >> >> Hi Michael, Thanks for reaching out. >> While you can certainly do that, I recommend using the python-Binded >> datasketches library. It will be more robust, faster, and bug free than my >> code :) >> >> On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]> >> wrote: >> >> Hi Edo, >> >> I'm currently working on a Python package for >> machine-learning-accelerated exoplanet modeling. It is free and open >> source (see here if you're curious https://github.com/exosports/HOMER >> <https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fexosports%2FHOMER&data=02%7C01%7Cmhimes%40knights.ucf.edu%7C20300349fe264123e49908d7f1fd4cbe%7C5b16e18278b3412c919668342689eeb7%7C0%7C0%7C637243942528081567&sdata=pl4piN5odQiUO2SkMq%2FLRL0UqWOrqkimd0c12RpdpY4%3D&reserved=0>), >> and it's meant purely for reproducible academic research. >> >> I'm adding some new features to the software, and one of them requires >> computing quantiles for a data set that cannot fit into memory. After >> searching around for different methods to do this, your KLL method seemed >> to be a good option in terms of speed and space requirements. >> >> Rather than reinvent the wheel and code my own implementation of the >> method from scratch, I was wondering if you'd be willing to allow me to use >> your code? I don't see a license, so I wanted to make sure you're okay >> with this. I could implement it as a submodule within my repo, or I could >> only include the kll.py file and add some additional comments pointing to >> your repo and such, whichever you prefer. >> >> Best, >> Michael >> >>
