Michael,

Thank you for considering the DataSketches library.   I am adding this
thread to our [email protected] so that our whole team can
contribute to finding a solution for you.

Regarding the error you experienced, please help us help you by sharing
the exact error message.

We are about to release a major upgrade to the DataSketches C++/Python
product in the next few weeks.  We have fixed a number of stability issues
and bugs, which may solve the problem.  Nonetheless, we want to work with
you to get your problem solved.

Updating 1e5 sketches in a system is not a problem in Java or C++.   We
have real-time systems today that generate and process over 1e9 sketches
every day.  Unfortunately our experience tells us that looping in Python
code will be 10 to 100 times slower than Java or C++.  This is because the
code would have to switch from Python to C++ for every vector element.
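
To make the per-element loop concrete, here is a minimal sketch of the
"one sketch object per vector element" pattern.  StandInSketch is a
hypothetical stand-in written for this illustration, not the DataSketches
API: it stores every value exactly and is not memory-bounded like KLL.
The sizes are deliberately tiny.  The point is only the shape of the
inner loop, where each update() is a separate Python-level call that,
with a C++-backed sketch, would cross the Python/C++ boundary.

```python
from bisect import insort


class StandInSketch:
    """Hypothetical stand-in for one per-element quantile sketch.

    Keeps every value in sorted order, so it is exact and NOT
    memory-bounded like KLL; it only mimics an update/get_quantile shape.
    """

    def __init__(self):
        self._values = []

    def update(self, x):
        insort(self._values, x)

    def get_quantile(self, rank):
        # Nearest-rank lookup in the sorted values.
        idx = min(int(rank * len(self._values)), len(self._values) - 1)
        return self._values[idx]


def update_all(sketches, vector):
    # One Python-level call per vector element; this is the loop that
    # becomes expensive when the vector has 1e4 to 1e5 elements.
    for sketch, x in zip(sketches, vector):
        sketch.update(x)


vector_len = 5     # the thread discusses 1e4 to 1e5
num_vectors = 101  # the thread discusses 1e6 to 1e8

sketches = [StandInSketch() for _ in range(vector_len)]
for n in range(num_vectors):
    # Deterministic toy stream: element i of vector n holds n + i.
    update_all(sketches, [float(n + i) for i in range(vector_len)])

medians = [s.get_quantile(0.5) for s in sketches]
print(medians)
```

The total number of update() calls is vector_len * num_vectors, so at the
sizes discussed in this thread the per-call overhead dominates.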

> By comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.


I would like to understand more about which modifications you have in mind
when you say "easily modified".

NumPy achieves its speed by performing array operations in precompiled C
code.  To achieve the best performance, we would want to read and loop
through the NumPy data structure on the C++ side, leveraging the C++
DataSketches library directly.  I am not sure what would be involved in
actually accomplishing that.

But first we need to get your Python + NumPy code working correctly with
our library so we can find out what its actual performance is.

Cheers,

Lee.





On Wed, May 6, 2020 at 12:10 PM Michael Himes <[email protected]>
wrote:

> Hi Edo, Lee,
>
> Thanks for the prompt response.  I looked at the datasketches library, and
> while it seems to have a lot more features, it looks like it'll be a lot
> more difficult to get it to work for my desired use case.
>
> My problem is that I need quantiles for each element of a vector (length
> on the order of 1e4 -- 1e5), for some finite stream of vectors (on the
> order of 1e6 -- 1e8).  I tried using datasketches's KLL with Numpy arrays,
> but it throws an error, so it doesn't seem like datasketches handles this
> situation currently.
>
> To use datasketches, I think I would need to instantiate 1 object per
> vector element, and I suspect this will slow things down considerably due
> to iterating over the objects when each vector is processed.  By
> comparison, the streaming-quantiles code could be easily modified to use
> Numpy arrays and operate on vectors.  I ran a few unit tests on both codes
> and found equivalent behavior, as expected.
>
> Do you have any recommendation(s) for this situation?  Are there known
> limitations of the streaming-quantiles code that would cause issues for my
> use case?  Are the other methods offered in datasketches 'better' than the
> KLL implemented in streaming-quantiles?  I'm quite out of my area of
> expertise, so I appreciate any advice you can offer, and I will of course
> acknowledge it in the publication.
>
> Best,
> Michael
>
> ------------------------------
> *From:* Edo Liberty <[email protected]>
> *Sent:* Tuesday, May 5, 2020 8:09 PM
> *To:* Lee Rhodes <[email protected]>; Michael Himes <
> [email protected]>
> *Cc:* [email protected] <[email protected]>
> *Subject:* Re: Permission to use KLL streaming-quantiles code in free
> open-source academic software
>
> +Lee
>
> Hi Michael, Thanks for reaching out.
> While you can certainly do that, I recommend using the Python-bound
> datasketches library. It will be more robust, faster, and bug free than my
> code :)
>
> On Tue, May 5, 2020 at 14:11 Michael Himes <[email protected]> wrote:
>
> Hi Edo,
>
> I'm currently working on a Python package for machine-learning-accelerated
> exoplanet modeling.  It is free and open source (see here if you're curious
> https://github.com/exosports/HOMER),
> and it's meant purely for reproducible academic research.
>
> I'm adding some new features to the software, and one of them requires
> computing quantiles for a data set that cannot fit into memory.  After
> searching around for different methods to do this, your KLL method seemed
> to be a good option in terms of speed and space requirements.
>
> Rather than reinvent the wheel and code my own implementation of the
> method from scratch, I was wondering if you'd be willing to allow me to use
> your code?  I don't see a license, so I wanted to make sure you're okay
> with this.  I could implement it as a submodule within my repo, or I could
> only include the kll.py file and add some additional comments pointing to
> your repo and such, whichever you prefer.
>
> Best,
> Michael
>
>
