So that this doesn't get lost amid the discussion: https://www.blosc.org/python-blosc2/python-blosc2.html
Blosc is on-the-fly compression, which is a more extreme way of making variable-sized integers. The compression works on small chunks that fit into CPU cache lines, so that it's random-access per chunk. The compression is lightweight enough that it can be faster to decompress, edit, and recompress a chunk than it is to copy from RAM, edit, and copy back to RAM. (The extra cost of compression is paid for by moving less data between RAM and CPU; that's why I say "can be," because it depends on the entropy of the data.) Since you have to copy data from RAM to CPU and back anyway, as a part of any operation on an array, this can be a net win.

What you're trying to do with variable-length integers is a kind of compression algorithm, an extremely lightweight one. That's why I think Blosc would fit your use case: it's doing the same kind of thing, but with years of development behind it. (Earlier, I recommended bcolz, which was a Python array based on Blosc, but I now see that it has been deprecated. However, the link above goes to the current version of the Python interface to Blosc, so I'd expect it to cover the same use cases.) A minimal usage sketch follows at the bottom of this message, below the quoted thread.

-- Jim

On Wed, Mar 13, 2024 at 4:45 PM Dom Grigonis <dom.grigo...@gmail.com> wrote:

> My array is growing in a manner of:
>
>     array[slice] += values
>
> so for now I will just clip values:
>
>     res = np.add(array[slice], values, dtype=np.int64)
>     array[slice] = res
>     mask = res > MAX_UINT16
>     array[slice][mask] = MAX_UINT16
>
> For this case, these large values do not have that much impact, and the
> extra operation overhead is acceptable.
>
> ---
>
> And I am adding a more involved project to my TODOs for the future.
>
> After all, it would be good to have an array which (preferably at as
> minimal a cost as possible) could handle anything you throw at it with
> near-optimal memory consumption and sensible precision handling, while
> keeping all the benefits of numpy.
>
> Time will tell whether that is achievable. If anyone has any good ideas
> regarding this, I am all ears.
>
> Much thanks to you all for the information and ideas.
> dgpb
>
> On 13 Mar 2024, at 21:00, Homeier, Derek <dhom...@gwdg.de> wrote:
>
> > On 13 Mar 2024, at 6:01 PM, Dom Grigonis <dom.grigo...@gmail.com> wrote:
> >
> > > So my array sizes in this case are 3e8. Thus, 32-bit ints would be
> > > needed, so it is not a solution for this case.
> > >
> > > Nevertheless, such a concept would still be worthwhile for cases where
> > > integers are, say, at most 256 bits (or unlimited), even if memory
> > > addresses or offsets are 64-bit. This would both:
> > > a) save memory if many of the values in the array are much smaller
> > >    than 256 bits
> > > b) provide a standard for dynamically unlimited-size values
> >
> > In principle one could encode the individual offsets in a smarter way,
> > using just the minimal number of bits required, but again that would
> > make random access impossible or very expensive – probably more or less
> > amounting to what smart compression algorithms are already doing.
> > Another approach might be to use the mask approach after all (or just
> > flag all your uint8 data valued 2**8 - 1 as overflows) and store the
> > correct (uint64 or whatever) values and their indices in a second array.
> > This may still not vectorise very efficiently with just numpy if your
> > typical operations are non-local.
> >
> > Derek
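For concreteness, here is a minimal sketch of what the python-blosc2 route mentioned above could look like for a large integer array with mostly small values. The specific API details (blosc2.asarray, NDArray slicing and assignment, the schunk.cratio attribute) are taken from my reading of the python-blosc2 documentation linked at the top and should be treated as assumptions rather than a tested recipe:

    import numpy as np
    import blosc2  # python-blosc2; see the link at the top of this message

    # A large integer array whose values are mostly small compresses well.
    data = np.random.poisson(3, size=10_000_000).astype(np.uint16)

    # Assumption: blosc2.asarray builds a chunked, compressed NDArray.
    carr = blosc2.asarray(data)

    # Chunks are decompressed on demand; slicing returns a plain numpy array.
    window = carr[1_000_000:1_000_100]

    # Assumption: NDArray supports item assignment, recompressing the
    # affected chunk(s) in place.
    carr[1_000_000:1_000_100] = window + 1

    # Assumed attribute: compression ratio achieved on this data.
    print(carr.schunk.cratio)

The point of the sketch is the access pattern: reads and writes go through small chunks that are decompressed and recompressed on the fly, which is the same trade-off a variable-length integer encoding is trying to make by hand.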
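A side note on Dom's clipping snippet quoted above: the same saturating update can be written without the boolean mask by clipping the widened sum before it is written back. This is only a sketch of an equivalent formulation; the uint16 dtype and the MAX_UINT16 constant are carried over from the snippet:

    import numpy as np

    MAX_UINT16 = np.iinfo(np.uint16).max  # 65535

    def saturating_add(array, slc, values):
        # Widen to int64 so the sum cannot overflow, then clip to the
        # representable range before narrowing back to uint16.
        res = np.add(array[slc], values, dtype=np.int64)
        np.clip(res, 0, MAX_UINT16, out=res)
        array[slc] = res  # the cast back to uint16 is now safe

    array = np.zeros(10, dtype=np.uint16)
    saturating_add(array, slice(2, 7), [70000, -5, 3, 3, 3])
    print(array)  # -> [0 0 65535 0 3 3 3 0 0 0]

This avoids writing the possibly-wrapped intermediate values into the uint16 array before fixing them up with a mask.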
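The overflow-sideband idea Derek describes above could look roughly like the sketch below: a narrow array holds a saturation sentinel, and a second structure maps the rare overflowing indices to their full-width values. All names here (OverflowArray, SENTINEL) are hypothetical illustrations, not an existing API:

    import numpy as np

    SENTINEL = np.iinfo(np.uint8).max  # 255 marks "true value stored elsewhere"

    class OverflowArray:
        """Hypothetical uint8 array with a uint64 side table for rare large values."""

        def __init__(self, size):
            self.small = np.zeros(size, dtype=np.uint8)
            self.large = {}  # index -> true uint64 value

        def __setitem__(self, idx, value):
            if value >= SENTINEL:
                self.small[idx] = SENTINEL
                self.large[idx] = np.uint64(value)
            else:
                self.small[idx] = value
                self.large.pop(idx, None)

        def __getitem__(self, idx):
            if self.small[idx] == SENTINEL:
                return self.large[idx]
            return np.uint64(self.small[idx])

    arr = OverflowArray(1_000)
    arr[3] = 17
    arr[4] = 1_000_000        # too big for uint8, goes to the side table
    print(arr[3], arr[4])     # -> 17 1000000

As Derek notes, element-wise access like this is easy, but vectorised operations over slices that touch the side table would need more care and may not map efficiently onto plain numpy if the operations are non-local.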