Thanks for reiterating; this looks promising!

> On 13 Mar 2024, at 23:22, Jim Pivarski <jpivar...@gmail.com> wrote:
> 
> So that this doesn't get lost amid the discussion: 
> https://www.blosc.org/python-blosc2/python-blosc2.html
> 
> Blosc is on-the-fly compression, which is a more extreme way of making 
> variable-sized integers. The compression is in small chunks that fit into CPU 
> cachelines, such that it's random access per chunk. The compression is 
> lightweight enough that it can be faster to decompress, edit, and recompress 
> a chunk than it is to copy from RAM, edit, and copy back to RAM. (The extra 
> cost of compression is paid for by moving less data between RAM and CPU. 
> That's why I say "can be," because it depends on the entropy of the data.) 
> Since you have to copy data from RAM to CPU and back anyway, as a part of any 
> operation on an array, this can be a net win.
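The decompress-edit-recompress idea described above can be sketched with nothing but stdlib zlib. This is only a toy stand-in for Blosc (the chunk size and compression level here are arbitrary, and real Blosc is far faster and cache-aware), but it shows why random access stays cheap: only the one chunk holding an element is ever decompressed.

```python
import zlib
from array import array

CHUNK = 1024  # elements per chunk; Blosc sizes chunks to fit CPU caches

def compress_chunks(values, chunk=CHUNK):
    """Split a list of ints into chunks and compress each one independently."""
    out = []
    for i in range(0, len(values), chunk):
        raw = array("q", values[i:i + chunk]).tobytes()  # int64 elements
        out.append(zlib.compress(raw, level=1))          # lightweight level
    return out

def read_element(chunks, index, chunk=CHUNK):
    """Random access: decompress only the chunk holding `index`."""
    raw = zlib.decompress(chunks[index // chunk])
    return array("q", raw)[index % chunk]

def write_element(chunks, index, value, chunk=CHUNK):
    """Edit: decompress one chunk, modify it, recompress just that chunk."""
    ci = index // chunk
    buf = array("q", zlib.decompress(chunks[ci]))
    buf[index % chunk] = value
    chunks[ci] = zlib.compress(buf.tobytes(), level=1)

data = [7] * 5000            # low-entropy data compresses very well
chunks = compress_chunks(data)
write_element(chunks, 4242, 99)
print(read_element(chunks, 4242))   # 99
print(read_element(chunks, 0))      # 7
```

For the repeated-value data above, the compressed chunks occupy a small fraction of the 8 bytes per element the raw int64 buffer would need, which is the "moving less data between RAM and CPU" payoff Jim describes.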
> 
> What you're trying to do with variable-length integers is a kind of 
> compression algorithm, an extremely lightweight one. That's why I think that 
> Blosc would fit your use-case, because it's doing the same kind of thing, but 
> with years of development behind it.
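To make the "variable-length integers are a lightweight compression" point concrete, here is a minimal LEB128-style varint codec (a standard scheme, not anything the thread proposes verbatim): 7 payload bits per byte, high bit as a continuation flag. Small values take one byte instead of eight, but elements lose fixed offsets, which is exactly the random-access cost discussed below.

```python
def encode_varint(n):
    """LEB128-style encoding for a non-negative int."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)

def decode_varint(data, pos=0):
    """Decode one varint starting at `pos`; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos

# A stream of mixed-magnitude values packs into 1-3 bytes each:
buf = b"".join(encode_varint(v) for v in [3, 300, 70000])
```

Finding element `i` in `buf` requires decoding (or indexing) everything before it, which is why Blosc-style chunked compression, with per-chunk random access, covers the same ground more practically.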
> 
> (Earlier, I recommended bcolz, which was a Python array based on Blosc, but 
> now I see that it has been deprecated. However, the link above goes to the 
> current version of the Python interface to Blosc, so I'd expect it to cover 
> the same use-cases.)
> 
> -- Jim
> 
> On Wed, Mar 13, 2024 at 4:45 PM Dom Grigonis <dom.grigo...@gmail.com> wrote:
> My array grows like this:
> array[slice] += values
> 
> so for now I will just clip the values before assigning back:
> res = np.add(array[slice], values, dtype=np.int64)
> mask = res > MAX_UINT16
> res[mask] = MAX_UINT16
> array[slice] = res
> 
> For this case, these large values do not have much impact, and the extra 
> operation overhead is acceptable.
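The clip-into-uint16 pattern above can be packaged as a small helper (a sketch along the lines of the snippet, with the same `MAX_UINT16` constant): widen to int64 so the sum cannot overflow, clip, then let the assignment narrow back to uint16.

```python
import numpy as np

MAX_UINT16 = np.iinfo(np.uint16).max  # 65535

def saturating_add(array, sl, values):
    # Widen to int64 so the sum cannot overflow ...
    res = np.add(array[sl], values, dtype=np.int64)
    # ... clip at the uint16 ceiling, then narrow back on assignment.
    array[sl] = np.minimum(res, MAX_UINT16)

a = np.array([65000, 10, 20], dtype=np.uint16)
saturating_add(a, slice(0, 3), np.array([1000, 5, 5]))
# a is now [65535, 15, 25]: the first element saturated instead of wrapping
```

Clipping before the assignment also means the uint16 buffer never briefly holds a wrapped value, and the pattern keeps working even when the index expression is fancy indexing rather than a basic slice.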
> 
> ---
> 
> And I am adding a more involved project to my TODO list for the future.
> 
> After all, it would be good to have an array which, at as low a cost as 
> possible, could handle anything you throw at it with near-optimal memory 
> consumption and sensible precision handling, while keeping all the 
> benefits of numpy.
> 
> Time will tell if that is achievable. If anyone has any good ideas 
> regarding this, I am all ears.
> 
> Many thanks to you all for the information and ideas.
> dgpb
> 
>> On 13 Mar 2024, at 21:00, Homeier, Derek <dhom...@gwdg.de> wrote:
>> 
>> On 13 Mar 2024, at 6:01 PM, Dom Grigonis <dom.grigo...@gmail.com> wrote:
>>> 
>>> So my array sizes in this case are 3e8 elements, so 32-bit ints would be 
>>> needed. It is not a solution for this case.
>>> 
>>> Nevertheless, such a concept would still be worthwhile for cases where 
>>> integers are at most, say, 256 bits (or unlimited), even if memory 
>>> addresses or offsets are 64-bit. This would both:
>>> a) save memory if many of the values in the array are much smaller than 256 bits
>>> b) provide a standard for dynamically unlimited-size values
>> 
>> In principle one could encode individual offsets in a smarter way, using 
>> just the minimal number of bits required, but again that would make random 
>> access impossible or very expensive – probably more or less amounting to 
>> what smart compression algorithms are already doing.
>> Another approach might be to use the mask approach after all (or just 
>> reserve the maximum uint8 value, 2**8 - 1, as an overflow flag) and store 
>> the correct (uint64 or whatever) values and their indices in a second 
>> array. It may still not vectorise very efficiently with just numpy if your 
>> typical operations are non-local.
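The flag-and-side-array idea can be sketched in a few lines. This is a pure-Python illustration (class and names are invented for the example; a real version would keep the small values in a numpy uint8 array and the overflow indices in a second numpy array so masked operations stay vectorised):

```python
SENTINEL = 2**8 - 1  # 255: reserved to mean "look in the overflow table"

class OverflowArray:
    """uint8-style storage with a side table for values >= SENTINEL."""

    def __init__(self, values):
        self.small = bytearray()  # one byte per element
        self.big = {}             # index -> true value, for overflows
        for i, v in enumerate(values):
            if v >= SENTINEL:
                self.small.append(SENTINEL)  # flag the slot
                self.big[i] = v              # keep the real value aside
            else:
                self.small.append(v)

    def __getitem__(self, i):
        v = self.small[i]
        return self.big[i] if v == SENTINEL else v

arr = OverflowArray([1, 2, 100000, 254])
# arr uses ~1 byte per element plus one side entry for the single overflow
```

As long as overflows are rare, memory stays near one byte per element; the cost, as noted above, is that every read must check for the sentinel, which is awkward for non-local numpy operations.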
>> 
>> Derek
>> 
>> _______________________________________________
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: dom.grigo...@gmail.com
> 
