So that this doesn't get lost amid the discussion: https://www.blosc.org/python-blosc2/python-blosc2.html
Blosc is on-the-fly compression, which is a more extreme way of making variable-sized integers. The compression works on small chunks that fit into CPU cache lines, so that it's random-access per chunk. The compression is lightweight enough that it can be faster to decompress, edit, and recompress a chunk than it is to copy from RAM, edit, and copy back to RAM. (The extra cost of compression is paid for by moving less data between RAM and CPU; that's why I say "can be," because it depends on the entropy of the data.) Since you have to copy data from RAM to CPU and back anyway, as a part of any operation on an array, this can be a net win.

What you're trying to do with variable-length integers is a kind of compression algorithm, an extremely lightweight one. That's why I think Blosc would fit your use case: it's doing the same kind of thing, but with years of development behind it. (Earlier, I recommended bcolz, which was a Python array based on Blosc, but I now see that it has been deprecated. However, the link above goes to the current version of the Python interface to Blosc, so I'd expect it to cover the same use cases.) A minimal usage sketch follows at the bottom of this message, below the quoted thread.

-- Jim

On Wed, Mar 13, 2024 at 4:45 PM Dom Grigonis <dom.grigo...@gmail.com> wrote:

> My array is growing in a manner of:
>
>     array[slice] += values
>
> so for now I will just clip values:
>
>     res = np.add(array[slice], values, dtype=np.int64)
>     array[slice] = res
>     mask = res > MAX_UINT16
>     array[slice][mask] = MAX_UINT16
>
> For this case, these large values do not have that much impact, and the
> extra operation overhead is acceptable.
>
> ---
>
> And I am adding a more involved project to my TODOs for the future.
>
> After all, it would be good to have an array which (preferably at as
> minimal a cost as possible) could handle anything you throw at it with
> near-optimal memory consumption and sensible precision handling, while
> keeping all the benefits of numpy.
>
> Time will tell whether that is achievable. If anyone has any good ideas
> regarding this, I am all ears.
>
> Much thanks to you all for the information and ideas.
> dgpb
>
> On 13 Mar 2024, at 21:00, Homeier, Derek <dhom...@gwdg.de> wrote:
>
> > On 13 Mar 2024, at 6:01 PM, Dom Grigonis <dom.grigo...@gmail.com> wrote:
> >
> > > So my array sizes in this case are 3e8. Thus, 32-bit ints would be
> > > needed, so it is not a solution for this case.
> > >
> > > Nevertheless, such a concept would still be worthwhile for cases where
> > > integers are, say, at most 256 bits (or unlimited), even if memory
> > > addresses or offsets are 64-bit. This would both:
> > > a) save memory if many of the values in the array are much smaller
> > >    than 256 bits
> > > b) provide a standard for dynamically unlimited-size values
> >
> > In principle one could encode the individual offsets in a smarter way,
> > using just the minimal number of bits required, but again that would
> > make random access impossible or very expensive – probably more or less
> > amounting to what smart compression algorithms are already doing.
> > Another approach might be to use the mask approach after all (or just
> > flag all your uint8 data valued 2**8 - 1 as overflows) and store the
> > correct (uint64 or whatever) values and their indices in a second array.
> > This may still not vectorise very efficiently with just numpy if your
> > typical operations are non-local.
> >
> > Derek
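For concreteness, here is a minimal sketch of what the python-blosc2 route mentioned above could look like for a large integer array with mostly small values. The specific API details (blosc2.asarray, NDArray slicing and assignment, the schunk.cratio attribute) are taken from my reading of the python-blosc2 documentation linked at the top and should be treated as assumptions rather than a tested recipe:

    import numpy as np
    import blosc2  # python-blosc2; see the link at the top of this message

    # A large integer array whose values are mostly small compresses well.
    data = np.random.poisson(3, size=10_000_000).astype(np.uint16)

    # Assumption: blosc2.asarray builds a chunked, compressed NDArray.
    carr = blosc2.asarray(data)

    # Chunks are decompressed on demand; slicing returns a plain numpy array.
    window = carr[1_000_000:1_000_100]

    # Assumption: NDArray supports item assignment, recompressing the
    # affected chunk(s) in place.
    carr[1_000_000:1_000_100] = window + 1

    # Assumed attribute: compression ratio achieved on this data.
    print(carr.schunk.cratio)

The point of the sketch is the access pattern: reads and writes go through small chunks that are decompressed and recompressed on the fly, which is the same trade-off a variable-length integer encoding is trying to make by hand.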
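A side note on Dom's clipping snippet quoted above: the same saturating update can be written without the boolean mask by clipping the widened sum before it is written back. This is only a sketch of an equivalent formulation; the uint16 dtype and the MAX_UINT16 constant are carried over from the snippet:

    import numpy as np

    MAX_UINT16 = np.iinfo(np.uint16).max  # 65535

    def saturating_add(array, slc, values):
        # Widen to int64 so the sum cannot overflow, then clip to the
        # representable range before narrowing back to uint16.
        res = np.add(array[slc], values, dtype=np.int64)
        np.clip(res, 0, MAX_UINT16, out=res)
        array[slc] = res  # the cast back to uint16 is now safe

    array = np.zeros(10, dtype=np.uint16)
    saturating_add(array, slice(2, 7), [70000, -5, 3, 3, 3])
    print(array)  # -> [0 0 65535 0 3 3 3 0 0 0]

This avoids writing the possibly-wrapped intermediate values into the uint16 array before fixing them up with a mask.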
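The overflow-sideband idea Derek describes above could look roughly like the sketch below: a narrow array holds a saturation sentinel, and a second structure maps the rare overflowing indices to their full-width values. All names here (OverflowArray, SENTINEL) are hypothetical illustrations, not an existing API:

    import numpy as np

    SENTINEL = np.iinfo(np.uint8).max  # 255 marks "true value stored elsewhere"

    class OverflowArray:
        """Hypothetical uint8 array with a uint64 side table for rare large values."""

        def __init__(self, size):
            self.small = np.zeros(size, dtype=np.uint8)
            self.large = {}  # index -> true uint64 value

        def __setitem__(self, idx, value):
            if value >= SENTINEL:
                self.small[idx] = SENTINEL
                self.large[idx] = np.uint64(value)
            else:
                self.small[idx] = value
                self.large.pop(idx, None)

        def __getitem__(self, idx):
            if self.small[idx] == SENTINEL:
                return self.large[idx]
            return np.uint64(self.small[idx])

    arr = OverflowArray(1_000)
    arr[3] = 17
    arr[4] = 1_000_000        # too big for uint8, goes to the side table
    print(arr[3], arr[4])     # -> 17 1000000

As Derek notes, element-wise access like this is easy, but vectorised operations over slices that touch the side table would need more care and may not map efficiently onto plain numpy if the operations are non-local.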