[Numpy-discussion] Re: Arrays of variable itemsize

Jim Pivarski Wed, 13 Mar 2024 10:05:18 -0700

After sending that email, I realized that I have to take it back: your
motivation is to minimize memory use. The variable-length lists in Awkward
Array (and therefore in ragged as well) are implemented using offset
arrays, and the offsets are at minimum 32-bit. The scheme is more
cache-friendly (less "pointer chasing"), but doesn't reduce the size.


These offsets are 32-bit so that individual values can be selected from the
array in constant time. If you use a smaller integer size, like uint8, then
the stored numbers have to be the number of elements in each list, rather
than offsets (the cumsum of the number of elements in the lists). Then, to
find a single value, you have to add up the counts from the beginning of the
array.
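
A minimal sketch of that trade-off, in plain NumPy rather than Awkward
Array's actual internals. The names `content`, `offsets`, `counts` and the
two helper functions are made up for illustration:

```python
import numpy as np

# Three variable-length lists stored back to back in one flat buffer:
# [1, 2, 3], [], [4, 5]
content = np.array([1, 2, 3, 4, 5], dtype=np.uint8)

# Option A: offsets (cumulative sum of the list lengths, with a leading 0).
offsets = np.array([0, 3, 3, 5], dtype=np.int32)

def get_with_offsets(i):
    # Constant time: two lookups, regardless of i.
    return content[offsets[i]:offsets[i + 1]]

# Option B: per-list counts, which fit in a small integer type.
counts = np.array([3, 0, 2], dtype=np.uint8)

def get_with_counts(i):
    # Linear time: the start position has to be re-derived by summing
    # all counts that come before list i.
    start = int(counts[:i].sum(dtype=np.int64))
    return content[start:start + int(counts[i])]
```

Both helpers return the same slices; the counts take fewer bytes here, but
every lookup has to walk the counts from the beginning.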

A standard way to store variable-length integers is to put the indicator of
whether you've seen the whole integer yet in the high bit of each byte (so
each byte effectively contributes 7 bits). That's also inherently
non-random access.
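
As a rough sketch of that encoding in pure Python: the convention assumed
here (high bit set means "more bytes follow", low-order 7-bit group first)
is the LEB128-style one, and the function names are hypothetical:

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: this is the last byte
            return bytes(out)

def decode_stream(buf):
    """Decode every integer in buf; this has to scan from the start."""
    values, current, shift = [], 0, 0
    for byte in buf:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values

# One small, one medium, and one full 64-bit value in a single byte stream.
data = b"".join(encode_varint(v) for v in [5, 300, 2**64 - 1])
```

Small values take one byte each, but there is no way to jump to the k-th
integer without decoding everything before it.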

But if random access is not a requirement, how about Blosc and bcolz?
Blosc is a very lightweight compressor, and bcolz uses it to keep arrays
compressed and uncompress them on the fly (fast enough to be practical).
That sounds like it would fit your use-case better...
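
To get a feel for why compression helps with this kind of data, here is a
small sketch. It uses zlib from the standard library purely as a stand-in
(it is not Blosc, and it is slower), and the data is made up to roughly
resemble the distribution described in the thread:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: mostly values that fit in uint8, about 1% that need
# most of a uint64, all stored in a uint64 array.
values = rng.integers(0, 256, size=100_000, dtype=np.uint64)
big = rng.random(values.size) < 0.01
values[big] = rng.integers(2**32, 2**63, size=int(big.sum()), dtype=np.uint64)

raw = values.tobytes()
packed = zlib.compress(raw, 1)  # fastest zlib level, as a stand-in for Blosc

# Round trip back to an array to confirm nothing was lost.
restored = np.frombuffer(zlib.decompress(packed), dtype=np.uint64)

print(len(raw), len(packed))
```

Most of the bytes in such an array are zero padding, which is what the
compressor removes; the cost is that you decompress a block to read from it.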

Jim




On Wed, Mar 13, 2024, 12:47 PM Jim Pivarski <jpivar...@gmail.com> wrote:

> This might be a good application of Awkward Array (
> https://awkward-array.org), which applies a NumPy-like interface to
> arbitrary tree-like data, or ragged (https://github.com/scikit-hep/ragged),
> a restriction of that to only variable-length lists, but satisfying the
> Array API standard.
>
> The variable-length data in Awkward Array hasn't been used to represent
> arbitrary precision integers, though. It might be a good application of
> "behaviors," which are documented here:
> https://awkward-array.org/doc/main/reference/ak.behavior.html In
> principle, it would be possible to define methods and overload NumPy ufuncs
> to interpret variable-length lists of int8 as integers with arbitrary
> precision. Numba might be helpful in accelerating that if normal
> NumPy-style vectorization is insufficient.
>
> If you're interested in following this route, I can help with first
> implementations of that arbitrary precision integer behavior. (It's an
> interesting application!)
>
> Jim
>
>
>
> On Wed, Mar 13, 2024, 12:28 PM Matti Picus <matti.pi...@gmail.com> wrote:
>
>> I am not sure what kind of a scheme would support various-sized native
>> ints. Any scheme that puts pointers in the array is going to be worse:
>> the pointers will be 64-bit. You could store offsets to data, but then
>> you would need to store both the offsets and the contiguous data, nearly
>> doubling your storage. What shape are your arrays? That would determine
>> the minimum size of the offsets.
>>
>> Matti
>>
>>
>> On 13/3/24 18:15, Dom Grigonis wrote:
>> > By the way, I think I am referring to integer arrays. (Or integer part
>> > of floats.)
>> >
>> > I don’t think what I am saying sensibly applies to floats as they are.
>> >
>> > Although a new float type could base its integer part on such a concept.
>> >
>> > —
>> >
>> > Where I am coming from is that I started to hit maximum bounds on
>> > integer arrays, where most of the values are very small and some become
>> > very large. And I am hitting memory limits. And I don’t have many
>> > zeros, so sparse arrays aren’t an option.
>> >
>> > Approximately:
>> > 90% of my arrays could fit into `np.uint8`
>> > 1% requires `np.uint64`
>> > the remaining 9% are in between.
>> >
>> > And there is no predictable order to what is where, so splitting is
>> > not an option either.
>> >
>> >
>> >> On 13 Mar 2024, at 17:53, Nathan <nathan.goldb...@gmail.com> wrote:
>> >>
>> >> Yes, an array of references still has a fixed width in the array
>> >> buffer. You can think of each entry in the array as a pointer to some
>> >> other memory on the heap, which can be a dynamic memory allocation.
>> >>
>> >> There's no way in NumPy to support variable-sized array elements in
>> >> the array buffer, since that assumption is key to how numpy
>> >> implements strided ufuncs and broadcasting.
>> >>
>> >> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis <dom.grigo...@gmail.com>
>> >> wrote:
>> >>
>> >>     Thank you for this.
>> >>
>> >>     I am just starting to think about these things, so I appreciate
>> >>     your patience.
>> >>
>> >>     But isn’t it still true that all elements of an array are still
>> >>     of the same size in memory?
>> >>
>> >>     I am thinking along the lines of per-element dynamic memory
>> >>     management. Such that if I had array [0, 1e10000], the first
>> >>     element would default to reasonably small size in memory.
>> >>
>> >>>     On 13 Mar 2024, at 16:29, Nathan <nathan.goldb...@gmail.com> wrote:
>> >>>
>> >>>     It is possible to do this using the new DType system.
>> >>>
>> >>>     Sebastian wrote a sketch for a DType backed by the GNU
>> >>>     multiprecision float library:
>> >>>     https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
>> >>>
>> >>>     It adds a significant amount of complexity to store data outside
>> >>>     the array buffer and introduces the possibility of
>> >>>     use-after-free and dangling reference errors that are impossible
>> >>>     if the array does not use embedded references, so that’s the
>> >>>     main reason it hasn’t been done much.
>> >>>
>> >>>     On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis
>> >>>     <dom.grigo...@gmail.com> wrote:
>> >>>
>> >>>         Hi all,
>> >>>
>> >>>         Take Python’s built-in `int` type. It can be as large as
>> >>>         memory allows.
>> >>>
>> >>>         np.ndarray on the other hand is optimized for vectorization
>> >>>         via strides, memory structure and many things that I
>> >>>         probably don’t know. Well the point is that it is convenient
>> >>>         and efficient to use for many things in comparison to
>> >>>         python’s built-in list of integers.
>> >>>
>> >>>         So, I am wondering whether something in between exists. (And
>> >>>         obviously something more clever than np.array(dtype=object))
>> >>>
>> >>>         Probably something similar to `StringDType`, but for
>> >>>         integers and floats. (It’s just my guess. I don’t know
>> >>>         anything about `StringDType`, but just guessing it must be
>> >>>         better than np.array(dtype=object) in combination with
>> >>>         np.vectorize)
>> >>>
>> >>>         Regards,
>> >>>         dgpb
>> >>>
>> >>>         _______________________________________________
>> >>>         NumPy-Discussion mailing list -- numpy-discussion@python.org
>> >>>         To unsubscribe send an email to
>> >>>         numpy-discussion-le...@python.org
>> >>>         https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> >>>         Member address: nathan12...@gmail.com
>

