Thanks for this.

Random access is unfortunately a requirement.

By the way, what is the difference between Awkward Array and ragged?

> On 13 Mar 2024, at 18:59, Jim Pivarski <jpivar...@gmail.com> wrote:
> 
> After sending that email, I realize that I have to take it back: your 
> motivation is to minimize memory use. The variable-length lists in Awkward 
> Array (and therefore in ragged as well) are implemented using offset arrays, 
> and they're at minimum 32-bit. The scheme is more cache-coherent (less 
> "pointer chasing"), but doesn't reduce the size.
> 
> These offsets are 32-bit so that individual values can be selected from the 
> array in constant time. If you use a smaller integer size, like uint8, then 
> they have to be the number of elements in each list, rather than offsets (the 
> cumsum of the number of elements in the lists). Then, to find a single value, 
> you have to add up the counts from the beginning of the array.
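
A minimal sketch of the offsets-versus-counts trade-off described above, assuming NumPy. The list contents and the helper names `get_with_offsets` and `get_with_counts` are made up for illustration and are not Awkward Array's API.

```python
import numpy as np

# Flattened contents of three variable-length lists: [1, 2, 3], [], [4, 5]
content = np.array([1, 2, 3, 4, 5], dtype=np.uint8)

# Compact representation: one small count per list.
counts = np.array([3, 0, 2], dtype=np.uint8)

# Offset representation: cumulative sum of the counts, with a leading zero.
offsets = np.zeros(len(counts) + 1, dtype=np.int32)
offsets[1:] = np.cumsum(counts)


def get_with_offsets(i):
    # Constant time: two lookups and one slice.
    return content[offsets[i]:offsets[i + 1]]


def get_with_counts(i):
    # Linear time: all counts before i must be summed to find the start.
    start = int(counts[:i].sum())
    return content[start:start + int(counts[i])]
```

Both helpers return the same list for a given index; only the cost of locating it differs.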
> 
> A standard way to store variable-length integers is to put the indicator of 
> whether you've seen the whole integer yet in a high bit (so each byte 
> effectively contributes 7 bits). That's also inherently non-random access.
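
The high-bit scheme can be sketched in pure Python as follows. The byte order (least significant 7-bit group first, as in LEB128) and the function names `encode_varint` and `decode_stream` are assumptions chosen for the example; the message above does not specify them.

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: integer is complete
            return bytes(out)


def decode_stream(buf):
    """Decode every integer in buf by scanning from the first byte."""
    values, current, shift = [], 0, 0
    for byte in buf:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


data = b"".join(encode_varint(v) for v in [5, 300, 2**64 - 1])
```

Small values cost one byte each, but finding the k-th integer means walking the buffer from the start, which is why the encoding is not random access.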
> 
> But if random access is not a requirement, how about Blosc and bcolz? That's 
> a library that uses a very lightweight compression algorithm on the arrays 
> and uncompresses them on the fly (fast enough to be practical). That sounds 
> like it would fit your use-case better...
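
To illustrate the compression idea without guessing at the Blosc or bcolz APIs, the sketch below swaps in `zlib` from the Python standard library. It is a stand-in, not Blosc, and the data (mostly small values with an occasional full-width one) is invented for the example.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: mostly small values stored as uint64.
values = rng.integers(0, 200, size=100_000, dtype=np.uint64)
values[::100] = np.iinfo(np.uint64).max  # every 100th value needs all 64 bits

raw = values.tobytes()
packed = zlib.compress(raw, 1)  # low compression level, chosen for speed

# Round trip: decompress the whole buffer, then view it as uint64 again.
restored = np.frombuffer(zlib.decompress(packed), dtype=np.uint64)
```

The zero-filled high bytes compress well, but this stand-in has to decompress the entire buffer before any element can be read.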
> 
> Jim
> 
> 
> 
> 
> On Wed, Mar 13, 2024, 12:47 PM Jim Pivarski <jpivar...@gmail.com> wrote:
> This might be a good application of Awkward Array (https://awkward-array.org), 
> which applies a NumPy-like interface to arbitrary tree-like data, or ragged 
> (https://github.com/scikit-hep/ragged), a restriction of that to only 
> variable-length lists, but satisfying the Array API standard.
> 
> The variable-length data in Awkward Array hasn't been used to represent 
> arbitrary precision integers, though. It might be a good application of 
> "behaviors," which are documented here: 
> https://awkward-array.org/doc/main/reference/ak.behavior.html
> In principle, it would be possible to define methods and overload NumPy 
> ufuncs to interpret variable-length lists of int8 as integers with arbitrary 
> precision. Numba might be helpful in accelerating that if normal NumPy-style 
> vectorization is insufficient.
> 
> If you're interested in following this route, I can help with first 
> implementations of that arbitrary precision integer behavior. (It's an 
> interesting application!)
> 
> Jim
> 
> 
> 
> On Wed, Mar 13, 2024, 12:28 PM Matti Picus <matti.pi...@gmail.com> wrote:
> I am not sure what kind of a scheme would support various-sized native 
> ints. Any scheme that puts pointers in the array is going to be worse: 
> the pointers will be 64-bit. You could store offsets to data, but then 
> you would need to store both the offsets and the contiguous data, nearly 
> doubling your storage. What shape are your arrays? That would determine 
> the minimum size of the offsets.
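
A back-of-the-envelope sketch of the storage argument, using the 90% / 9% / 1% split quoted further down the thread. The element count, the reading of the split as per-value fractions, and the 4-byte width for the "in between" values are all assumptions made for the example.

```python
n = 1_000_000  # hypothetical number of elements

# Assumed byte widths: 90% need 1 byte, 9% need 4 bytes (a guess for the
# "in between" values), 1% need 8 bytes.
data_bytes = (n * 90 // 100) * 1 + (n * 9 // 100) * 4 + (n * 1 // 100) * 8

fixed_uint64 = n * 8                   # plain np.uint64 array
offsets_32 = data_bytes + (n + 1) * 4  # packed bytes plus 32-bit offsets
offsets_64 = data_bytes + (n + 1) * 8  # packed bytes plus 64-bit offsets or pointers

print(fixed_uint64, offsets_32, offsets_64)
```

Under these assumptions, 32-bit offsets come in at roughly two thirds of the fixed `uint64` size, while 64-bit pointers or offsets end up larger than the fixed-width array.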
> 
> Matti
> 
> 
> On 13/3/24 18:15, Dom Grigonis wrote:
> > By the way, I think I am referring to integer arrays. (Or integer part 
> > of floats.)
> >
> > I don’t think what I am saying sensibly applies to floats as they are.
> >
> > Although a new float type could base its integer part on such a concept.
> >
> > —
> >
> > Where I am coming from is that I started to hit maximum bounds on 
> > integer arrays, where most of the values are very small and some become 
> > very large. And I am hitting memory limits. And I don’t have many 
> > zeros, so sparse arrays aren’t an option.
> >
> > Approximately:
> > 90% of my arrays could fit into `np.uint8`
> > 1% requires `np.uint64`
> > the remaining 9% are in between.
> >
> > And there is no predictable order to which values fall where, so 
> > splitting is not an option either.
> >
> >
> >> On 13 Mar 2024, at 17:53, Nathan <nathan.goldb...@gmail.com> wrote:
> >>
> >> Yes, an array of references still has a fixed width in the array 
> >> buffer. You can think of each entry in the array as a pointer to some 
> >> other memory on the heap, which can be a dynamic memory allocation.
> >>
> >> There's no way in NumPy to support variable-sized array elements in 
> >> the array buffer, since the fixed-size assumption is key to how NumPy 
> >> implements strided ufuncs and broadcasting.
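
The fixed-width point can be seen directly from `itemsize` and `strides`. This is a small sketch assuming NumPy, with array shapes picked only for illustration.

```python
import numpy as np

a = np.arange(6, dtype=np.uint8).reshape(2, 3)
b = a.astype(np.uint64)

# Element (i, j) lives at byte offset i*strides[0] + j*strides[1],
# which only works because every element has the same itemsize.
strides_u8 = a.strides   # (3, 1)
strides_u64 = b.strides  # (24, 8)

# An object array also has fixed-width entries: each one is a pointer-sized
# reference to a Python object stored elsewhere on the heap.
big = np.array([0, 10**100], dtype=object)
ref_size = big.itemsize
```

The object array holds an arbitrarily large Python `int`, yet its buffer still stores one pointer-sized slot per element.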
> >>
> >> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis <dom.grigo...@gmail.com> wrote:
> >>
> >>     Thank you for this.
> >>
> >>     I am just starting to think about these things, so I appreciate
> >>     your patience.
> >>
> >>     But isn’t it still true that all elements of an array are
> >>     of the same size in memory?
> >>
> >>     I am thinking along the lines of per-element dynamic memory
> >>     management, such that if I had an array [0, 1e10000], the first
> >>     element would default to a reasonably small size in memory.
> >>
> >>>     On 13 Mar 2024, at 16:29, Nathan <nathan.goldb...@gmail.com> wrote:
> >>>
> >>>     It is possible to do this using the new DType system.
> >>>
> >>>     Sebastian wrote a sketch for a DType backed by the GNU
> >>>     multiprecision float library:
> >>>     https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
> >>>
> >>>     It adds a significant amount of complexity to store data outside
> >>>     the array buffer and introduces the possibility of
> >>>     use-after-free and dangling reference errors that are impossible
> >>>     if the array does not use embedded references, so that’s the
> >>>     main reason it hasn’t been done much.
> >>>
> >>>     On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis
> >>>     <dom.grigo...@gmail.com> wrote:
> >>>
> >>>         Hi all,
> >>>
> >>>         Take Python’s built-in `int` type. It can be as large as
> >>>         memory allows.
> >>>
> >>>         np.ndarray, on the other hand, is optimized for vectorization
> >>>         via strides, memory structure and many things that I
> >>>         probably don’t know. The point is that it is convenient
> >>>         and efficient to use for many things in comparison to
> >>>         Python’s built-in list of integers.
> >>>
> >>>         So I am wondering whether something in between exists. (And
> >>>         obviously something more clever than np.array(dtype=object).)
> >>>
> >>>         Probably something similar to `StringDType`, but for
> >>>         integers and floats. (It’s just my guess. I don’t know
> >>>         anything about `StringDType`, but just guessing it must be
> >>>         better than np.array(dtype=object) in combination with
> >>>         np.vectorize)
> >>>
> >>>         Regards,
> >>>         dgpb
> >>>
> >>>         _______________________________________________
> >>>         NumPy-Discussion mailing list -- numpy-discussion@python.org 
> >>> <mailto:numpy-discussion@python.org>
> >>>         To unsubscribe send an email to
> >>>         numpy-discussion-le...@python.org 
> >>> <mailto:numpy-discussion-le...@python.org>
> >>>         
> >>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
> >>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
> >>>         Member address: nathan12...@gmail.com 
> >>> <mailto:nathan12...@gmail.com>
> >>>
> >
> >
