Thanks for this. Random access is unfortunately a requirement.
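To make the random-access point concrete, here is a minimal sketch of the offsets-versus-counts trade-off described in the quoted message below. The array names (`content`, `counts`, `offsets`) and the two helper functions are made up for illustration; this is not how Awkward Array or ragged lay things out internally, just the general idea under those assumptions.

```python
import numpy as np

# Hypothetical ragged data: three variable-length lists stored back to back.
content = np.array([1, 2, 3, 4, 5, 6], dtype=np.uint8)
counts = np.array([2, 1, 3], dtype=np.uint8)  # list lengths fit in small ints

# Offsets are the cumulative sum of the counts, with a leading 0.
# They grow with the total data size, so they need a wider integer type.
offsets = np.zeros(len(counts) + 1, dtype=np.int32)
offsets[1:] = np.cumsum(counts, dtype=np.int32)


def get_with_offsets(i):
    # Constant time: two lookups, no scan over earlier lists.
    return content[offsets[i]:offsets[i + 1]]


def get_with_counts(i):
    # Linear time: every count before list i has to be added up first.
    start = int(counts[:i].sum())
    return content[start:start + int(counts[i])]


print(get_with_offsets(2), get_with_counts(2))
```

Both functions return the same list; the difference is only in how much work it takes to locate where it starts.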
By the way, what is the difference between awkward and ragged?

> On 13 Mar 2024, at 18:59, Jim Pivarski <jpivar...@gmail.com> wrote:
>
> After sending that email, I realize that I have to take it back: your
> motivation is to minimize memory use. The variable-length lists in Awkward
> Array (and therefore in ragged as well) are implemented using offset arrays,
> and they're at minimum 32-bit. The scheme is more cache-coherent (less
> "pointer chasing"), but doesn't reduce the size.
>
> These offsets are 32-bit so that individual values can be selected from the
> array in constant time. If you use a smaller integer size, like uint8, then
> they have to be number of elements in the lists, rather than offsets (the
> cumsum of number of elements in the lists). Then, to find a single value, you
> have to add counts from the beginning of the array.
>
> A standard way to store variable-length integers is to put the indicator of
> whether you've seen the whole integer yet in a high bit (so each byte
> effectively contributes 7 bits). That's also inherently non-random access.
>
> But if random access is not a requirement, how about Blosc and bcolz? That's
> a library that uses a very lightweight compression algorithm on the arrays
> and uncompresses them on the fly (fast enough to be practical). That sounds
> like it would fit your use-case better...
>
> Jim
>
> On Wed, Mar 13, 2024, 12:47 PM Jim Pivarski <jpivar...@gmail.com> wrote:
> This might be a good application of Awkward Array (https://awkward-array.org),
> which applies a NumPy-like interface to arbitrary tree-like data, or ragged
> (https://github.com/scikit-hep/ragged), a restriction of that to only
> variable-length lists, but satisfying the Array API standard.
>
> The variable-length data in Awkward Array hasn't been used to represent
> arbitrary precision integers, though.
> It might be a good application of "behaviors," which are documented here:
> https://awkward-array.org/doc/main/reference/ak.behavior.html
> In principle, it would be possible to define methods and overload NumPy
> ufuncs to interpret variable-length lists of int8 as integers with arbitrary
> precision. Numba might be helpful in accelerating that if normal NumPy-style
> vectorization is insufficient.
>
> If you're interested in following this route, I can help with first
> implementations of that arbitrary precision integer behavior. (It's an
> interesting application!)
>
> Jim
>
> On Wed, Mar 13, 2024, 12:28 PM Matti Picus <matti.pi...@gmail.com> wrote:
> I am not sure what kind of a scheme would support various-sized native
> ints. Any scheme that puts pointers in the array is going to be worse:
> the pointers will be 64-bit. You could store offsets to data, but then
> you would need to store both the offsets and the contiguous data, nearly
> doubling your storage. What shape are your arrays, that would be the
> minimum size of the offsets?
>
> Matti
>
> On 13/3/24 18:15, Dom Grigonis wrote:
> > By the way, I think I am referring to integer arrays. (Or integer part
> > of floats.)
> >
> > I don’t think what I am saying sensibly applies to floats as they are.
> >
> > Although, new float type could base its integer part on such concept.
> >
> > --
> >
> > Where I am coming from is that I started to hit maximum bounds on
> > integer arrays, where most of values are very small and some become
> > very large. And I am hitting memory limits. And I don’t have many
> > zeros, so sparse arrays aren’t an option.
> >
> > Approximately:
> > 90% of my arrays could fit into `np.uint8`
> > 1% requires `np.uint64`
> > the rest 9% are in between.
> >
> > And there is no predictable order where is what, so splitting is not
> > an option either.
> >
> >> On 13 Mar 2024, at 17:53, Nathan <nathan.goldb...@gmail.com> wrote:
> >>
> >> Yes, an array of references still has a fixed size width in the array
> >> buffer. You can think of each entry in the array as a pointer to some
> >> other memory on the heap, which can be a dynamic memory allocation.
> >>
> >> There's no way in NumPy to support variable-sized array elements in
> >> the array buffer, since that assumption is key to how numpy
> >> implements strided ufuncs and broadcasting.
> >>
> >> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis <dom.grigo...@gmail.com>
> >> wrote:
> >>
> >> Thank you for this.
> >>
> >> I am just starting to think about these things, so I appreciate
> >> your patience.
> >>
> >> But isn’t it still true that all elements of an array are still
> >> of the same size in memory?
> >>
> >> I am thinking along the lines of per-element dynamic memory
> >> management. Such that if I had array [0, 1e10000], the first
> >> element would default to reasonably small size in memory.
> >>
> >>> On 13 Mar 2024, at 16:29, Nathan <nathan.goldb...@gmail.com> wrote:
> >>>
> >>> It is possible to do this using the new DType system.
> >>>
> >>> Sebastian wrote a sketch for a DType backed by the GNU
> >>> multiprecision float library:
> >>> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
> >>>
> >>> It adds a significant amount of complexity to store data outside
> >>> the array buffer and introduces the possibility of
> >>> use-after-free and dangling reference errors that are impossible
> >>> if the array does not use embedded references, so that’s the
> >>> main reason it hasn’t been done much.
> >>>
> >>> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis
> >>> <dom.grigo...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Say python’s builtin `int` type. It can be as large as
> >>> memory allows.
> >>>
> >>> np.ndarray on the other hand is optimized for vectorization
> >>> via strides, memory structure and many things that I
> >>> probably don’t know. Well the point is that it is convenient
> >>> and efficient to use for many things in comparison to
> >>> python’s built-in list of integers.
> >>>
> >>> So, I am thinking whether something in between exists? (And
> >>> obviously something more clever than np.array(dtype=object))
> >>>
> >>> Probably something similar to `StringDType`, but for
> >>> integers and floats. (It’s just my guess. I don’t know
> >>> anything about `StringDType`, but just guessing it must be
> >>> better than np.array(dtype=object) in combination with
> >>> np.vectorize)
> >>>
> >>> Regards,
> >>> dgpb
> >>>
> >>> _______________________________________________
> >>> NumPy-Discussion mailing list -- numpy-discussion@python.org
> >>> To unsubscribe send an email to
> >>> numpy-discussion-le...@python.org
> >>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> >>> Member address: nathan12...@gmail.com
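For reference, the "high bit" scheme for variable-length integers mentioned in the quoted thread can be sketched roughly as follows. This assumes one common convention (low 7 bits first, high bit set when more bytes follow, as in LEB128-style encodings); the function names are made up for illustration and are not from NumPy or Awkward Array.

```python
def encode_varint(n):
    """Encode a non-negative int using 7 payload bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F          # low 7 bits of what is left
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_all(buf):
    """Decode a buffer of concatenated varints, scanning from the start."""
    values, current, shift = [], 0, 0
    for byte in buf:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


values = [3, 200, 2**40, 0]
packed = b"".join(encode_varint(v) for v in values)
print(len(packed), decode_all(packed) == values)
```

Small values cost one byte each, but finding the k-th value means walking every byte before it, which is why this layout does not give random access.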