Hi all,

an advantage of sub-byte datatypes is the potential for accelerated computing. For GPUs, int4 is already happening. Or take int1 as an example: a packed array of 64 one-bit elements occupies only eight bytes, and adding two such arrays element-wise (modulo 2) reduces to a single uint64 XOR (or eight uint8 XORs).
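For illustration, a minimal NumPy sketch of that packed int1 addition (np.packbits is used here only to build the packed buffers; the bit layout is an assumption for the example):

import numpy as np

# Two logical int1 arrays of length 64, stored packed as 8 bytes each.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=64, dtype=np.uint8)
b = rng.integers(0, 2, size=64, dtype=np.uint8)

a_packed = np.packbits(a)  # shape (8,), dtype uint8
b_packed = np.packbits(b)

# Element-wise addition modulo 2 on the packed data is just XOR:
# either eight uint8 XORs, or a single XOR on a uint64 view.
s_packed = a_packed ^ b_packed
s_packed64 = a_packed.view(np.uint64) ^ b_packed.view(np.uint64)

assert np.array_equal(np.unpackbits(s_packed), (a + b) % 2)
assert np.array_equal(s_packed64.view(np.uint8), s_packed)

The same idea should extend to (u)int2 and (u)int4, at the cost of a few extra mask-and-shift steps per byte.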
However, I would rather limit sub-byte types to int1, (u)int2 and (u)int4 (at least to begin with), as they are the only ones that divide a byte evenly.

Considering single-element access: an element of such an array could be reached by dividing the index to locate the containing byte, then shifting and ANDing with a mask to extract the value; uint8 would probably make sense as the underlying storage. That creates some per-element overhead, of course, but the data is more compact (which is nice for CPU/GPU caches) and full-array operations are faster.

Striding could be handled in the same way as single-element access. That would be inefficient as well, but one could auto-generate type-specific C code (for int1, (u)int2, (u)int4 and their combinations) that accelerates the popular operators, so the common paths would not have to loop over every entry with single-element access.

Regarding the "byte-sized strided" layout: isn't it possible to pre-process the strides and post-process the output as described above, i.e. with a wrapping class around a uint8 array?
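To make the single-element access concrete, here is a rough Python sketch of such a wrapping class for unsigned int4 data, two values per byte (the class name and the nibble layout are assumptions for illustration only, not a concrete proposal):

import numpy as np

class PackedUInt4Array:
    """Toy wrapper: n logical uint4 values stored two per byte in a uint8 buffer."""

    def __init__(self, n):
        self.n = n
        self._buf = np.zeros((n + 1) // 2, dtype=np.uint8)

    def __getitem__(self, i):
        shift = 4 * (i % 2)                   # low or high nibble
        return (self._buf[i // 2] >> shift) & 0x0F

    def __setitem__(self, i, value):
        shift = 4 * (i % 2)
        byte = int(self._buf[i // 2])
        byte &= ~(0x0F << shift) & 0xFF       # clear the nibble
        byte |= (int(value) & 0x0F) << shift  # write the new 4-bit value
        self._buf[i // 2] = byte

a = PackedUInt4Array(5)
a[3] = 9
assert a[3] == 9 and a[2] == 0

Full-array operators would of course bypass this per-element path and work on the underlying uint8 buffer directly, ideally via generated type-specific C kernels as described above.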
What do you think? Am I missing out on something?

Best, Michael

> On 11. Nov 2022, at 18:23, Sebastian Berg <sebast...@sipsolutions.net> wrote: > > On Fri, 2022-11-11 at 09:13 -0700, Greg Lucas wrote: >>> >>> OK, more below. But unfortunately `int2` and `int4` *are* >>> problematic, >>> because the NumPy array uses a byte-sized strided layout, so you >>> would >>> have to store them in a full byte, which is probably not what you >>> want. >> >> >>> I am always thinking of adding a provision for it in the DTypes so >>> that >>> someone could use part of the NumPy machine to make an array that >>> can >>> have non-byte sized strides, but the NumPy array itself is ABI >>> incompatible with storing these packed :(. >> >> >> >> (I.e. we could plug that "hole" to allow making an int4 DType in >> NumPy, >>> but it would still have to take 1-byte storage space when put into >>> a >>> NumPy array, so I am not sure there is much of a point.) >> >> >> >> >> I have also been curious about the new DTypes mechanism and whether >> we >> could do non byte-size DTypes with it. One use-case I have >> specifically is >> for reading and writing non byte-aligned data [1]. So, this would >> work very >> well for that use-case if the dtype knew how to read/write the >> proper bit-size. For my use-case I wouldn't care too much if >> internally >> Numpy needs to expand and store the data as full bytes, but being >> able to >> read a bitwise binary stream into Numpy native dtypes for further >> processing would be useful I think (without having to resort to >> unpackbits >> and do rearranging/packing to other types). >> >> dtype = {'names': ('count0', 'count1'), 'formats': ('uint3', >> 'uint5')} >> # x would have two unsigned ints, but reading only one byte from the >> stream >> x = np.frombuffer(buffer, dtype) >> # would be ideal to get tobytes() to know how to pack a uint3+uint5 >> DType >> into a single byte as well >> x.tobytes() > > > Unfortunately, I suspect the amount of expectations users would have > from a full DType, and the fact that bit-sized will be a bit awkward in > NumPy arrays for the forseeable future makes me think dedicated > conversion functions are probably more practical. > > Yes, you could do a `MyInt(bits=5, offset=3)` DType and at least you > could view the same array also with `MyInt(bits=3, offset=0)`. (Maybe > also structured DType, but I am not certain that is advisable and > custom structured DTypes would require holes to be plucked). > > A custom dtype that is "structured" might work (i.e. you could store > two numbers in one byte of course).
> Currently you cannot integrate deep enough into NumPy to build > structured dtypes based on arbitrary other dtypes, but you could do it > for your own bit DType. > (I am not quite sure you can make `arr["count0"]` work, this is a hole > that needs plucking.) > > This is probably not a small task though. > > > Could `tobytes()` be made to compactify? Yes, but then it suddenly > needs extra logic for bit-sized and doesn't just expose memory. That > is maybe fine, but also seems a bit awkward? > > I would love to have a better answer, but dancing around the byte- > strided ABI seems tricky... > > Anyway, I am always available to discuss such possibilities, there are > some corners w.r.t. to such bit-sized thoughts which are still shrouded > in fog. > > - Sebastian > > >> >> Greg >> >> [1] Specifically, this is for very low bandwidth satellite data where >> we >> try to pack as much information in the downlink and use every bit of >> space, >> but once on the ground I can expand the bit-size fields to byte-size >> fields >> without too much issue of worrying about space [puns intended]. >> >> >>> On Fri, Nov 11, 2022 at 7:14 AM Sebastian Berg < >>> sebast...@sipsolutions.net> >>> wrote: >>> >>> On Fri, 2022-11-11 at 14:55 +0100, Oscar Gustafsson wrote: >>>> Thanks! That does indeed look like a promising approach! And for >>>> sure >>>> it >>>> would be better to avoid having to reimplement the whole array- >>>> part >>>> and >>>> only focus on the data types. (If successful, my idea of a >>>> project >>>> would >>>> basically solve all the custom numerical types discussed, >>>> bfloat16, >>>> int2, >>>> int4 etc.) >>> >>> OK, more below. But unfortunately `int2` and `int4` *are* >>> problematic, >>> because the NumPy array uses a byte-sized strided layout, so you >>> would >>> have to store them in a full byte, which is probably not what you >>> want. >>> >>> I am always thinking of adding a provision for it in the DTypes so >>> that >>> someone could use part of the NumPy machine to make an array that >>> can >>> have non-byte sized strides, but the NumPy array itself is ABI >>> incompatible with storing these packed :(. >>> >>> (I.e. we could plug that "hole" to allow making an int4 DType in >>> NumPy, >>> but it would still have to take 1-byte storage space when put into >>> a >>> NumPy array, so I am not sure there is much of a point.) >>> >>>> >>>> I understand that the following is probably a hard question to >>>> answer, but >>>> is it expected that there will be work done on this in the "near" >>>> future >>>> to fill any holes and possibly become more stable? For context, >>>> the >>>> current >>>> plan on my side is to propose this as a student project for the >>>> spring, so >>>> primarily asking for planning and describing the project a bit >>>> better. >>> >>> >>> Well, it depends on what you need. With the exception above, I >>> doubt >>> the "holes" will matter much practice unless you are targeting for >>> a >>> polished release rather than experimentation. >>> But of course it may be that you run into something that is >>> important >>> for you, but doesn't yet quite work. >>> >>> I will note just dealing with the Python/NumPy C-API can be a >>> fairly >>> steep learning curve, so you need someone comfortable to dive in >>> and >>> budget a good amount of time for that part. >>> And yes, this is pretty new, so there may be stumbling stones >>> (which I >>> am happy to discuss in NumPy issues or directly).
>>> >>> - Sebastian >>> >>> >>>> >>>> BR Oscar >>>> >>>> Den tors 10 nov. 2022 kl 15:13 skrev Sebastian Berg < >>>> sebast...@sipsolutions.net>: >>>> >>>>> On Thu, 2022-11-10 at 14:55 +0100, Oscar Gustafsson wrote: >>>>>> Den tors 10 nov. 2022 kl 13:10 skrev Sebastian Berg < >>>>>> sebast...@sipsolutions.net>: >>>>>> >>>>>>> On Thu, 2022-11-10 at 11:08 +0100, Oscar Gustafsson wrote: >>>>>>>>> >>>>>>>>> I'm not an expert, but I never encountered rounding >>>>>>>>> floating >>>>>>>>> point >>>>>>>>> numbers >>>>>>>>> in bases different from 2 and 10. >>>>>>>>> >>>>>>>> >>>>>>>> I agree that this is probably not very common. More a >>>>>>>> possibility >>>>>>>> if >>>>>>>> one >>>>>>>> would supply a base argument to around. >>>>>>>> >>>>>>>> However, it is worth noting that Matlab has the quant >>>>>>>> function, >>>>>>>> https://www.mathworks.com/help/deeplearning/ref/quant.html >>>>>>>> wh >>>>>>>> ich >>>>>>>> basically >>>>>>>> supports arbitrary bases (as a special case of an even >>>>>>>> more >>>>>>>> general >>>>>>>> approach). So there may be other use cases (although the >>>>>>>> example >>>>>>>> basically >>>>>>>> just implements around(x, 1)). >>>>>>> >>>>>>> >>>>>>> To be honest, hearing hardware design and data compression >>>>>>> does >>>>>>> make me >>>>>>> lean towards it not being mainstream enough that inclusion >>>>>>> in >>>>>>> NumPy >>>>>>> really makes sense. But happy to hear opposing opinions. >>>>>>> >>>>>> >>>>>> Here I can easily argue that "all" computations are limited >>>>>> by >>>>>> finite >>>>>> word >>>>>> length and as soon as you want to see the effect of any type >>>>>> of >>>>>> format not >>>>>> supported out of the box, it will be beneficial. (Strictly, >>>>>> it >>>>>> makes >>>>>> more >>>>>> sense to quantize to a given number of bits than a given >>>>>> number >>>>>> of >>>>>> decimal >>>>>> digits, as we cannot represent most of those exactly.) But I >>>>>> may >>>>>> not >>>>>> do >>>>>> that. >>>>>> >>>>>> >>>>>>> It would be nice to have more of a culture around ufuncs >>>>>>> that >>>>>>> do >>>>>>> not >>>>>>> live in NumPy. (I suppose at some point it was more >>>>>>> difficult >>>>>>> to >>>>>>> do C- >>>>>>> extension, but that is many years ago). >>>>>>> >>>>>> >>>>>> I do agree with this though. And this got me realizing that >>>>>> maybe >>>>>> what I >>>>>> actually would like to do is to create an array-library with >>>>>> fully >>>>>> customizable (numeric) data types instead. That is, sort of, >>>>>> the >>>>>> proper way >>>>>> to do it, although the proposed approach is indeed simpler >>>>>> and in >>>>>> most >>>>>> cases will work well enough. >>>>>> >>>>>> (Am I right in believing that it is not that easy to piggy- >>>>>> back >>>>>> custom data >>>>>> types onto NumPy arrays? Something different from using >>>>>> object as >>>>>> dtype or >>>>>> the "struct-like" custom approach using the existing scalar >>>>>> types.) >>>>> >>>>> NumPy is pretty much fully customizeable (beyond just numeric >>>>> data >>>>> types). >>>>> Admittedly, to not have weird edge cases and have more power >>>>> you >>>>> have >>>>> to use the new API (NEP 41-43 [1]) and that is "experimental" >>>>> and >>>>> may >>>>> have some holes. >>>>> "Experimental" doesn't mean it is expected to change >>>>> significantly, >>>>> just that you can't ship your stuff broadly really. >>>>> >>>>> The holes may matter for some complicated dtypes (custom memory >>>>> allocation, parametric...). 
But at this point many should be >>>>> rather >>>>> fixable, so before you do your own give NumPy a chance? >>>>> >>>>> - Sebastian >>>>> >>>>> >>>>> [1] https://numpy.org/neps/nep-0041-improved-dtype-support.html >>>>> >>>>>> >>>>>> BR Oscar Gustafsson
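PS regarding Greg's uint3/uint5 example above: until a real bit-sized DType exists, the round trip can at least be emulated with explicit shifts and masks on uint8 data. A rough sketch (the field order within the byte is just an assumption):

import numpy as np

# Hypothetical packing: count0 in the high 3 bits, count1 in the low 5 bits.
buffer = bytes([0b101_01101, 0b001_11111])  # two packed records
raw = np.frombuffer(buffer, dtype=np.uint8)

count0 = raw >> 5            # "uint3" field, widened to uint8
count1 = raw & 0b0001_1111   # "uint5" field, widened to uint8

# Packing back into one byte per record (a stand-in for the wished-for tobytes()):
packed = ((count0 << 5) | count1).astype(np.uint8)
assert packed.tobytes() == buffer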