On Mon, Nov 20, 2023 at 10:08 PM Sebastien Binet <bi...@cern.ch> wrote:

> hi there,
>
> I have written a Go package[1] that can read/write simple arrays in the
> numpy file format [2].
> when I wrote it, it was for simple interoperability use cases, but now
> people would like to be able to read back ragged-arrays[3].
>
> unless I am mistaken, this means I need to interpret pieces of pickled
> data (`ndarray`, `multiarray` and `dtype`).
>
> so I am trying to understand how to unpickle `dtype` values that have been
> pickled:
>
> ```python
> import numpy as np
> import pickle
> import pickletools as pt
>
> pt.dis(pickle.dumps(np.dtype("int32"), protocol=4), annotate=True)
> ```
>
> gives:
> ```
>     0: \x80 PROTO      4                 Protocol version indicator.
>     2: \x95 FRAME      55                Indicate the beginning of a new frame.
>    11: \x8c SHORT_BINUNICODE 'numpy'     Push a Python Unicode string object.
>    18: \x94 MEMOIZE    (as 0)            Store the stack top into the memo.  The stack is not popped.
>    19: \x8c SHORT_BINUNICODE 'dtype'     Push a Python Unicode string object.
>    26: \x94 MEMOIZE    (as 1)            Store the stack top into the memo.  The stack is not popped.
>    27: \x93 STACK_GLOBAL                 Push a global object (module.attr) on the stack.
>    28: \x94 MEMOIZE    (as 2)            Store the stack top into the memo.  The stack is not popped.
>    29: \x8c SHORT_BINUNICODE 'i4'        Push a Python Unicode string object.
>    33: \x94 MEMOIZE    (as 3)            Store the stack top into the memo.  The stack is not popped.
>    34: \x89 NEWFALSE                     Push False onto the stack.
>    35: \x88 NEWTRUE                      Push True onto the stack.
>    36: \x87 TUPLE3                       Build a three-tuple out of the top three items on the stack.
>    37: \x94 MEMOIZE    (as 4)            Store the stack top into the memo.  The stack is not popped.
>    38: R    REDUCE                       Push an object built from a callable and an argument tuple.
>    39: \x94 MEMOIZE    (as 5)            Store the stack top into the memo.  The stack is not popped.
>    40: (    MARK                         Push markobject onto the stack.
>    41: K        BININT1    3             Push a one-byte unsigned integer.
>    43: \x8c     SHORT_BINUNICODE '<'     Push a Python Unicode string object.
>    46: \x94     MEMOIZE    (as 6)        Store the stack top into the memo.  The stack is not popped.
>    47: N        NONE                     Push None on the stack.
>    48: N        NONE                     Push None on the stack.
>    49: N        NONE                     Push None on the stack.
>    50: J        BININT     -1            Push a four-byte signed integer.
>    55: J        BININT     -1            Push a four-byte signed integer.
>    60: K        BININT1    0             Push a one-byte unsigned integer.
>    62: t        TUPLE      (MARK at 40)  Build a tuple out of the topmost stack slice, after markobject.
>    63: \x94 MEMOIZE    (as 7)            Store the stack top into the memo.  The stack is not popped.
>    64: b    BUILD                        Finish building an object, via __setstate__ or dict update.
>    65: .    STOP                         Stop the unpickling machine.
> highest protocol among opcodes = 4
> ```
>
> I have tried to find the usual `__reduce__` and `__setstate__` methods to
> understand what the various arguments are, to no avail.
>

First, be sure to read the generic `object.__reduce__` docs:

https://docs.python.org/3.11/library/pickle.html#object.__reduce__

Here is the C source for `np.dtype.__reduce__()`:

https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2623-L2750

And `np.dtype.__setstate__()`:

https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2787-L3151
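
To see how the generic `__reduce__` protocol plays out on your byte stream without NumPy in the loop, here is a minimal sketch using only the standard library. The bytes are transcribed by hand from your disassembly; `DTypeStub` and `StubUnpickler` are hypothetical names made up for this illustration, not anything NumPy provides. The stub stands in for `numpy.dtype` so that it just records what the `REDUCE` and `BUILD` opcodes hand to it:

```python
import io
import pickle

# Protocol-4 pickle of np.dtype("int32"), transcribed from the
# pickletools disassembly quoted above (66 bytes, FRAME length 55).
PICKLED_INT32 = (
    b"\x80\x04\x95\x37\x00\x00\x00\x00\x00\x00\x00"
    b"\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94"
    b"\x8c\x02i4\x94\x89\x88\x87\x94R\x94"
    b"(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b."
)


class DTypeStub:
    """Hypothetical stand-in for numpy.dtype; it only records its inputs."""

    def __init__(self, *args):
        # REDUCE calls the global with the argument tuple: DTypeStub('i4', False, True)
        self.ctor_args = args
        self.state = None

    def __setstate__(self, state):
        # BUILD finds __setstate__ on the new object and passes the state tuple
        self.state = state


class StubUnpickler(pickle.Unpickler):
    """Resolve numpy.dtype to the stub and refuse every other global."""

    def find_class(self, module, name):
        if (module, name) == ("numpy", "dtype"):
            return DTypeStub
        raise pickle.UnpicklingError(f"unexpected global {module}.{name}")


obj = StubUnpickler(io.BytesIO(PICKLED_INT32)).load()
print(obj.ctor_args)  # ('i4', False, True)
print(obj.state)      # (3, '<', None, None, None, -1, -1, 0)
```

A reader in another language would presumably do the same two steps: treat `REDUCE` on `numpy.dtype` as "construct from these arguments" and `BUILD` as "apply this state tuple".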

so, in:
> ```python
> >>> np.dtype("int32").__reduce__()[1]
> ('i4', False, True)
> ```

These are the arguments to the `np.dtype` constructor and are documented in
`np.dtype.__doc__`. The `False, True` pair (`align=False`, `copy=True`) is
hardcoded in `__reduce__()` and never varies.


> >>> np.dtype("int32").__reduce__()[2]
> (3, '<', None, None, None, -1, -1, 0)
>

These are arguments to pass to `np.dtype.__setstate__()` after the object
has been created.

0. `3` is the version number of the state. `3` is typical for simple
dtypes; datetimes and other dtypes with metadata bump this to `4` and use a
9-element tuple instead of this 8-element one.
1. `'<'` is the endianness flag.
2. If the dtype has a subarray
<https://numpy.org/doc/stable/reference/arrays.dtypes.html#index-7> (e.g.
`np.dtype((np.int32, (2,2)))`), that info goes here, else `None`.
3. If there are fields, a tuple of the field names, else `None`.
4. If there are fields, the field descriptor dict, else `None`.
5. If extended dtype (e.g. fields, strings, void, etc.), the element size,
else `-1`.
6. If extended dtype, the alignment, else `-1`.
7. The `flags` bit-flags; see `np.dtype.flags.__doc__`.
8. If datetime or with metadata, that metadata goes here, else the element
is absent.
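
The layout listed above can be sketched as a small pure-Python decoder, which may be a useful reference when porting to Go. `DTypeState` and `parse_dtype_state` are hypothetical names chosen for this example, and the sketch only handles the two state versions described here (`3` and `4`); anything else is rejected rather than guessed at:

```python
from typing import Any, NamedTuple, Optional


class DTypeState(NamedTuple):
    """Named view of the tuple passed to np.dtype.__setstate__() (hypothetical helper)."""

    version: int             # 0. state version (3 or 4 in this sketch)
    byteorder: str           # 1. endianness flag, e.g. '<'
    subarray: Any            # 2. subarray info, or None
    names: Optional[tuple]   # 3. field names, or None
    fields: Optional[dict]   # 4. field descriptor dict, or None
    elsize: int              # 5. element size for extended dtypes, else -1
    alignment: int           # 6. alignment for extended dtypes, else -1
    flags: int               # 7. dtype flags bit-field
    metadata: Any = None     # 8. only present in version-4 states


# Expected tuple length per state version, as described in the list above.
_EXPECTED_LENGTH = {3: 8, 4: 9}


def parse_dtype_state(state: tuple) -> DTypeState:
    """Validate the version/length pairing and attach names to the elements."""
    version = state[0]
    expected = _EXPECTED_LENGTH.get(version)
    if expected is None:
        raise ValueError(f"unsupported dtype state version: {version!r}")
    if len(state) != expected:
        raise ValueError(
            f"version {version} state should have {expected} elements, got {len(state)}"
        )
    return DTypeState(*state)


# The state tuple from the int32 example in this thread.
state = parse_dtype_state((3, "<", None, None, None, -1, -1, 0))
print(state.byteorder, state.elsize, state.flags)  # < -1 0
```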

-- 
Robert Kern
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org