On Mon, Nov 20, 2023 at 10:08 PM Sebastien Binet <bi...@cern.ch> wrote:
> hi there,
>
> I have written a Go package[1] that can read/write simple arrays in the
> numpy file format [2].
> when I wrote it, it was for simple interoperability use cases, but now
> people would like to be able to read back ragged-arrays[3].
>
> unless I am mistaken, this means I need to interpret pieces of pickled
> data (`ndarray`, `multiarray` and `dtype`).
>
> so I am trying to understand how to unpickle `dtype` values that have been
> pickled:
>
> ```python
> import numpy as np
> import pickle
> import pickletools as pt
>
> pt.dis(pickle.dumps(np.dtype("int32"), protocol=4), annotate=True)
> ```
>
> gives:
> ```
>     0: \x80 PROTO            4            Protocol version indicator.
>     2: \x95 FRAME            55           Indicate the beginning of a new frame.
>    11: \x8c SHORT_BINUNICODE 'numpy'      Push a Python Unicode string object.
>    18: \x94 MEMOIZE          (as 0)       Store the stack top into the memo. The stack is not popped.
>    19: \x8c SHORT_BINUNICODE 'dtype'      Push a Python Unicode string object.
>    26: \x94 MEMOIZE          (as 1)       Store the stack top into the memo. The stack is not popped.
>    27: \x93 STACK_GLOBAL                  Push a global object (module.attr) on the stack.
>    28: \x94 MEMOIZE          (as 2)       Store the stack top into the memo. The stack is not popped.
>    29: \x8c SHORT_BINUNICODE 'i4'         Push a Python Unicode string object.
>    33: \x94 MEMOIZE          (as 3)       Store the stack top into the memo. The stack is not popped.
>    34: \x89 NEWFALSE                      Push False onto the stack.
>    35: \x88 NEWTRUE                       Push True onto the stack.
>    36: \x87 TUPLE3                        Build a three-tuple out of the top three items on the stack.
>    37: \x94 MEMOIZE          (as 4)       Store the stack top into the memo. The stack is not popped.
>    38: R    REDUCE                        Push an object built from a callable and an argument tuple.
>    39: \x94 MEMOIZE          (as 5)       Store the stack top into the memo. The stack is not popped.
>    40: (    MARK                          Push markobject onto the stack.
>    41: K    BININT1          3            Push a one-byte unsigned integer.
>    43: \x8c SHORT_BINUNICODE '<'          Push a Python Unicode string object.
>    46: \x94 MEMOIZE          (as 6)       Store the stack top into the memo. The stack is not popped.
>    47: N    NONE                          Push None on the stack.
>    48: N    NONE                          Push None on the stack.
>    49: N    NONE                          Push None on the stack.
>    50: J    BININT           -1           Push a four-byte signed integer.
>    55: J    BININT           -1           Push a four-byte signed integer.
>    60: K    BININT1          0            Push a one-byte unsigned integer.
>    62: t    TUPLE            (MARK at 40) Build a tuple out of the topmost stack slice, after markobject.
>    63: \x94 MEMOIZE          (as 7)       Store the stack top into the memo. The stack is not popped.
>    64: b    BUILD                         Finish building an object, via __setstate__ or dict update.
>    65: .    STOP                          Stop the unpickling machine.
> highest protocol among opcodes = 4
> ```
>
> I have tried to find the usual `__reduce__` and `__setstate__` methods to
> understand what are the various arguments, to no avail.
>

First, be sure to read the generic `object.__reduce__` docs:

https://docs.python.org/3.11/library/pickle.html#object.__reduce__

Here is the C source for `np.dtype.__reduce__()`:

https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2623-L2750

And `np.dtype.__setstate__()`:

https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2787-L3151

> so, in:
> ```python
> >>> np.dtype("int32").__reduce__()[1]
> ('i4', False, True)
>

These are arguments to the `np.dtype` constructor and are documented in
`np.dtype.__doc__`. The `False, True` arguments are hardcoded and always have
those values.

> >>> np.dtype("int32").__reduce__()[2]
> (3, '<', None, None, None, -1, -1, 0)
>

These are arguments to pass to `np.dtype.__setstate__()` after the object
has been created.

0. `3` is the version number of the state; `3` is typical for simple
   dtypes; datetimes and others with metadata will bump this to `4` and use
   a 9-element tuple instead of this 8-element tuple.
1. `'<'` is the endianness flag.
2. If there are subarrays
   <https://numpy.org/doc/stable/reference/arrays.dtypes.html#index-7>
   (e.g. `np.dtype((np.int32, (2,2)))`), that info goes here.
3. If there are fields, a tuple of the names of the fields.
4. If there are fields, the field descriptor dict.
5. If extended dtype (e.g. fields, strings, void, etc.), the element size,
   else `-1`.
6. If extended dtype, the alignment flag, else `-1`.
7. The `flags` bit-flags; see `np.dtype.flags.__doc__`.
8. If datetime or with metadata, that metadata goes here, else absent.

-- 
Robert Kern
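To make the state-tuple layout listed above easier to map onto a reader
implementation, here is a minimal sketch in plain Python (it does not import
NumPy). The names `DTypeState` and `label_state`, and all of the field names,
are hypothetical and chosen only for illustration; they are not part of the
NumPy API. The sketch assumes only what is described above: an 8-element
state, or a 9-element one when metadata is present.

```python
from typing import Any, NamedTuple, Optional


class DTypeState(NamedTuple):
    """Labels for the tuple passed to np.dtype.__setstate__().

    Field names are illustrative only; NumPy itself does not name them.
    """
    version: int             # 0: state version (3 for simple dtypes, 4 with metadata)
    byteorder: str           # 1: endianness flag, e.g. '<'
    subarray: Any            # 2: subarray info, or None
    names: Optional[tuple]   # 3: tuple of field names, or None
    fields: Optional[dict]   # 4: field descriptor dict, or None
    itemsize: int            # 5: element size for extended dtypes, else -1
    alignment: int           # 6: alignment flag for extended dtypes, else -1
    flags: int               # 7: the dtype `flags` bit-flags
    metadata: Any = None     # 8: only present in 9-element states


def label_state(state: tuple) -> DTypeState:
    """Attach names to a dtype state tuple, checking only its length."""
    if len(state) not in (8, 9):
        raise ValueError(f"expected 8 or 9 elements, got {len(state)}")
    return DTypeState(*state)


# The state shown above for np.dtype("int32").
state = (3, '<', None, None, None, -1, -1, 0)
labeled = label_state(state)
print(labeled)
```

A reader written in another language could follow the same shape: decode the
tuple positionally, treat the ninth element as optional, and reject any other
length rather than guessing.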
NumPy-Discussion mailing list -- numpy-discussion@python.org