[Numpy-discussion] Re: An extension of the .npy file format

Qianqian Fang Thu, 25 Aug 2022 12:51:26 -0700

On 8/25/22 12:25, Robert Kern wrote:

No one is really proposing another format, just a minor tweak to the existing NPY format.

agreed. I was just following the previous comment on alternative formats (such as hdf5) and pros/cons of npy.

I don't quite know what this means. My installed version of `jq`, for example, doesn't seem to know what to do with these files.
❯ jq --version
jq-1.6

❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38

the .jdb files are binary JSON files (specifically BJData) that jq does not currently support; to save as text-based JSON, change the suffix to .json or .jdt. This results in a ~33% size increase compared to the binary form, due to base64 encoding of the compressed data.


jd.save(y, 'eye5chunk_bjd_zlib.jdt',  {'compression':'zlib'});

13694 Aug 25 12:54 eye5chunk_bjd_zlib.jdt
10338 Aug 25 15:41 eye5chunk_bjd_zlib.jdb

jq . eye5chunk_bjd_zlib.jdt

[
  {
    "_ArrayType_": "double",
    "_ArraySize_": [
      200,
      1000
    ],
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [
      1,
      200000
    ],
    "_ArrayZipData_": "..."
  },
  ...

]
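The ~33% figure quoted above follows from base64 itself: every 3 raw bytes become 4 text characters. A minimal sketch using only the Python standard library (the helper name `base64_size` is made up for illustration and is not part of any BJData or jdata API):

```python
import base64
import math


def base64_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes raw bytes."""
    return 4 * math.ceil(n_bytes / 3)


# An arbitrary 3000-byte binary payload standing in for compressed data.
payload = bytes(3000)
encoded = base64.b64encode(payload)

# The closed-form size matches what the encoder actually produces.
assert len(encoded) == base64_size(len(payload))

# Text-to-binary ratio for the payload alone.
ratio_payload = len(encoded) / len(payload)

# Ratio of the two file sizes listed above (.jdt vs .jdb); this one also
# includes the JSON keys and punctuation, so it need not match exactly.
ratio_files = 13694 / 10338

print(ratio_payload, ratio_files)
```

Both ratios land close to 4/3, which is consistent with the base64 payload dominating the .jdt file.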

I think a fundamental problem here is that it looks like each element in the array is delimited. I.e. a `float64` value starts with b'D' then the 8 IEEE-754 bytes representing the number. When we're talking about memory-mappability, we are talking about having the on-disk representation being exactly what it looks like in-memory, all of the IEEE-754 floats contiguous with each other, so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.
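To make the memory-mappability requirement concrete, here is a small sketch of what "on-disk representation exactly matches in-memory" looks like with `np.memmap`. The file name and the tiny 3x4 shape are arbitrary choices for illustration, not taken from the thread:

```python
import os
import tempfile

import numpy as np

# A small float64 array written as a raw, contiguous little-endian buffer:
# 12 values x 8 bytes each, with no per-element markers in between.
a = np.arange(12, dtype="<f8").reshape(3, 4)
path = os.path.join(tempfile.mkdtemp(), "raw_f8.bin")
a.tofile(path)

# Map the file as a first-class array object and edit it in place.
m = np.memmap(path, dtype="<f8", mode="r+", shape=(3, 4))
m[0, 0] = 42.0
m.flush()

# Re-read the file independently to confirm the change reached the disk.
b = np.fromfile(path, dtype="<f8").reshape(3, 4)
print(b[0, 0], os.path.getsize(path))
```

If each value were preceded by a one-byte marker, the 8-byte values would no longer sit at a regular 8-byte stride and this mapping would not be possible.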

there are several BJData-compliant forms to store the same binary array losslessly. The most memory-efficient and disk-mmap-able (but not necessarily disk-efficient) form is to use the ND-array container syntax <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md#optimized-n-dimensional-array-of-uniform-type> that the BJData spec extended over UBJSON.

For example, a 100x200x300 3D float64 ($D) array can be stored as below (numbers are stored in binary form; white space should be removed)


|[$D #[$u#U3 100 200 300 value0 value1 ...|

where the "value_i"s form a contiguous (row-major) binary stream of the float64 buffer without the per-element marker ('D'), because it is absorbed into the optimized header <https://ubjson.org/type-reference/container-types/#optimized-format> of the array "[" following the type "$" marker. The data chunk is mmap-able, although if you desire a pre-determined initial offset, you can force the dimension vector (#[$u#U3 100 200 300) to use an integer type ($u) large enough, for example uint32 (m), so that the starting offset of the binary stream is entirely predictable.
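As a rough illustration of the layout just described, the sketch below hand-assembles the header bytes for a small array and maps the payload with `np.memmap` at the resulting offset. It is not produced or checked by any BJData library; the little-endian byte order and the exact byte sequence are assumptions based on the notation in this message, and the 2x3x4 shape is chosen only to keep the file small:

```python
import os
import struct
import tempfile

import numpy as np

shape = (2, 3, 4)
data = np.arange(24, dtype="<f8").reshape(shape)

# Assumed header layout, following the notation above:
#   [ $ D # [ $ u # U <ndim as uint8> <each dim as uint16>
# Little-endian integers are an assumption of this sketch.
header = (
    b"[$D#[$u#U"
    + struct.pack("<B", len(shape))
    + struct.pack("<%dH" % len(shape), *shape)
)
offset = len(header)  # where the contiguous float64 stream begins

path = os.path.join(tempfile.mkdtemp(), "ndarray_sketch.jdb")
with open(path, "wb") as f:
    f.write(header)
    f.write(data.tobytes(order="C"))  # row-major payload, no 'D' markers

# Map only the payload; the header is skipped via the byte offset.
view = np.memmap(path, dtype="<f8", mode="r", shape=shape, offset=offset)
print(offset, view.shape)
```

With a fixed integer type for the dimension vector, `offset` depends only on the number of dimensions, which is the predictability mentioned above.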


multiple ND arrays can be directly appended to the root level, for example,

|[$D #[$u#U3 100 200 300 value0 value1 ...|
|[$D #[$u#U3 100 200 300 value0 value1 ...|
|[$D #[$u#U3 100 200 300 value0 value1 ...|
|[$D #[$u#U3 100 200 300 value0 value1 ...|

can store 100x200x300 chunks of a 400x200x300 array
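A back-of-the-envelope sketch of what appending equal-sized chunks implies for byte offsets. The 16-byte header length assumes the same hand-derived layout as the notation above (uint8 count, uint16 dimensions) and is not taken from a BJData implementation; the concatenation check uses tiny stand-in arrays to avoid allocating the full data:

```python
import numpy as np

chunk_shape = (100, 200, 300)
n_chunks = 4

# Assumed header size: 9 marker bytes + 1 count byte + 3 uint16 dims.
header_len = len(b"[$D#[$u#U") + 1 + 2 * len(chunk_shape)
payload_len = 8 * int(np.prod(chunk_shape))  # float64 bytes per chunk
record_len = header_len + payload_len

# Byte offset of the float64 stream inside each appended record.
data_offsets = [k * record_len + header_len for k in range(n_chunks)]
print(data_offsets)

# Stacking the chunks along the first axis recovers the full shape.
full_shape = (n_chunks * chunk_shape[0],) + chunk_shape[1:]

# Same idea at a tiny scale: four 1x2x3 chunks form a 4x2x3 array.
small = [np.full((1, 2, 3), float(k)) for k in range(n_chunks)]
stacked = np.concatenate(small, axis=0)
print(full_shape, stacked.shape)
```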

alternatively, one can also use an annotated format (in JSON form: |{"_ArrayType_":"double","_ArraySize_":[100,200,300],"_ArrayData_":[value1,value2,...]}|) to store everything in a 1D contiguous buffer

|{U11 _ArrayType_ S U6 double U11 _ArraySize_ [$u#U3 100 200 300 U11 _ArrayData_ [$D #m 6000000 value1 value2 ...}|

The contiguous buffer in the _ArrayData_ section is also disk-mmap-able; you can also impose additional requirements on the array metadata to ensure a predictable initial offset, if desired.

similarly, these annotated chunks can be appended in either JSON or binary JSON forms, and the parsers can automatically handle both forms and convert them into the desired binary ND array with the expected type and dimensions.

    here are the output file sizes in bytes:

    |8000128  eye5chunk.npy|
    |5004297  eye5chunk_bjd_raw.jdb|

Just a note: This difference is solely due to a special representation of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as a special value and uses the `float32` encoding of it). If you had any other value making up the bulk of the file, this would be larger than the NPY due to the additional delimiter b'D'.
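The arithmetic behind this note can be checked quickly. The sketch assumes the array holds 10^6 float64 elements, which is consistent with the 8000128-byte NPY file being 8 * 10^6 data bytes plus a 128-byte header; that element count is inferred here, not stated in the thread:

```python
n_elements = 1_000_000   # assumed: (8000128 - 128) / 8
npy_size = 8000128       # reported size of eye5chunk.npy
raw_jdb_size = 5004297   # reported size of eye5chunk_bjd_raw.jdb

# Plain NPY: 8 bytes per element plus the header.
npy_header = npy_size - 8 * n_elements

# Every element stored as a 1-byte marker + 4-byte float32 (the zero case).
all_marked_float32 = 5 * n_elements

# Every element stored as a 1-byte 'D' marker + 8-byte float64.
all_marked_float64 = 9 * n_elements

print(npy_header, all_marked_float32, all_marked_float64)
print(raw_jdb_size - all_marked_float32)  # bytes not explained by 5-byte zeros
```

The 5-bytes-per-element estimate lands within a few kilobytes of the reported .jdb size, while the 9-bytes-per-element case would exceed the NPY size, which matches the point being made.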

the two BJData forms that I mentioned above (ND-array syntax or annotated array) will preserve the original precision/shape in round-trips. BJData follows the recommendations of the UBJSON spec and automatically reduces data size <https://ubjson.org/type-reference/value-types/#:~:text=smallest%20numeric%20type> only when there is no precision loss (such as for integers or zeros), but this is optional.

    |  10338  eye5chunk_bjd_zlib.jdb|
    |   2206  eye5chunk_bjd_lzma.jdb|

    Qianqian

--
Robert Kern

_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: fan...@gmail.com
