On 8/25/22 12:25, Robert Kern wrote:
No one is really proposing another format, just a minor tweak to the
existing NPY format.
Agreed. I was just following up on the previous comment about alternative
formats (such as HDF5) and the pros/cons of NPY.
I don't quite know what this means. My installed version of `jq`, for
example, doesn't seem to know what to do with these files.
❯ jq --version
jq-1.6
❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38
The .jdb files are binary JSON files (specifically BJData), which jq does
not currently support. To save as text-based JSON instead, change the
suffix to .json or .jdt; the file grows by ~33% compared to the binary
because the compressed data is base64-encoded.
jd.save(y, 'eye5chunk_bjd_zlib.jdt', {'compression':'zlib'});
13694 Aug 25 12:54 eye5chunk_bjd_zlib.jdt
10338 Aug 25 15:41 eye5chunk_bjd_zlib.jdb
jq . eye5chunk_bjd_zlib.jdt
[
{
"_ArrayType_": "double",
"_ArraySize_": [
200,
1000
],
"_ArrayZipType_": "zlib",
"_ArrayZipSize_": [
1,
200000
],
"_ArrayZipData_": "..."
},
...
]
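As a rough illustration of where the ~33% comes from: base64 emits 4 ASCII
characters for every 3 input bytes. The sketch below uses only the Python
standard library (not the jdata package), and the 200x1000 matrix with ones
on the diagonal is a stand-in I am assuming for the test data, so the exact
byte counts will differ from the files listed above.

```python
import base64
import struct
import zlib

# Stand-in data (an assumption): 200x1000 float64 values, ones on the diagonal.
rows, cols = 200, 1000
values = [1.0 if r == c else 0.0 for r in range(rows) for c in range(cols)]
raw = struct.pack(f"<{len(values)}d", *values)  # 1,600,000 bytes, little-endian

compressed = zlib.compress(raw)         # bytes a binary container can hold as-is
encoded = base64.b64encode(compressed)  # ASCII a text JSON string has to carry

overhead = len(encoded) / len(compressed) - 1.0
print(len(compressed), len(encoded), f"{overhead:.1%}")
```

The overhead applies only to the encoded payload; the surrounding JSON keys
and brackets add a few more bytes on top.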
I think a fundamental problem here is that it looks like each element
in the array is delimited. I.e. a `float64` value starts with b'D'
then the 8 IEEE-754 bytes representing the number. When we're talking
about memory-mappability, we are talking about having the on-disk
representation being exactly what it looks like in-memory, all of the
IEEE-754 floats contiguous with each other, so we can use the
`np.memmap` `ndarray` subclass to represent the on-disk data as a
first-class array object. This spec lets us mmap the binary JSON file
and manipulate its contents in-place efficiently, but that's not what
is being asked for here.
There are several BJData-compliant ways to store the same binary array
losslessly. The most memory-efficient and disk-mmap-able (but not
necessarily disk-efficient) form uses the ND-array container syntax
<https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md#optimized-n-dimensional-array-of-uniform-type>
that the BJData spec adds on top of UBJSON.
For example, a 100x200x300 3D float64 ($D) array can be stored as below
(numbers are stored in binary form; the whitespace is only for
readability and is not part of the file)
[$D #[$u#U3 100 200 300 value0 value1 ...
where the "value_i"s form a contiguous (row-major) binary stream of the
float64 buffer, without the per-element marker ('D'), because the type is
absorbed into the optimized header
<https://ubjson.org/type-reference/container-types/#optimized-format> of
the array "[" by the type marker "$". The data chunk is mmap-able. If you
want a predetermined initial offset, you can force the dimension vector
(#[$u #U 3 100 200 300) to use a fixed integer type (the $u part) that is
large enough, for example uint32 (m); the starting offset of the binary
stream is then entirely predictable.
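To make the mmap claim concrete, here is a minimal hand-rolled sketch of
the layout described above, using only the Python standard library. It is
not the bjdata/jdata API: write_ndarray and flat_index are hypothetical
helpers, little-endian byte order is assumed, and a small 2x3x4 array
stands in for the 100x200x300 one.

```python
import mmap
import os
import struct
import tempfile

MARKERS = b"[$D#[$u#U"  # float64 array, uint16 dimensions, uint8 dimension count

def write_ndarray(path, shape, values):
    """Write the header plus a contiguous row-major float64 payload.

    Returns the byte offset at which the payload starts.
    """
    header = (MARKERS
              + struct.pack("<B", len(shape))
              + struct.pack(f"<{len(shape)}H", *shape))
    with open(path, "wb") as f:
        f.write(header)
        f.write(struct.pack(f"<{len(values)}d", *values))
    return len(header)

shape = (2, 3, 4)
values = [float(i) for i in range(2 * 3 * 4)]
path = os.path.join(tempfile.mkdtemp(), "demo_ndarray.bjd")
offset = write_ndarray(path, shape, values)

def flat_index(i, j, k):
    # Row-major position of element (i, j, k) in the flat payload.
    return (i * shape[1] + j) * shape[2] + k

# Map the file and modify one element in place, with no parsing of the values.
with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
    pos = offset + 8 * flat_index(1, 2, 3)
    before = struct.unpack_from("<d", mm, pos)[0]
    struct.pack_into("<d", mm, pos, -1.0)
    mm.flush()

# Re-read through ordinary file I/O to confirm the change reached the file.
with open(path, "rb") as f:
    f.seek(pos)
    after = struct.unpack("<d", f.read(8))[0]

print(offset, before, after)
```

With NumPy, the same payload could presumably be opened as
np.memmap(path, dtype='<f8', mode='r+', offset=offset, shape=shape).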
Multiple ND arrays can be directly appended to the root level, for example,
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
can store 100x200x300 chunks of a 400x200x300 array
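If every chunk uses the same header form, the payload offsets of the
appended chunks follow by simple arithmetic. The sketch below assumes the
records are written back to back with nothing in between, uint16
dimensions and a uint8 dimension count; record_size and payload_offset are
hypothetical helper names, not part of any library.

```python
import math

HEADER_PREFIX = len(b"[$D#[$u#U")  # 9 marker bytes

def record_size(shape, dim_bytes=2, item_bytes=8):
    """Return (header_bytes, total_bytes) for one ND-array record."""
    header = HEADER_PREFIX + 1 + dim_bytes * len(shape)  # +1 for the dim count
    return header, header + item_bytes * math.prod(shape)

def payload_offset(n, shape):
    """Byte offset of the payload of the n-th (0-based) appended chunk."""
    header, record = record_size(shape)
    return n * record + header

chunk_shape = (100, 200, 300)
offsets = [payload_offset(n, chunk_shape) for n in range(4)]
print(offsets)
```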
Alternatively, one can also use an annotated format (in JSON form:
{"_ArrayType_":"double","_ArraySize_":[100,200,300],"_ArrayData_":[value1,value2,...]})
to store everything in a 1D contiguous buffer
{ U11 _ArrayType_ S U6 double U11 _ArraySize_ [$u#U3 100 200 300 U11
_ArrayData_ [$D #m 6000000 value1 value2 ...}
The contiguous buffer in the _ArrayData_ section is also disk-mmap-able;
you can also impose additional requirements on the array metadata to
ensure a predictable initial offset, if desired.
Similarly, these annotated chunks can be appended in either JSON or
binary JSON form, and the parsers can automatically handle both forms
and convert them into the desired binary ND array with the expected type
and dimensions.
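For the text JSON form, the conversion a parser has to perform can be
sketched in a few lines. This is only an illustration with the standard
json module, not the jdata decoder: decode_annotated is a hypothetical
helper, it handles only the "double" type, and it assumes _ArrayData_ is
stored in row-major order.

```python
import json

def decode_annotated(obj):
    """Rebuild nested lists from an annotated-array dict (row-major assumed)."""
    if obj["_ArrayType_"] != "double":
        raise ValueError("this sketch only handles 'double'")
    shape = obj["_ArraySize_"]
    flat = [float(v) for v in obj["_ArrayData_"]]

    expected = 1
    for n in shape:
        expected *= n
    if len(flat) != expected:
        raise ValueError("data length does not match _ArraySize_")

    def build(start, dims):
        # Innermost dimension: slice directly out of the flat buffer.
        if len(dims) == 1:
            return flat[start:start + dims[0]]
        stride = 1
        for n in dims[1:]:
            stride *= n
        return [build(start + i * stride, dims[1:]) for i in range(dims[0])]

    return build(0, shape)

text = ('{"_ArrayType_":"double","_ArraySize_":[2,3,2],'
        '"_ArrayData_":[0,1,2,3,4,5,6,7,8,9,10,11]}')
nested = decode_annotated(json.loads(text))
print(nested[1][2][1])
```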
here are the output file sizes in bytes:
8000128 eye5chunk.npy
5004297 eye5chunk_bjd_raw.jdb
Just a note: This difference is solely due to a special representation
of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes
0.0 as a special value and uses the `float32` encoding of it). If you
had any other value making up the bulk of the file, this would be
larger than the NPY due to the additional delimiter b'D'.
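The size argument can be checked with simple arithmetic. The sketch below
assumes 1,000,000 float64 values behind a 128-byte NPY header (inferred
from the 8000128-byte file), a 1-byte marker per delimited element, and it
ignores any container overhead in the .jdb file; delimited_size is a
hypothetical helper.

```python
N_VALUES = 1_000_000  # inferred: (8000128 - 128) / 8
NPY_HEADER = 128      # assumed header size of the NPY file

def delimited_size(n_compact, n_full):
    """Bytes for per-element encoding: marker + float32 or marker + float64."""
    return n_compact * (1 + 4) + n_full * (1 + 8)

npy_size = NPY_HEADER + 8 * N_VALUES
all_compact = delimited_size(N_VALUES, 0)  # every value fits the 5-byte form
all_full = delimited_size(0, N_VALUES)     # every value needs 'D' + 8 bytes

print(npy_size, all_compact, all_full)
```

Under these assumptions the all-compact case is close to the reported
5004297 bytes, and the all-full case exceeds the NPY file by roughly 1 MB.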
The two BJData forms that I mentioned above (ND-array syntax or
annotated array) preserve the original precision and shape in
round trips. BJData follows the recommendation of the UBJSON spec and
automatically reduces data size
<https://ubjson.org/type-reference/value-types/#:~:text=smallest%20numeric%20type>
only when there is no precision loss (such as for integers or zeros), and
this reduction is optional.
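The "only when there is no precision loss" rule can be illustrated with a
round-trip test. This is a sketch of the idea, not the actual logic of any
BJData encoder: fits_float32 and encode_float are hypothetical helpers,
and little-endian byte order is assumed.

```python
import struct

def fits_float32(x):
    """True if x survives a float64 -> float32 -> float64 round trip unchanged."""
    try:
        narrowed = struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        # Magnitude too large for float32.
        return False
    return narrowed == x

def encode_float(x):
    """Use marker 'd' + 4 bytes when lossless, otherwise 'D' + 8 bytes."""
    if fits_float32(x):
        return b"d" + struct.pack("<f", x)
    return b"D" + struct.pack("<d", x)

for v in (0.0, 1.0, 0.1, 1e300):
    print(v, len(encode_float(v)))
```

Values such as 0.0 and 1.0 take 5 bytes here, while 0.1 keeps the full
9-byte form because float32 cannot represent it exactly.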
  10338 eye5chunk_bjd_zlib.jdb
   2206 eye5chunk_bjd_lzma.jdb
Qianqian
--
Robert Kern
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: fan...@gmail.com