Hi all--
I've been getting started with Parquet as a storage alternative to HDF5 and
it has a lot of attractive quantities including compression flexibility
efficiency.
But I'm stumped for storage efficiency in Parquet with one type of data
that I have.
This is a large series of "ragged" packets arriving as a stream, where each
packet consists of up to 255 bytes of binary data. The vast majority of the
packets have lengths between 96 and 112 bytes. I need to store each of them
with a 64-bit timestamp.
I can get a good storage efficiency with HDF5 with the following table
schema using pytables:
class StoredPacket(pt.IsDescription):
timetick = pt.UInt64Col(pos=0)
length = pt.UInt16Col(pos=1)
data = pt.UInt8Col(pos=2,shape=(255,))
This stores packet data as an array of uint8 with length 255. I zero-pad
the packet to length 255 and store the length as well in a separate column.
I have created a sample file in a Github gist:
https://gist.github.com/jason-sachs/aa6dbdaced806bb76bc7a347dfc303dc (see
test1.h5) along with a Python script convert_test1.py that converts it to a
Pandas DataFrame and stores it via Parquet. But the Parquet files are
almost twice as large as the .h5 file no matter what storage technique I
use; brotli is best but slow, and zstd is almost as good as brotli but much
faster.
Any suggestions on how I might improve storage efficiency in Parquet? I
have a lot of flexibility with how I can store the data; my only
requirement is that I can retrieve the data packets quickly from the
storage file. I offer this sample file as a test case.
(py3) C:\tmp\git\dv\test-h5-gist>python convert_test1.py
Table overview:
timetick length data
0 16 99 b'\x00\x00\x00\x98:B\x1a\xbev\x90\xb2\x00\x00\...
1 32 99 b'\x01\x08\x00\xbf:\x8b\x1a{r=\xb2\x88\x00\t\x...
2 48 99 b'\x02\x10\x00\xe7:\x9c\x1c\x1at:\xb3\x10\x01\...
3 64 99 b"\x03\x18\x00\x0f;\x16\x1bOt|\xb2\x98\x01\x19...
4 80 99 b'\x04 \x007;c\x1b\xddt~\xb2 \x02!\x00<;x\x1a\...
... ... ... ...
16413 262080 99 b'{\xd8\xff\x1d+\xe6\xc5H)r\xc1X\xfd\xd9\xff +...
16414 262096 99 b'|\xe0\xff6+g\xc5A,\x0c\xc3\xe0\xfd\xe1\xff9+...
16415 262112 99 b'}\xe8\xffN+\xd3\xc4")D\xc2h\xfe\xe9\xffQ+M\x...
16416 262128 99 b"~\xf0\xffg+=\xc5E';\xc2\xf0\xfe\xf1\xffj+\xf...
16417 262144 99 b"\x7f\xf8\xff\x81+\x13\xc4\xdd'\x15\xc2x\xff\...
[16418 rows x 3 columns]
Packets with tags >= 128:
timetick length data
179 2864 36 b"\xca'Twas brillig, and the slithy toves\x00\...
307 4896 35 b'\xca Did gyre and gimble in the wabe:\x00\x...
340 5408 30 b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
362 5744 31 b'\xca And the mome raths outgrabe.\x00\x00\x...
651 10352 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
1403 22368 32 b'\xca"Beware the Jabberwock, my son!\x00\x00\...
1741 27760 44 b'\xca The jaws that bite, the claws that cat...
2115 33728 33 b'\xcaBeware the Jubjub bird, and shun\x00\x00...
2162 34464 30 b'\xca The frumious Bandersnatch!"\x00\x00\x0...
2278 36304 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
2405 38320 34 b'\xcaHe took his vorpal sword in hand:\x00\x0...
2675 42624 41 b'\xca Long time the manxome foe he sought --...
2896 46144 33 b'\xcaSo rested he by the Tumtum tree,\x00\x00...
3611 57568 31 b'\xca And stood awhile in thought.\x00\x00\x...
4089 65200 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
5231 83456 36 b'\xcaAnd, as in uffish thought he stood,\x00\...
5236 83520 38 b'\xca The Jabberwock, with eyes of flame,\x0...
5427 86560 40 b'\xcaCame whiffling through the tulgey wood,\...
6904 110176 26 b'\xca And burbled as it came!\x00\x00\x00\x0...
7003 111744 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
7286 116256 44 b'\xcaOne, two! One, two! And through and thro...
8226 131280 39 b'\xca The vorpal blade went snicker-snack!\x...
8370 133568 35 b'\xcaHe left it dead, and with its head\x00\x...
8849 141216 27 b'\xca He went galumphing back.\x00\x00\x00\x...
10326 164832 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
11867 189472 37 b'\xca"And, has thou slain the Jabberwock?\x00...
12392 197856 35 b'\xca Come to my arms, my beamish boy!\x00\x...
12936 206544 34 b"\xcaO frabjous day! Callooh! Callay!'\x00\x0...
13794 220256 26 b'\xca He chortled in his joy.\x00\x00\x00\x0...
13905 222016 1 b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
14690 234560 36 b"\xca'Twas brillig, and the slithy toves\x00\...
15317 244576 35 b'\xca Did gyre and gimble in the wabe;\x00\x...
15840 252928 30 b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
16339 260896 31 b'\xca And the mome raths outgrabe.\x00\x00\x...
(py3) C:\tmp\git\dv\test-h5-gist>ls -l test1.*
-rw-rw-rw- 1 user group 908773 Nov 2 13:07 test1.h5
-rw-rw-rw- 1 user group 1611025 Nov 2 13:35 test1.pq
(py3) C:\tmp\git\dv\test-h5-gist>h5ls -v -r test1.h5
Opened "test1.h5" with sec2 driver.
/ Group
Attribute: CLASS scalar
Type: 5-byte null-terminated UTF-8 string
Data: "GROUP"
Attribute: PYTABLES_FORMAT_VERSION scalar
Type: 3-byte null-terminated UTF-8 string
Data: "2.1"
Attribute: TITLE null
Type: 1-byte null-terminated UTF-8 string
Attribute: VERSION scalar
Type: 3-byte null-terminated UTF-8 string
Data: "1.0"
Location: 1:96
Links: 1
/data Group
Attribute: CLASS scalar
Type: 5-byte null-terminated UTF-8 string
Data: "GROUP"
Attribute: TITLE null
Type: 1-byte null-terminated UTF-8 string
Attribute: VERSION scalar
Type: 3-byte null-terminated UTF-8 string
Data: "1.0"
Location: 1:1024
Links: 1
/data/packets Dataset {16418/Inf}
Attribute: CLASS scalar
Type: 5-byte null-terminated UTF-8 string
Data: "TABLE"
Attribute: FIELD_0_FILL scalar
Type: native unsigned long long
Data: 0
Attribute: FIELD_0_NAME scalar
Type: 8-byte null-terminated UTF-8 string
Data: "timetick"
Attribute: FIELD_1_FILL scalar
Type: native unsigned short
Data: 0
Attribute: FIELD_1_NAME scalar
Type: 6-byte null-terminated UTF-8 string
Data: "length"
Attribute: FIELD_2_FILL scalar
Type: native unsigned char
Data: 0
Attribute: FIELD_2_NAME scalar
Type: 4-byte null-terminated UTF-8 string
Data: "data"
Attribute: NROWS scalar
Type: native long long
Data: 16418
Attribute: TITLE null
Type: 1-byte null-terminated UTF-8 string
Attribute: VERSION scalar
Type: 3-byte null-terminated UTF-8 string
Data: "2.7"
Location: 1:2216
Links: 1
Chunks: {247} 65455 bytes
Storage: 4350770 logical bytes, 899061 allocated bytes, 483.92%
utilization
Filter-0: shuffle-2 OPT {265}
Filter-1: deflate-1 OPT {5}
Type: struct {
"timetick" +0 native unsigned long long
"length" +8 native unsigned short
"data" +10 [255] native unsigned char
} 265 bytes