Hi all,

I've been getting started with Parquet as a storage alternative to HDF5, and it has a lot of attractive qualities, including flexible and efficient compression.
But I'm stumped on how to get good storage efficiency in Parquet with one type of data that I have. This is a large series of "ragged" packets arriving as a stream, where each packet consists of up to 255 bytes of binary data. The vast majority of the packets have lengths between 96 and 112 bytes. I need to store each of them with a 64-bit timestamp.

I can get good storage efficiency with HDF5 with the following table schema using pytables:

    import tables as pt  # PyTables

    class StoredPacket(pt.IsDescription):
        timetick = pt.UInt64Col(pos=0)
        length = pt.UInt16Col(pos=1)
        data = pt.UInt8Col(pos=2,shape=(255,))

This stores packet data as an array of uint8 with length 255. I zero-pad the packet to length 255 and store the length as well in a separate column.

I have created a sample file in a Github gist: https://gist.github.com/jason-sachs/aa6dbdaced806bb76bc7a347dfc303dc (see test1.h5) along with a Python script convert_test1.py that converts it to a Pandas DataFrame and stores it via Parquet.

But the Parquet files are almost twice as large as the .h5 file no matter what storage technique I use (in the listing below, test1.pq is 1611025 bytes versus 908773 bytes for test1.h5); brotli is best but slow, and zstd is almost as good as brotli but much faster.

Any suggestions on how I might improve storage efficiency in Parquet? I have a lot of flexibility with how I can store the data; my only requirement is that I can retrieve the data packets quickly from the storage file. I offer this sample file as a test case.

    (py3) C:\tmp\git\dv\test-h5-gist>python convert_test1.py
    Table overview:
           timetick  length  data
    0            16      99  b'\x00\x00\x00\x98:B\x1a\xbev\x90\xb2\x00\x00\...
    1            32      99  b'\x01\x08\x00\xbf:\x8b\x1a{r=\xb2\x88\x00\t\x...
    2            48      99  b'\x02\x10\x00\xe7:\x9c\x1c\x1at:\xb3\x10\x01\...
    3            64      99  b"\x03\x18\x00\x0f;\x16\x1bOt|\xb2\x98\x01\x19...
    4            80      99  b'\x04 \x007;c\x1b\xddt~\xb2 \x02!\x00<;x\x1a\...
    ...         ...     ...  ...
    16413    262080      99  b'{\xd8\xff\x1d+\xe6\xc5H)r\xc1X\xfd\xd9\xff +...
    16414    262096      99  b'|\xe0\xff6+g\xc5A,\x0c\xc3\xe0\xfd\xe1\xff9+...
    16415    262112      99  b'}\xe8\xffN+\xd3\xc4")D\xc2h\xfe\xe9\xffQ+M\x...
    16416    262128      99  b"~\xf0\xffg+=\xc5E';\xc2\xf0\xfe\xf1\xffj+\xf...
    16417    262144      99  b"\x7f\xf8\xff\x81+\x13\xc4\xdd'\x15\xc2x\xff\...

    [16418 rows x 3 columns]
    Packets with tags >= 128:
           timetick  length  data
    179        2864      36  b"\xca'Twas brillig, and the slithy toves\x00\...
    307        4896      35  b'\xca Did gyre and gimble in the wabe:\x00\x...
    340        5408      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
    362        5744      31  b'\xca And the mome raths outgrabe.\x00\x00\x...
    651       10352       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
    1403      22368      32  b'\xca"Beware the Jabberwock, my son!\x00\x00\...
    1741      27760      44  b'\xca The jaws that bite, the claws that cat...
    2115      33728      33  b'\xcaBeware the Jubjub bird, and shun\x00\x00...
    2162      34464      30  b'\xca The frumious Bandersnatch!"\x00\x00\x0...
    2278      36304       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
    2405      38320      34  b'\xcaHe took his vorpal sword in hand:\x00\x0...
    2675      42624      41  b'\xca Long time the manxome foe he sought --...
    2896      46144      33  b'\xcaSo rested he by the Tumtum tree,\x00\x00...
    3611      57568      31  b'\xca And stood awhile in thought.\x00\x00\x...
    4089      65200       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
    5231      83456      36  b'\xcaAnd, as in uffish thought he stood,\x00\...
    5236      83520      38  b'\xca The Jabberwock, with eyes of flame,\x0...
    5427      86560      40  b'\xcaCame whiffling through the tulgey wood,\...
    6904     110176      26  b'\xca And burbled as it came!\x00\x00\x00\x0...
    7003     111744       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
    7286     116256      44  b'\xcaOne, two! One, two! And through and thro...
    8226     131280      39  b'\xca The vorpal blade went snicker-snack!\x...
    8370     133568      35  b'\xcaHe left it dead, and with its head\x00\x...
    8849     141216      27  b'\xca He went galumphing back.\x00\x00\x00\x...
    10326    164832       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
    11867    189472      37  b'\xca"And, has thou slain the Jabberwock?\x00...
    12392    197856      35  b'\xca Come to my arms, my beamish boy!\x00\x...
    12936    206544      34  b"\xcaO frabjous day! Callooh! Callay!'\x00\x0...
    13794    220256      26  b'\xca He chortled in his joy.\x00\x00\x00\x0...
    13905    222016       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
    14690    234560      36  b"\xca'Twas brillig, and the slithy toves\x00\...
    15317    244576      35  b'\xca Did gyre and gimble in the wabe;\x00\x...
    15840    252928      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
    16339    260896      31  b'\xca And the mome raths outgrabe.\x00\x00\x...

    (py3) C:\tmp\git\dv\test-h5-gist>ls -l test1.*
    -rw-rw-rw- 1 user group  908773 Nov 2 13:07 test1.h5
    -rw-rw-rw- 1 user group 1611025 Nov 2 13:35 test1.pq

    (py3) C:\tmp\git\dv\test-h5-gist>h5ls -v -r test1.h5
    Opened "test1.h5" with sec2 driver.
    /                        Group
        Attribute: CLASS scalar
            Type:      5-byte null-terminated UTF-8 string
            Data:  "GROUP"
        Attribute: PYTABLES_FORMAT_VERSION scalar
            Type:      3-byte null-terminated UTF-8 string
            Data:  "2.1"
        Attribute: TITLE null
            Type:      1-byte null-terminated UTF-8 string
        Attribute: VERSION scalar
            Type:      3-byte null-terminated UTF-8 string
            Data:  "1.0"
        Location:  1:96
        Links:     1
    /data                    Group
        Attribute: CLASS scalar
            Type:      5-byte null-terminated UTF-8 string
            Data:  "GROUP"
        Attribute: TITLE null
            Type:      1-byte null-terminated UTF-8 string
        Attribute: VERSION scalar
            Type:      3-byte null-terminated UTF-8 string
            Data:  "1.0"
        Location:  1:1024
        Links:     1
    /data/packets            Dataset {16418/Inf}
        Attribute: CLASS scalar
            Type:      5-byte null-terminated UTF-8 string
            Data:  "TABLE"
        Attribute: FIELD_0_FILL scalar
            Type:      native unsigned long long
            Data:  0
        Attribute: FIELD_0_NAME scalar
            Type:      8-byte null-terminated UTF-8 string
            Data:  "timetick"
        Attribute: FIELD_1_FILL scalar
            Type:      native unsigned short
            Data:  0
        Attribute: FIELD_1_NAME scalar
            Type:      6-byte null-terminated UTF-8 string
            Data:  "length"
        Attribute: FIELD_2_FILL scalar
            Type:      native unsigned char
            Data:  0
        Attribute: FIELD_2_NAME scalar
            Type:      4-byte null-terminated UTF-8 string
            Data:  "data"
        Attribute: NROWS scalar
            Type:      native long long
            Data:  16418
        Attribute: TITLE null
            Type:      1-byte null-terminated UTF-8 string
        Attribute: VERSION scalar
            Type:      3-byte null-terminated UTF-8 string
            Data:  "2.7"
        Location:  1:2216
        Links:     1
        Chunks:    {247} 65455 bytes
        Storage:   4350770 logical bytes, 899061 allocated bytes, 483.92% utilization
        Filter-0:  shuffle-2 OPT {265}
        Filter-1:  deflate-1 OPT {5}
        Type:      struct {
                       "timetick"         +0    native unsigned long long
                       "length"           +8    native unsigned short
                       "data"             +10   [255] native unsigned char
                   } 265 bytes
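
As a sanity check on the record layout that h5ls reports above (timetick at offset 0, length at offset 8, data at offset 10, 265 bytes per row), here is a minimal standard-library sketch of the zero-padding scheme. It is not part of the gist and does not touch HDF5 or Parquet. The helper names pack_packet and unpack_packet are made up for illustration, and little-endian byte order is an assumption (h5ls only says "native").

```python
import struct

# Assumed layout: little-endian, no alignment padding.
# uint64 timetick (+0), uint16 length (+8), 255 data bytes (+10).
RECORD = struct.Struct("<QH255s")


def pack_packet(timetick, payload):
    """Build one fixed-size record; hypothetical helper for illustration."""
    if len(payload) > 255:
        raise ValueError("payload longer than 255 bytes")
    # The "255s" format zero-pads short payloads on the right.
    return RECORD.pack(timetick, len(payload), payload)


def unpack_packet(record):
    """Recover (timetick, payload), trimming the zero padding via the stored length."""
    timetick, length, padded = RECORD.unpack(record)
    return timetick, padded[:length]


payload = b"\xca'Twas brillig, and the slithy toves"  # row 179 above, length 36
record = pack_packet(2864, payload)

print(RECORD.size)          # 265, matching the struct size from h5ls
print(16418 * RECORD.size)  # 4350770, matching the "logical bytes" figure
print(unpack_packet(record) == (2864, payload))
```

So the uncompressed table is 16418 rows of 265 bytes each, and most of every row is padding that the shuffle and deflate filters squeeze out.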