[ https://issues.apache.org/jira/browse/PARQUET-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou reassigned PARQUET-2414:
---------------------------------------

    Assignee: Antoine Pitrou

> [Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and 
> INT64
> ----------------------------------------------------------------------------------
>
>                 Key: PARQUET-2414
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2414
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Minor
>         Attachments: bss_fp16.png, bss_ints_nyctaxi.png, 
> bss_ints_osm_belgium.png, bss_ints_osm_changesets.png, bss_ints_pypi.png, 
> bss_osm_belgium.png, bss_osm_changesets.png
>
>
> In PARQUET-1622 we added the BYTE_STREAM_SPLIT encoding which, while simple to implement, significantly improves compression efficiency on FLOAT and DOUBLE columns.
> In PARQUET-758 we added the FLOAT16 logical type, which annotates a 2-byte-wide FIXED_LEN_BYTE_ARRAY column to denote that it contains 16-bit IEEE binary floating-point values (colloquially called "half floats").
> This issue proposes to widen the types supported by the BYTE_STREAM_SPLIT 
> encoding. By allowing the BYTE_STREAM_SPLIT encoding on any 
> FIXED_LEN_BYTE_ARRAY column, we can automatically improve compression 
> efficiency on various column types including:
> * half-float data
> * fixed-width decimal data
> Also, by allowing the BYTE_STREAM_SPLIT encoding on any INT32 or INT64 
> column, we can improve compression efficiency on further column types such as 
> timestamps.
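> For reference, here is a minimal, width-agnostic sketch of the BYTE_STREAM_SPLIT transform in NumPy (function names are illustrative, not an actual Parquet implementation); the proposed extension merely allows any fixed byte width rather than only 4 (FLOAT) or 8 (DOUBLE):
> {code}
> import numpy as np
>
> def byte_stream_split_encode(data: bytes, byte_width: int) -> bytes:
>     # View the buffer as (num_values, byte_width) and store it column-major,
>     # so that the k-th byte of every value lands in the k-th stream.
>     return np.frombuffer(data, dtype=np.uint8).reshape(-1, byte_width).T.tobytes()
>
> def byte_stream_split_decode(data: bytes, byte_width: int) -> bytes:
>     # Inverse transform: view as (byte_width, num_values) and transpose back.
>     return np.frombuffer(data, dtype=np.uint8).reshape(byte_width, -1).T.tobytes()
>
> # Round-trip example on a FLOAT16 (2-byte FLBA) column.
> values = np.arange(8, dtype=np.float16).tobytes()
> assert byte_stream_split_decode(byte_stream_split_encode(values, 2), 2) == values
> {code}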
> I've run compression measurements on various pieces of sample data which I 
> detail below.
> h2. Float16 data
> I've downloaded the sample datasets from
> https://userweb.cs.txstate.edu/~burtscher/research/datasets/FPsingle/ , 
> uncompressed them and converted them to half-float using NumPy. Two files had 
> to be discarded because of overflow when converting to half-float.
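> For illustration, the conversion step was roughly as follows (a sketch; it assumes the .sp files are flat little-endian float32 arrays, and the file name is just an example):
> {code}
> import numpy as np
>
> data = np.fromfile("num_brain.sp", dtype=np.float32)
> half = data.astype(np.float16)
> # float32 values beyond the float16 range overflow to +/-inf;
> # files where this happened were discarded.
> if np.isinf(half[np.isfinite(data)]).any():
>     raise ValueError("overflow when converting to half-float")
> {code}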
> I've then run three different compression algorithms (lz4, zstd, snappy), optionally preceded by a BYTE_STREAM_SPLIT encoding with 2 streams (corresponding to the byte width of the FLBA column). Here are the results:
> {code}
> | name           | uncompressed |    lz4 | bss_lz4 | snappy | bss_snappy |   zstd | bss_zstd | bss_ratio_lz4 | bss_ratio_snappy | bss_ratio_zstd |
> |----------------|--------------|--------|---------|--------|------------|--------|----------|---------------|------------------|----------------|
> | msg_sp.sp      |  72526464.00 |   1.42 |    1.94 |   1.38 |       1.78 |   2.28 |     2.71 |          1.37 |             1.30 |           1.18 |
> | msg_sppm.sp    |  69748966.00 |  18.90 |   29.05 |  11.38 |      14.39 |  45.81 |    71.49 |          1.54 |             1.26 |           1.56 |
> | msg_sweep3d.sp |  31432806.00 |   2.06 |    3.20 |   1.03 |       1.94 |  11.77 |    17.00 |          1.55 |             1.89 |           1.44 |
> | num_brain.sp   |  35460000.00 |   1.02 |    1.51 |   1.01 |       1.49 |   1.26 |     1.81 |          1.49 |             1.48 |           1.44 |
> | num_comet.sp   |  26836992.00 |   1.45 |    1.74 |   1.42 |       1.69 |   1.64 |     2.07 |          1.20 |             1.19 |           1.26 |
> | num_control.sp |  39876186.00 |   1.35 |    1.49 |   1.37 |       1.53 |   1.70 |     1.93 |          1.11 |             1.12 |           1.14 |
> | num_plasma.sp  |   8772400.00 | 123.88 |  152.12 |   1.00 |       1.80 | 259.58 |   405.96 |          1.23 |             1.80 |           1.56 |
> | obs_error.sp   |  15540204.00 |   1.05 |    1.51 |   1.02 |       1.46 |   2.06 |     3.55 |          1.44 |             1.43 |           1.72 |
> | obs_info.sp    |   4732632.00 |   1.08 |    1.74 |   1.00 |       1.61 |   2.60 |     3.63 |          1.62 |             1.61 |           1.40 |
> | obs_spitzer.sp |  49545216.00 |   1.00 |    1.01 |   1.00 |       1.01 |   1.22 |     1.35 |          1.01 |             1.01 |           1.11 |
> | obs_temp.sp    |   9983568.00 |   1.00 |    1.00 |   1.00 |       1.00 |   1.08 |     1.17 |          1.00 |             1.00 |           1.08 |
> {code}
> !bss_fp16.png!
> Explanation:
> * the columns "lz4", "snappy", "zstd" show the compression ratio achieved 
> with the respective compressors (i.e. uncompressed size divided by compressed 
> size)
> * the columns "bss_lz4", "bss_snappy", "bss_zstd" are similar, but with a 
> BYTE_STREAM_SPLIT encoding applied first
> * the columns "bss_ratio_lz4", "bss_ratio_snappy", "bss_ratio_zstd" show the 
> additional compression ratio achieved by prepending the BYTE_STREAM_SPLIT 
> encoding step (i.e. PLAIN-encoded compressed size divided by 
> BYTE_STREAM_SPLIT-encoded compressed size).
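> In code, the ratio columns are derived roughly as in the following sketch (shown for the zstd case; the {{zstandard}} binding and the function name are assumptions, not the actual measurement script):
> {code}
> import zstandard
>
> def zstd_ratios(plain_bytes: bytes, bss_bytes: bytes):
>     # plain_bytes: the PLAIN-encoded column; bss_bytes: the same data after
>     # the BYTE_STREAM_SPLIT transform.
>     cctx = zstandard.ZstdCompressor()
>     plain_compressed = cctx.compress(plain_bytes)
>     bss_compressed = cctx.compress(bss_bytes)
>     return (
>         len(plain_bytes) / len(plain_compressed),       # column "zstd"
>         len(plain_bytes) / len(bss_compressed),         # column "bss_zstd"
>         len(plain_compressed) / len(bss_compressed),    # column "bss_ratio_zstd"
>     )
> {code}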
> h3. (reference) Float32 data
> For reference, here are the measurements for the original single-precision 
> floating-point data.
> {code}
> | name           | uncompressed |    lz4 | bss_lz4 | snappy | bss_snappy |   zstd | bss_zstd | bss_ratio_lz4 | bss_ratio_snappy | bss_ratio_zstd |
> |----------------|--------------|--------|---------|--------|------------|--------|----------|---------------|------------------|----------------|
> | msg_sp.sp      | 145052928.00 |   1.00 |    1.45 |   1.00 |       1.39 |   1.12 |     1.66 |          1.46 |             1.39 |           1.48 |
> | msg_sppm.sp    | 139497932.00 |   8.56 |    8.66 |   5.64 |       5.90 |  12.51 |    11.16 |          1.01 |             1.05 |           0.89 |
> | msg_sweep3d.sp |  62865612.00 |   1.01 |    2.80 |   1.02 |       1.68 |   5.50 |     9.41 |          2.76 |             1.66 |           1.71 |
> | num_brain.sp   |  70920000.00 |   1.00 |    1.31 |   1.00 |       1.30 |   1.13 |     1.43 |          1.31 |             1.30 |           1.27 |
> | num_comet.sp   |  53673984.00 |   1.08 |    1.27 |   1.08 |       1.27 |   1.15 |     1.36 |          1.17 |             1.18 |           1.18 |
> | num_control.sp |  79752372.00 |   1.01 |    1.12 |   1.01 |       1.13 |   1.08 |     1.21 |          1.11 |             1.12 |           1.12 |
> | num_plasma.sp  |  17544800.00 |   1.00 |  140.74 |   1.01 |       1.30 | 279.49 |   310.68 |        141.29 |             1.30 |           1.11 |
> | obs_error.sp   |  31080408.00 |   1.12 |    1.37 |   1.16 |       1.29 |   1.73 |     3.10 |          1.22 |             1.11 |           1.80 |
> | obs_info.sp    |   9465264.00 |   1.07 |    1.42 |   1.00 |       1.29 |   2.25 |     3.04 |          1.33 |             1.29 |           1.35 |
> | obs_spitzer.sp |  99090432.00 |   1.02 |    1.11 |   1.01 |       1.12 |   1.20 |     1.31 |          1.09 |             1.10 |           1.09 |
> | obs_temp.sp    |  19967136.00 |   1.00 |    1.12 |   1.00 |       1.13 |   1.08 |     1.19 |          1.12 |             1.13 |           1.10 |
> {code}
> h3. Comments
> The additional efficiency of the BYTE_STREAM_SPLIT encoding step is very significant on most files (except {{obs_temp.sp}}, which generally doesn't compress at all), with additional gains usually around 30%.
> The BYTE_STREAM_SPLIT encoding is, perhaps surprisingly, on average as 
> beneficial on Float16 data as it is on Float32 data.
> h2. Decimal data from OpenStreetMap changesets
> I've downloaded one of the recent OSM changeset files, {{changesets-231030.orc}}, and loaded the four decimal columns from the first stripe of that file. Those columns look like:
> {code}
> pyarrow.RecordBatch
> min_lat: decimal128(9, 7)
> max_lat: decimal128(9, 7)
> min_lon: decimal128(10, 7)
> max_lon: decimal128(10, 7)
> ----
> min_lat: 
> [51.5288506,51.0025063,51.5326805,51.5248871,51.5266800,51.5261841,51.5264130,51.5238914,59.9463692,59.9513092,...,50.8238277,52.1707376,44.2701598,53.1589748,43.5988333,37.7867167,45.5448822,null,50.7998334,50.5653478]
> max_lat: 
> [51.5288620,51.0047760,51.5333176,51.5289383,51.5291901,51.5300598,51.5264130,51.5238914,59.9525642,59.9561501,...,50.8480772,52.1714300,44.3790161,53.1616817,43.6001496,37.7867913,45.5532716,null,51.0188961,50.5691352]
> min_lon: 
> [-0.1465242,-1.0052705,-0.1566335,-0.1485492,-0.1418076,-0.1550623,-0.1539768,-0.1432930,10.7782278,10.7719727,...,10.6863813,13.2218676,19.8840738,8.9128186,1.4030591,-122.4212761,18.6789571,null,-4.2085209,8.6851671]
> max_lon: 
> [-0.1464925,-0.9943439,-0.1541054,-0.1413791,-0.1411505,-0.1453212,-0.1539768,-0.1432930,10.7898550,10.7994537,...,10.7393494,13.2298706,20.2262343,8.9183611,1.4159345,-122.4212503,18.6961594,null,-4.0496079,8.6879264]
> {code}
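> As a side note on representation: with the proposed extension, such decimal columns would be written as FIXED_LEN_BYTE_ARRAY values of 4 bytes (precision 9) or 5 bytes (precision 10) and split bytewise. A small illustration using the first min_lat value above (a sketch, not writer code):
> {code}
> from decimal import Decimal
>
> # A decimal(9, 7) value has at most 9 decimal digits, so its unscaled integer
> # fits in 4 bytes; Parquet stores it as a big-endian two's-complement
> # FIXED_LEN_BYTE_ARRAY, which BYTE_STREAM_SPLIT would split into 4 streams.
> value = Decimal("51.5288506")
> unscaled = int(value.scaleb(7))               # 515288506 (scale = 7)
> flba = unscaled.to_bytes(4, "big", signed=True)
> print(flba.hex())                             # 1eb6adba
> {code}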
> Here are the compression measurements using the same methodology as above. 
> The number of BYTE_STREAM_SPLIT streams is the respective byte width of each 
> FLBA column (i.e., 4 for latitudes and 5 for longitudes).
> {code}
> | name    | uncompressed |  lz4 | bss_lz4 | snappy | bss_snappy | zstd | bss_zstd | bss_ratio_lz4 | bss_ratio_snappy | bss_ratio_zstd |
> |---------|--------------|------|---------|--------|------------|------|----------|---------------|------------------|----------------|
> | min_lat |   4996652.00 | 1.00 |    1.01 |   1.00 |       1.03 | 1.05 |     1.12 |          1.01 |             1.03 |           1.07 |
> | max_lat |   4996652.00 | 1.00 |    1.01 |   1.00 |       1.03 | 1.05 |     1.13 |          1.01 |             1.03 |           1.07 |
> | min_lon |   6245825.00 | 1.00 |    1.14 |   1.00 |       1.16 | 1.15 |     1.31 |          1.14 |             1.16 |           1.14 |
> | max_lon |   6245825.00 | 1.00 |    1.14 |   1.00 |       1.16 | 1.15 |     1.31 |          1.14 |             1.16 |           1.14 |
> {code}
> !bss_osm_changesets.png!
> h3. Comments
> On this dataset, compression efficiency is generally quite poor and 
> BYTE_STREAM_SPLIT encoding brings almost no additional efficiency to the 
> table. It can be assumed that OSM changeset entries have geographical 
> coordinates all over the place (literally!) and therefore do not offer many 
> opportunities for compression.
> h2. Decimal data from an OpenStreetMap region
> I've chosen a small region of the world (Belgium) whose geographical 
> coordinates presumably allow for better compression by being much more 
> clustered. The file {{belgium-latest.osm.pbf}} was converted to ORC for 
> easier handling, resulting in a 745 MB ORC file.
> I've then loaded the decimal columns from the first stripe in that file:
> {code}
> pyarrow.RecordBatch
> lat: decimal128(9, 7)
> lon: decimal128(10, 7)
> ----
> lat: 
> [50.4443865,50.4469017,50.4487890,50.4499558,50.4523446,50.4536530,50.4571053,50.4601436,50.4631197,50.4678563,...,51.1055899,51.1106197,51.1049620,51.1047010,51.1104755,51.0997955,51.1058101,51.1010664,51.1014336,51.1055106]
> lon: 
> [3.6857362,3.6965046,3.7074481,3.7173626,3.8126033,3.9033178,3.9193678,3.9253319,3.9292409,3.9332670,...,4.6663214,4.6699997,4.6720536,4.6655159,4.6666372,4.6680394,4.6747172,4.6684242,4.6713693,4.6644899]
> {code}
> Here are the compression measurements for these columns. As in the previous 
> dataset, the number of BYTE_STREAM_SPLIT streams is the respective byte width 
> of each FLBA column (i.e., 4 for latitudes and 5 for longitudes).
> {code}
> | name | uncompressed |  lz4 | bss_lz4 | snappy | bss_snappy | zstd | bss_zstd | bss_ratio_lz4 | bss_ratio_snappy | bss_ratio_zstd |
> |------|--------------|------|---------|--------|------------|------|----------|---------------|------------------|----------------|
> | lat  |  12103680.00 | 1.00 |    1.63 |   1.00 |       1.63 | 1.18 |     1.73 |          1.63 |             1.63 |           1.47 |
> | lon  |  15129600.00 | 1.00 |    1.93 |   1.00 |       1.90 | 1.27 |     2.06 |          1.93 |             1.90 |           1.62 |
> {code}
>  !bss_osm_belgium.png!
> h3. Comments
> This dataset shows that a BYTE_STREAM_SPLIT encoding before compression 
> achieves a very significant additional efficiency compared to compression 
> alone.
> h2. Integer data from two OpenStreetMap data files
> I also tried to evaluate the efficiency of BYTE_STREAM_SPLIT on integer 
> columns (INT32 or INT64). Here, however, another efficient encoding is 
> already available (DELTA_BINARY_PACKED). So the evaluation focussed on 
> comparing BYTE_STREAM_SPLIT + compression against DELTA_BINARY_PACKED alone.
> The comparison was done on the same two OpenStreetMap files as above, using only the first stripe. Here are the measurement results in table format:
> !bss_ints_osm_changesets.png!
> !bss_ints_osm_belgium.png!
> **Caution**: the DELTA_BINARY_PACKED length measurement did not use a real 
> encoder implementation, but a length estimation function written in pure 
> Python. The estimation function should be accurate according to quick tests.
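> For the curious, such an estimator can be sketched as follows (an illustrative reimplementation assuming the usual parameters of 128 values per block and 4 miniblocks; this is not the actual script used for the measurements):
> {code}
> import numpy as np
>
> def _uleb128_len(x: int) -> int:
>     # Number of bytes needed for an unsigned LEB128 varint.
>     n = 1
>     while x >= 0x80:
>         x >>= 7
>         n += 1
>     return n
>
> def _zigzag(x: int) -> int:
>     return (x << 1) ^ (x >> 63)
>
> def estimate_delta_binary_packed_len(values, block_size=128, miniblocks=4):
>     values = np.asarray(values, dtype=np.int64)
>     per_miniblock = block_size // miniblocks
>     # Page header: block size, miniblock count, total count, first value.
>     size = (_uleb128_len(block_size) + _uleb128_len(miniblocks) +
>             _uleb128_len(len(values)) + _uleb128_len(_zigzag(int(values[0]))))
>     deltas = np.diff(values)
>     for start in range(0, len(deltas), block_size):
>         block = deltas[start:start + block_size]
>         min_delta = int(block.min())
>         # Per block: min delta as a zigzag varint + one bit-width byte per miniblock.
>         size += _uleb128_len(_zigzag(min_delta)) + miniblocks
>         for m in range(0, len(block), per_miniblock):
>             bit_width = (int(block[m:m + per_miniblock].max()) - min_delta).bit_length()
>             # Each miniblock is bit-packed and padded to a full miniblock of values.
>             size += bit_width * per_miniblock // 8
>     return size
> {code}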
> h3. Comments
> The results are very heterogeneous, depending on the kind of data those 
> integer columns represent.
> Some columns achieve very good compression ratios, far above 10x, with all 
> methods; for these columns, it does not make sense to compare the compression 
> ratios, since the column sizes will be very small in all cases; performance 
> and interoperability should be the only concerns.
> On other columns, the compression ratios are more moderate and 
> BYTE_STREAM_SPLIT + compression seems to be preferable to DELTA_BINARY_PACKED.
> h2. Integer data from a PyPI archive file
> I downloaded one of the "index" Parquet files from 
> https://github.com/pypi-data/data/releases and read the first row group.
> The measurement results are as follows:
> !bss_ints_pypi.png!
> h3. Comments
> On this data, BYTE_STREAM_SPLIT + compression is clearly better than 
> DELTA_BINARY_PACKED. The timestamp column ("uploaded_on") in particular shows 
> very strong benefits.
> h2. Integer data from a NYC "yellow" taxi file
> I downloaded one of the "yellow" taxi trip records from 
> https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and read the 
> first row group. The measurement results are as follows:
> !bss_ints_nyctaxi.png!
> h3. Comments
> These results are a bit of a mixed bag. Only BYTE_STREAM_SPLIT + zstd is consistently superior to DELTA_BINARY_PACKED. However, if one focusses on the timestamp columns, then all three general-purpose compressors provide a benefit.
> h2. Discussion
> When reading these results, it is important to keep in mind that the exact compression ratios do not necessarily matter, as long as the efficiency is high enough. A compressor that achieves 100x compression on a column is not necessarily worse than one that achieves 300x compression on the same column: both are "good enough" on this particular data. By contrast, when compression ratios are moderate (lower than 10x), they should certainly be compared.
> h3. Efficiency
> h4. Efficiency on FIXED_LEN_BYTE_ARRAY data
> These examples show that extending the BYTE_STREAM_SPLIT encoding to 
> FIXED_LEN_BYTE_ARRAY columns (even regardless of their logical types) can 
> yield very significant compression efficiency improvements on two specific 
> types of FIXED_LEN_BYTE_ARRAY data: FLOAT16 data and DECIMAL data.
> h4. Efficiency on INT32 / INT64 data
> Extending the BYTE_STREAM_SPLIT encoding to INT32 and INT64 columns can bring 
> significant benefits over DELTA_BINARY_PACKED. However, whether and by how 
> much depends on the kind of data that is encoded as integers. Timestamps seem 
> to always benefit from BYTE_STREAM_SPLIT encoding. Pairing BYTE_STREAM_SPLIT 
> with zstd also generally achieves higher efficiency than DELTA_BINARY_PACKED.
> Whether to choose BYTE_STREAM_SPLIT + compression over DELTA_BINARY_PACKED 
> will in practice have to be informed by several factors, such as performance 
> expectations and interoperability. Sophisticated writers might also implement 
> some form of sampling to find out the best encoding + compression combination 
> for a given column.
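> In spirit, such sampling could look like the sketch below (all helper names are hypothetical hooks, not actual Parquet writer APIs):
> {code}
> def pick_column_encoding(sample_values, encoders, codecs):
>     # encoders: maps encoding names to functions returning encoded bytes.
>     # codecs: maps codec names to objects with a .compress(bytes) method.
>     # Encode a sample with every (encoding, codec) pair and keep the smallest.
>     sizes = {}
>     for enc_name, encode in encoders.items():
>         encoded = encode(sample_values)
>         sizes[(enc_name, None)] = len(encoded)
>         for codec_name, codec in codecs.items():
>             sizes[(enc_name, codec_name)] = len(codec.compress(encoded))
>     return min(sizes, key=sizes.get)
> {code}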
> **Note**: all tested data above is actually INT64. However, given the 
> mechanics of BYTE_STREAM_SPLIT and DELTA_BINARY_PACKED, we can assume that 
> similar results would have been obtained for INT32 data.
> h3. Performance
> Since BYTE_STREAM_SPLIT only brings benefits in combination with compression, 
> the overall encoding + compression cost should be considered.
> h4. Performance on FIXED_LEN_BYTE_ARRAY data
> The choice is between BYTE_STREAM_SPLIT + compression vs. compression alone. 
> Even a non-SIMD optimized version of BYTE_STREAM_SPLIT, such as in Parquet 
> C++, can achieve multiple GB/s; there is little reason to pay the cost of 
> compression but refuse to pay the much smaller cost of the BYTE_STREAM_SPLIT 
> encoding step.
> h4. Performance on INT32 / INT64 data
> The choice is between BYTE_STREAM_SPLIT + compression vs. DELTA_BINARY_PACKED 
> alone. DELTA_BINARY_PACKED has a significant performance edge. The current 
> Parquet C++ implementation of DELTA_BINARY_PACKED encodes between 600 MB/s 
> and 2 GB/s, and decodes between 3 and 6 GB/s. This is faster than any of the 
> general-purpose compression schemes available in Parquet, even lz4.
> h3. Implementation complexity
> BYTE_STREAM_SPLIT, even in a byte-width-agnostic form, is almost trivial to implement. A simple implementation can yield good performance with a minimum of work. For example, the non-SIMD-optimized BYTE_STREAM_SPLIT encoding and decoding routines in Parquet C++ amount to a total of only ~200 lines of code, despite explicitly unrolled loops:
> https://github.com/apache/arrow/blob/4e58f7ca0016c2b2d8a859a0c5965df3b15523e0/cpp/src/arrow/util/byte_stream_split_internal.h#L593-L702


