daniel-x opened a new issue, #49715: URL: https://github.com/apache/arrow/issues/49715
### Describe the enhancement requested

tl;dr For FLOAT and DOUBLE columns, Parquet's dictionary encoding actively hurts file size and I/O speed. Defaulting to PLAIN instead of dictionary encoding solves this problem and has only positive effects: smaller file size, improved I/O speed, and full backward compatibility. This should be done.

Details:

The default in Parquet is to try dictionary encoding and fall back to PLAIN if that fails. Float32/64 values are typically high cardinality, so dictionary encoding usually fails and falls back to PLAIN, which adds overhead and bloats float32/64 columns by 25%/12.5% respectively. Subsequent compression such as zstd also struggles with the bloated column data, so these columns stay larger even with compression enabled.

Switching to PLAIN instead of dictionary encoding for float32/64 columns removes the overhead in uncompressed mode and, when compression is used, lets the compressor work effectively (36% smaller file size plus faster writes in my benchmark). Moreover, PLAIN has always been supported by Parquet readers and writers, so this change is fully backward compatible with all versions.

Switching to BYTE_STREAM_SPLIT as the default for float32/64 columns has been discussed before. It would give even smaller files and faster writes than dictionary encoding, but it would introduce a backward-compatibility issue, so it is not an advantage-only alternative. Therefore, switching to PLAIN as the default is the better step.

Attached are benchmark results with random data drawn from an N(mu=1000, std=500) distribution: for float32 data, files are 36% smaller with zstd compression and 30% smaller without any compression.
<img width="715" height="608" alt="Image" src="https://github.com/user-attachments/assets/7fa84e18-8e37-4105-b77a-e1ab8400f111" />

Benchmark script (this is AI-generated):

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.feather as feather
    import os
    import time

    # Generate data: 100k rows x 100 float32 columns, normal distribution, mean=1000, std=500
    np.random.seed(42)
    NUM_COLS = 100
    NUM_ROWS = 100_000
    data = np.random.normal(loc=1000, scale=500, size=(NUM_ROWS, NUM_COLS)).astype(np.float32)

    # Build Arrow table
    columns = {f'col_{i:03d}': pa.array(data[:, i], type=pa.float32()) for i in range(NUM_COLS)}
    table = pa.table(columns)

    total_values = NUM_ROWS * NUM_COLS
    raw_size = total_values * 4  # 4 bytes per float32

    print(f"Data: {NUM_ROWS} rows x {NUM_COLS} columns = {total_values:,} float32 values")
    print(f"Raw size: {raw_size:,} bytes ({raw_size/1024/1024:.2f} MB)")
    print()

    # ── Configurations to test ──────────────────────────────────────────────────
    configs = []

    # Arrow IPC (Feather) variants
    configs.append(("Arrow IPC - no compression", "ipc", {"compression": "uncompressed"}))
    configs.append(("Arrow IPC - Zstd", "ipc", {"compression": "zstd"}))

    # Parquet: default encoding (dictionary → fallback to plain)
    configs.append(("Parquet default - no compression", "parquet", {"compression": "NONE"}))
    configs.append(("Parquet default - Gzip", "parquet", {"compression": "gzip"}))
    configs.append(("Parquet default - Brotli", "parquet", {"compression": "brotli"}))
    configs.append(("Parquet default - Zstd", "parquet", {"compression": "zstd"}))

    # Parquet: PLAIN encoding (no dictionary, no BSS)
    configs.append(("Parquet PLAIN - Gzip", "parquet", {
        "compression": "gzip", "use_dictionary": False}))
    configs.append(("Parquet PLAIN - Brotli", "parquet", {
        "compression": "brotli", "use_dictionary": False}))
    configs.append(("Parquet PLAIN - Zstd", "parquet", {
        "compression": "zstd", "use_dictionary": False}))

    # Parquet: BYTE_STREAM_SPLIT
    configs.append(("Parquet BSS - no compression", "parquet", {
        "compression": "NONE", "use_dictionary": False, "use_byte_stream_split": True}))
    configs.append(("Parquet BSS - Gzip", "parquet", {
        "compression": "gzip", "use_dictionary": False, "use_byte_stream_split": True}))
    configs.append(("Parquet BSS - Brotli", "parquet", {
        "compression": "brotli", "use_dictionary": False, "use_byte_stream_split": True}))
    configs.append(("Parquet BSS - Zstd", "parquet", {
        "compression": "zstd", "use_dictionary": False, "use_byte_stream_split": True}))

    # ── Run benchmarks ──────────────────────────────────────────────────────────
    results = []

    for name, fmt, opts in configs:
        filepath = f"/home/claude/test_output.{'arrow' if fmt == 'ipc' else 'parquet'}"

        # Write
        times_write = []
        for _ in range(5):
            t0 = time.perf_counter()
            if fmt == "ipc":
                feather.write_feather(table, filepath, **opts)
            else:
                pq.write_table(table, filepath, **opts)
            t1 = time.perf_counter()
            times_write.append(t1 - t0)

        file_size = os.path.getsize(filepath)

        # Read
        times_read = []
        for _ in range(5):
            t0 = time.perf_counter()
            if fmt == "ipc":
                _ = feather.read_table(filepath)
            else:
                _ = pq.read_table(filepath)
            t1 = time.perf_counter()
            times_read.append(t1 - t0)

        # Median of the 5 runs, in milliseconds
        avg_write = np.median(times_write) * 1000
        avg_read = np.median(times_read) * 1000
        ratio = file_size / raw_size * 100

        results.append((name, file_size, ratio, avg_write, avg_read))
        os.remove(filepath)

    # ── Print results ───────────────────────────────────────────────────────────
    print(f"{'Configuration':<38} {'Size':>10} {'Ratio':>8} {'Write':>10} {'Read':>10}")
    print(f"{'':─<38} {'(bytes)':>10} {'(%)':>8} {'(ms)':>10} {'(ms)':>10}")
    print("─" * 80)
    for name, size, ratio, wt, rt in results:
        print(f"{name:<38} {size:>10,} {ratio:>7.1f}% {wt:>9.1f} {rt:>9.1f}")

    print()
    print("BSS = BYTE_STREAM_SPLIT")
    print("Ratio = file size / raw float32 size")

### Component(s)

C++

--
This is an automated message from the Apache Git Service.
