daniel-x opened a new issue, #49715:
URL: https://github.com/apache/arrow/issues/49715

   ### Describe the enhancement requested
   
   tl;dr
   For FLOAT and DOUBLE columns, parquet's dictionary encoding actively 
hurts file size and I/O speed. Defaulting to PLAIN instead of dictionary 
encoding solves this problem and has only positive effects: smaller file 
size, faster I/O, and perfect backwards compatibility. This should be 
done.
   
   Details:
   The default in parquet is to try dictionary encoding and fall back to 
plain if that fails. Float32/64 values are typically high cardinality, so 
dictionary encoding usually fails and falls back to plain, which adds 
overhead and bloats float32/64 columns by 25%/12.5% respectively. 
Subsequent compression such as zstd also struggles with the bloated column 
data, so these columns stay larger even with compression enabled.
   Switching to PLAIN instead of dictionary for float32/64 columns removes the 
overhead in uncompressed mode and, if compression is used, lets the subsequent 
compressor work effectively (36% smaller file size plus faster writes in my 
benchmark). Moreover, PLAIN has always been supported by parquet readers and 
writers, so this change is fully backwards compatible with all versions.
   Switching to BYTE_STREAM_SPLIT as the default for float32/64 columns has 
been discussed before. It would result in even smaller file sizes and also 
improve writing speed compared to dictionary, but it would introduce a 
backward compatibility issue, so it is not an advantage-only alternative. 
Therefore, switching to PLAIN as the default is the better step.
   
   Attached are benchmark results with random data drawn from a N(mu=1000, 
std=500) distribution, which results in 36% smaller file size with zstd 
compression and 30% smaller file size without any subsequent compression for 
float32 data.
   
   <img width="715" height="608" alt="Image" src="https://github.com/user-attachments/assets/7fa84e18-8e37-4105-b77a-e1ab8400f111" />
   
   
   Benchmark script (AI-generated):
   
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.feather as feather
   import os
   import time
   
   # Generate data: 100k rows x 100 columns of float32 values,
   # normal distribution, mean=1000, std=500
   np.random.seed(42)
   NUM_COLS = 100
   NUM_ROWS = 100_000
   data = np.random.normal(loc=1000, scale=500, size=(NUM_ROWS, NUM_COLS)).astype(np.float32)
   
   # Build Arrow table
   columns = {f'col_{i:03d}': pa.array(data[:, i], type=pa.float32()) for i in range(NUM_COLS)}
   table = pa.table(columns)
   
   total_values = NUM_ROWS * NUM_COLS
   raw_size = total_values * 4  # 4 bytes per float32
   print(f"Data: {NUM_ROWS} rows x {NUM_COLS} columns = {total_values:,} float32 values")
   print(f"Raw size: {raw_size:,} bytes ({raw_size/1024/1024:.2f} MB)")
   print()
   
   # ── Configurations to test ──────────────────────────────────────────────────
   configs = []
   
   # Arrow IPC (Feather) variants
   configs.append(("Arrow IPC — no compression", "ipc", {"compression": 
"uncompressed"}))
   configs.append(("Arrow IPC — Zstd", "ipc", {"compression": "zstd"}))
   
   # Parquet: default encoding (dictionary → fallback to plain)
   configs.append(("Parquet default — no compression", "parquet", {
       "compression": "NONE"}))
   configs.append(("Parquet default — Gzip", "parquet", {
       "compression": "gzip"}))
   configs.append(("Parquet default — Brotli", "parquet", {
       "compression": "brotli"}))
   configs.append(("Parquet default — Zstd", "parquet", {
       "compression": "zstd"}))
   
   # Parquet: PLAIN encoding (no dictionary, no BSS)
   configs.append(("Parquet PLAIN — Gzip", "parquet", {
       "compression": "gzip",
       "use_dictionary": False}))
   configs.append(("Parquet PLAIN — Brotli", "parquet", {
       "compression": "brotli",
       "use_dictionary": False}))
   configs.append(("Parquet PLAIN — Zstd", "parquet", {
       "compression": "zstd",
       "use_dictionary": False}))
   
   # Parquet: BYTE_STREAM_SPLIT
   configs.append(("Parquet BSS — no compression", "parquet", {
       "compression": "NONE",
       "use_dictionary": False,
       "use_byte_stream_split": True}))
   configs.append(("Parquet BSS — Gzip", "parquet", {
       "compression": "gzip",
       "use_dictionary": False,
       "use_byte_stream_split": True}))
   configs.append(("Parquet BSS — Brotli", "parquet", {
       "compression": "brotli",
       "use_dictionary": False,
       "use_byte_stream_split": True}))
   configs.append(("Parquet BSS — Zstd", "parquet", {
       "compression": "zstd",
       "use_dictionary": False,
       "use_byte_stream_split": True}))
   
   # ── Run benchmarks ──────────────────────────────────────────────────────────
   results = []
   
   for name, fmt, opts in configs:
       filepath = f"/home/claude/test_output.{'arrow' if fmt == 'ipc' else 'parquet'}"
       
       # Write
       times_write = []
       for _ in range(5):
           t0 = time.perf_counter()
           if fmt == "ipc":
               feather.write_feather(table, filepath, **opts)
           else:
               pq.write_table(table, filepath, **opts)
           t1 = time.perf_counter()
           times_write.append(t1 - t0)
       
       file_size = os.path.getsize(filepath)
       
       # Read
       times_read = []
       for _ in range(5):
           t0 = time.perf_counter()
           if fmt == "ipc":
               _ = feather.read_table(filepath)
           else:
               _ = pq.read_table(filepath)
           t1 = time.perf_counter()
           times_read.append(t1 - t0)
       
       # Median of the 5 runs, converted from seconds to milliseconds
       avg_write = np.median(times_write) * 1000
       avg_read = np.median(times_read) * 1000
       ratio = file_size / raw_size * 100
       
       results.append((name, file_size, ratio, avg_write, avg_read))
       os.remove(filepath)
   
   # ── Print results ───────────────────────────────────────────────────────────
   print(f"{'Configuration':<38} {'Size':>10} {'Ratio':>8} {'Write':>10} {'Read':>10}")
   print(f"{'':─<38} {'(bytes)':>10} {'(%)':>8} {'(ms)':>10} {'(ms)':>10}")
   print("─" * 80)
   
   for name, size, ratio, wt, rt in results:
       print(f"{name:<38} {size:>10,} {ratio:>7.1f}% {wt:>9.1f} {rt:>9.1f}")
   
   print()
   print("BSS = BYTE_STREAM_SPLIT")
   print("Ratio = file size / raw float32 size")
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
