progval opened a new pull request, #5860:
URL: https://github.com/apache/arrow-rs/pull/5860

   # Which issue does this PR close?
   
   Closes #5859.
   
   # Rationale for this change
    
   This lets the writer flush Bloom filters to the file as it goes instead of 
keeping all of them in memory until the end, which can save significant memory 
when writing long files. It switches from one to the other of the two layouts 
[mentioned in the 
spec](https://parquet.apache.org/docs/file-format/bloomfilter/#file-format).
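
To make the difference between the two layouts concrete, here is a minimal toy model in plain Rust. It is only a sketch and does not use the `parquet` crate: `Layout`, `ToyWriter`, and `simulate` are hypothetical names, and the sizes (100 bytes of data and a 10-byte filter per row group) are made up. It tracks where each filter lands in the file and how many filter bytes are held in memory at the worst moment.

```rust
/// Where Bloom filters are placed in the (toy) output file.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Layout {
    /// All filters are buffered and written together at the end.
    AtEnd,
    /// Each filter is written right after its row group.
    Interleaved,
}

/// Hypothetical stand-in for a file writer; it only tracks lengths and offsets.
struct ToyWriter {
    layout: Layout,
    file_len: usize,
    pending: Vec<Vec<u8>>,      // filters still held in memory
    filter_offsets: Vec<usize>, // where each filter ended up in the file
    peak_pending_bytes: usize,  // worst-case filter bytes held at once
}

impl ToyWriter {
    fn new(layout: Layout) -> Self {
        ToyWriter {
            layout,
            file_len: 0,
            pending: Vec::new(),
            filter_offsets: Vec::new(),
            peak_pending_bytes: 0,
        }
    }

    fn write_row_group(&mut self, data_len: usize, filter: Vec<u8>) {
        self.file_len += data_len;
        self.pending.push(filter);
        let held: usize = self.pending.iter().map(|f| f.len()).sum();
        self.peak_pending_bytes = self.peak_pending_bytes.max(held);
        if self.layout == Layout::Interleaved {
            // Flush immediately so the filter does not stay in memory.
            self.flush_pending();
        }
    }

    fn flush_pending(&mut self) {
        for filter in std::mem::take(&mut self.pending) {
            self.filter_offsets.push(self.file_len);
            self.file_len += filter.len();
        }
    }

    /// Returns (filter offsets, peak filter bytes held, final file length).
    fn close(mut self) -> (Vec<usize>, usize, usize) {
        self.flush_pending();
        (self.filter_offsets, self.peak_pending_bytes, self.file_len)
    }
}

/// Writes `row_groups` row groups of 100 data bytes, each with a 10-byte filter.
fn simulate(layout: Layout, row_groups: usize) -> (Vec<usize>, usize, usize) {
    let mut writer = ToyWriter::new(layout);
    for _ in 0..row_groups {
        writer.write_row_group(100, vec![0u8; 10]);
    }
    writer.close()
}
```

In this model, `simulate(Layout::AtEnd, n)` holds `n` filters at its peak, while `simulate(Layout::Interleaved, n)` never holds more than one; both produce a file of the same length.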
   
   # What changes are included in this PR?
   
   This includes a script that demonstrates the memory usage.
   
   Before the change, resident memory grows linearly to 4.32GB near the end of 
the run (4.49GB at completion):
   
   ```
   $ cargo run --example write_parquet --release --features=log
       Finished release [optimized] target(s) in 0.11s
        Running `target/release/examples/write_parquet`
   12:52:11 [INFO] Writing batches
   12:52:21 [INFO] 267 iterations, 10s, 26.68 iterations/s, 37.48 
ms/iterations; 8.90% done, 1m 42s to end; res/vir/avail/free/total mem 
399.72MB/419.99MB/25.93GB/10.45GB/33.44GB
   12:52:31 [INFO] 536 iterations, 20s, 26.75 iterations/s, 37.38 
ms/iterations; 17.87% done, 1m 31s to end; res/vir/avail/free/total mem 
805.78MB/829.16MB/25.93GB/10.45GB/33.44GB
   12:52:41 [INFO] 805 iterations, 30s, 26.80 iterations/s, 37.31 
ms/iterations; 26.83% done, 1m 21s to end; res/vir/avail/free/total mem 
1.24GB/1.27GB/25.93GB/10.45GB/33.44GB
   12:52:51 [INFO] 1,073 iterations, 40s, 26.79 iterations/s, 37.33 
ms/iterations; 35.77% done, 1m 11s to end; res/vir/avail/free/total mem 
1.61GB/1.64GB/25.93GB/10.45GB/33.44GB
   12:53:01 [INFO] 1,342 iterations, 50s, 26.80 iterations/s, 37.31 
ms/iterations; 44.73% done, 1m 1s to end; res/vir/avail/free/total mem 
2.00GB/2.03GB/25.93GB/10.45GB/33.44GB
   12:53:11 [INFO] 1,610 iterations, 1m 0s, 26.80 iterations/s, 37.32 
ms/iterations; 53.67% done, 51s to end; res/vir/avail/free/total mem 
2.39GB/2.42GB/25.93GB/10.45GB/33.44GB
   12:53:21 [INFO] 1,869 iterations, 1m 10s, 26.65 iterations/s, 37.52 
ms/iterations; 62.30% done, 42s to end; res/vir/avail/free/total mem 
2.78GB/2.82GB/25.93GB/10.45GB/33.44GB
   12:53:31 [INFO] 2,130 iterations, 1m 20s, 26.57 iterations/s, 37.63 
ms/iterations; 71.00% done, 32s to end; res/vir/avail/free/total mem 
3.16GB/3.21GB/25.93GB/10.45GB/33.44GB
   12:53:41 [INFO] 2,391 iterations, 1m 30s, 26.52 iterations/s, 37.71 
ms/iterations; 79.70% done, 22s to end; res/vir/avail/free/total mem 
3.54GB/3.59GB/25.93GB/10.45GB/33.44GB
   12:53:51 [INFO] 2,650 iterations, 1m 40s, 26.45 iterations/s, 37.80 
ms/iterations; 88.33% done, 13s to end; res/vir/avail/free/total mem 
3.93GB/3.98GB/25.93GB/10.45GB/33.44GB
   12:54:01 [INFO] 2,908 iterations, 1m 50s, 26.39 iterations/s, 37.90 
ms/iterations; 96.93% done, 3s to end; res/vir/avail/free/total mem 
4.32GB/4.37GB/25.93GB/10.45GB/33.44GB
   12:54:05 [INFO] Completed.
   12:54:05 [INFO] Elapsed: 1m 53s [3,000 iterations, 26.36 iterations/s, 37.93 
ms/iterations]; res/vir/avail/free/total mem 
4.49GB/4.54GB/25.93GB/10.45GB/33.44GB
   ```
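
As a rough check of the "linear" claim, the growth rate can be estimated from two samples of the log above. This is only a back-of-the-envelope sketch: `growth_per_iteration` is a hypothetical helper, and it assumes the log's units are decimal (1GB = 1000MB), which the log itself does not state.

```rust
/// One sample from the log: (iterations completed, resident memory in MB).
type Sample = (f64, f64);

/// Average resident-memory growth in MB per iteration between two samples.
fn growth_per_iteration(first: Sample, last: Sample) -> f64 {
    (last.1 - first.1) / (last.0 - first.0)
}

/// Samples taken from the run above: 399.72MB at 267 iterations and
/// 4.32GB (assumed to be 4320MB) at 2,908 iterations.
fn growth_before_change() -> f64 {
    growth_per_iteration((267.0, 399.72), (2908.0, 4320.0))
}
```

Under that assumption the slope comes out at roughly 1.5MB of resident memory per iteration.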
   
   After the change, resident memory stays constant at 55.23MB:
   
   ```
   $ cargo run --example write_parquet --release --features=log
      Compiling parquet v51.0.0 (/home/rust/arrow-rs/parquet)
       Finished release [optimized] target(s) in 11.24s
        Running `target/release/examples/write_parquet`
   12:54:29 [INFO] Writing batches
   12:54:39 [INFO] 261 iterations, 10s, 26.02 iterations/s, 38.43 
ms/iterations; 8.70% done, 1m 44s to end; res/vir/avail/free/total mem 
49.92MB/69.59MB/25.87GB/10.40GB/33.44GB
   12:54:49 [INFO] 525 iterations, 20s, 26.20 iterations/s, 38.17 
ms/iterations; 17.50% done, 1m 34s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:54:59 [INFO] 791 iterations, 30s, 26.32 iterations/s, 38.00 
ms/iterations; 26.37% done, 1m 23s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:55:09 [INFO] 1,058 iterations, 40s, 26.40 iterations/s, 37.88 
ms/iterations; 35.27% done, 1m 13s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:55:19 [INFO] 1,325 iterations, 50s, 26.45 iterations/s, 37.81 
ms/iterations; 44.17% done, 1m 3s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:55:29 [INFO] 1,593 iterations, 1m 0s, 26.50 iterations/s, 37.74 
ms/iterations; 53.10% done, 53s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:55:39 [INFO] 1,861 iterations, 1m 10s, 26.54 iterations/s, 37.68 
ms/iterations; 62.03% done, 42s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:55:49 [INFO] 2,128 iterations, 1m 20s, 26.55 iterations/s, 37.66 
ms/iterations; 70.93% done, 32s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:55:59 [INFO] 2,384 iterations, 1m 30s, 26.44 iterations/s, 37.82 
ms/iterations; 79.47% done, 23s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:56:09 [INFO] 2,642 iterations, 1m 40s, 26.37 iterations/s, 37.91 
ms/iterations; 88.07% done, 13s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:56:19 [INFO] 2,900 iterations, 1m 50s, 26.32 iterations/s, 38.00 
ms/iterations; 96.67% done, 3s to end; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   12:56:23 [INFO] Completed.
   12:56:23 [INFO] Elapsed: 1m 54s [3,000 iterations, 26.29 iterations/s, 38.04 
ms/iterations]; res/vir/avail/free/total mem 
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
   ```
   
   This is a demo of the change, to check that this is something we want.
   In particular, it breaks `arrow::arrow_writer::tests::*_bloom_filter`: 
those tests expect to read the Bloom filters from memory at the end, but the 
filters are no longer kept there.
   
   If this looks good to you, I'll add a field to `WriterProperties` to 
switch between the old behavior (all Bloom filters at the end) and this one 
(interleaved Bloom filters). What should I call it?
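
To help the naming discussion, here is one possible shape for such a setting, written as a standalone sketch in the builder style. Every name in it (`BloomFilterLayout`, `ToyWriterProperties`, `set_bloom_filter_layout`) is hypothetical and not part of the `parquet` crate; the default keeps the old behavior so existing callers would see no change.

```rust
/// Hypothetical setting: where Bloom filters are written in the file.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum BloomFilterLayout {
    /// Old behavior: buffer all filters and write them at the end.
    #[default]
    AtEnd,
    /// New behavior: write each filter as soon as it is ready.
    Interleaved,
}

/// Hypothetical stand-in for the writer properties, reduced to one field.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
struct ToyWriterProperties {
    bloom_filter_layout: BloomFilterLayout,
}

#[derive(Default)]
struct ToyWriterPropertiesBuilder {
    bloom_filter_layout: BloomFilterLayout,
}

impl ToyWriterProperties {
    fn builder() -> ToyWriterPropertiesBuilder {
        ToyWriterPropertiesBuilder::default()
    }

    fn bloom_filter_layout(&self) -> BloomFilterLayout {
        self.bloom_filter_layout
    }
}

impl ToyWriterPropertiesBuilder {
    /// Chooses the layout; callers that never call this keep `AtEnd`.
    fn set_bloom_filter_layout(mut self, layout: BloomFilterLayout) -> Self {
        self.bloom_filter_layout = layout;
        self
    }

    fn build(self) -> ToyWriterProperties {
        ToyWriterProperties {
            bloom_filter_layout: self.bloom_filter_layout,
        }
    }
}
```

A caller would opt in with something like `ToyWriterProperties::builder().set_bloom_filter_layout(BloomFilterLayout::Interleaved).build()`. An enum rather than a `bool` leaves room for further layouts later.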
   
   # Are there any user-facing changes?
   
   The layout of output files changes significantly. This may hurt the 
performance of readers that expect data locality, as Bloom filters are now 
scattered across the file.
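
The locality concern can be illustrated by counting how many contiguous reads are needed to fetch every filter. This is a sketch only: `coalesce` is a hypothetical helper, and the byte ranges in the usage note are made up rather than taken from a real file.

```rust
/// Merges `(start, len)` byte ranges that touch or overlap into contiguous reads.
fn coalesce(mut ranges: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    ranges.sort_by_key(|&(start, _)| start);
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (start, len) in ranges {
        if let Some(last) = merged.last_mut() {
            let last_end = last.0 + last.1;
            if start <= last_end {
                // Extend the previous read instead of starting a new one.
                let end = (start + len).max(last_end);
                last.1 = end - last.0;
                continue;
            }
        }
        merged.push((start, len));
    }
    merged
}
```

With four 10-byte filters stored back to back at offsets 400, 410, 420, and 430, `coalesce` yields a single 40-byte read. With the same filters at offsets 100, 210, 320, and 430, it yields four separate reads, which is the cost a reader fetching all filters would pay with the interleaved layout.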


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
