progval opened a new pull request, #5860:
URL: https://github.com/apache/arrow-rs/pull/5860
# Which issue does this PR close?
Closes #5859.
# Rationale for this change
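This avoids keeping every Bloom filter in memory until the file is closed,
which can save significant memory when writing long files. It does so by
switching from one of the two layouts [mentioned in the
spec](https://parquet.apache.org/docs/file-format/bloomfilter/#file-format)
to the other: Bloom filters are now written between row groups instead of all
together at the end of the file.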
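To illustrate why the layout matters for memory, here is a toy model rather than the code in this PR. It assumes one Bloom filter of a fixed size per row group; `Layout` and `peak_buffered_bytes` are hypothetical names made up for this sketch.

```rust
/// Toy model of the two layouts; not the parquet crate's types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Layout {
    /// All filters are buffered and written together at the end of the file.
    AllAtEnd,
    /// Each filter is written right after its row group, then dropped.
    AfterEachRowGroup,
}

/// Returns the peak number of bytes of Bloom filter data held in memory,
/// assuming one filter of `filter_bytes` bytes per row group.
fn peak_buffered_bytes(layout: Layout, row_groups: u64, filter_bytes: u64) -> u64 {
    let mut buffered = 0u64;
    let mut peak = 0u64;
    for _ in 0..row_groups {
        // The filter is built while the row group is being written.
        buffered += filter_bytes;
        peak = peak.max(buffered);
        if layout == Layout::AfterEachRowGroup {
            // Flushed to the file immediately, so nothing accumulates.
            buffered = 0;
        }
    }
    peak
}
```

In this model the first layout grows linearly with the number of row groups, while the second never holds more than one filter at a time.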
# What changes are included in this PR?
This includes an example script (`write_parquet`) that demonstrates the memory usage.
Before the change, resident memory increases linearly, reaching about 4.5GB by the end of the run:
```
$ cargo run --example write_parquet --release --features=log
Finished release [optimized] target(s) in 0.11s
Running `target/release/examples/write_parquet`
12:52:11 [INFO] Writing batches
12:52:21 [INFO] 267 iterations, 10s, 26.68 iterations/s, 37.48
ms/iterations; 8.90% done, 1m 42s to end; res/vir/avail/free/total mem
399.72MB/419.99MB/25.93GB/10.45GB/33.44GB
12:52:31 [INFO] 536 iterations, 20s, 26.75 iterations/s, 37.38
ms/iterations; 17.87% done, 1m 31s to end; res/vir/avail/free/total mem
805.78MB/829.16MB/25.93GB/10.45GB/33.44GB
12:52:41 [INFO] 805 iterations, 30s, 26.80 iterations/s, 37.31
ms/iterations; 26.83% done, 1m 21s to end; res/vir/avail/free/total mem
1.24GB/1.27GB/25.93GB/10.45GB/33.44GB
12:52:51 [INFO] 1,073 iterations, 40s, 26.79 iterations/s, 37.33
ms/iterations; 35.77% done, 1m 11s to end; res/vir/avail/free/total mem
1.61GB/1.64GB/25.93GB/10.45GB/33.44GB
12:53:01 [INFO] 1,342 iterations, 50s, 26.80 iterations/s, 37.31
ms/iterations; 44.73% done, 1m 1s to end; res/vir/avail/free/total mem
2.00GB/2.03GB/25.93GB/10.45GB/33.44GB
12:53:11 [INFO] 1,610 iterations, 1m 0s, 26.80 iterations/s, 37.32
ms/iterations; 53.67% done, 51s to end; res/vir/avail/free/total mem
2.39GB/2.42GB/25.93GB/10.45GB/33.44GB
12:53:21 [INFO] 1,869 iterations, 1m 10s, 26.65 iterations/s, 37.52
ms/iterations; 62.30% done, 42s to end; res/vir/avail/free/total mem
2.78GB/2.82GB/25.93GB/10.45GB/33.44GB
12:53:31 [INFO] 2,130 iterations, 1m 20s, 26.57 iterations/s, 37.63
ms/iterations; 71.00% done, 32s to end; res/vir/avail/free/total mem
3.16GB/3.21GB/25.93GB/10.45GB/33.44GB
12:53:41 [INFO] 2,391 iterations, 1m 30s, 26.52 iterations/s, 37.71
ms/iterations; 79.70% done, 22s to end; res/vir/avail/free/total mem
3.54GB/3.59GB/25.93GB/10.45GB/33.44GB
12:53:51 [INFO] 2,650 iterations, 1m 40s, 26.45 iterations/s, 37.80
ms/iterations; 88.33% done, 13s to end; res/vir/avail/free/total mem
3.93GB/3.98GB/25.93GB/10.45GB/33.44GB
12:54:01 [INFO] 2,908 iterations, 1m 50s, 26.39 iterations/s, 37.90
ms/iterations; 96.93% done, 3s to end; res/vir/avail/free/total mem
4.32GB/4.37GB/25.93GB/10.45GB/33.44GB
12:54:05 [INFO] Completed.
12:54:05 [INFO] Elapsed: 1m 53s [3,000 iterations, 26.36 iterations/s, 37.93
ms/iterations]; res/vir/avail/free/total mem
4.49GB/4.54GB/25.93GB/10.45GB/33.44GB
```
After the change, resident memory remains constant at 55.2MB:
```
$ cargo run --example write_parquet --release --features=log
Compiling parquet v51.0.0 (/home/rust/arrow-rs/parquet)
Finished release [optimized] target(s) in 11.24s
Running `target/release/examples/write_parquet`
12:54:29 [INFO] Writing batches
12:54:39 [INFO] 261 iterations, 10s, 26.02 iterations/s, 38.43
ms/iterations; 8.70% done, 1m 44s to end; res/vir/avail/free/total mem
49.92MB/69.59MB/25.87GB/10.40GB/33.44GB
12:54:49 [INFO] 525 iterations, 20s, 26.20 iterations/s, 38.17
ms/iterations; 17.50% done, 1m 34s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:54:59 [INFO] 791 iterations, 30s, 26.32 iterations/s, 38.00
ms/iterations; 26.37% done, 1m 23s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:09 [INFO] 1,058 iterations, 40s, 26.40 iterations/s, 37.88
ms/iterations; 35.27% done, 1m 13s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:19 [INFO] 1,325 iterations, 50s, 26.45 iterations/s, 37.81
ms/iterations; 44.17% done, 1m 3s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:29 [INFO] 1,593 iterations, 1m 0s, 26.50 iterations/s, 37.74
ms/iterations; 53.10% done, 53s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:39 [INFO] 1,861 iterations, 1m 10s, 26.54 iterations/s, 37.68
ms/iterations; 62.03% done, 42s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:49 [INFO] 2,128 iterations, 1m 20s, 26.55 iterations/s, 37.66
ms/iterations; 70.93% done, 32s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:55:59 [INFO] 2,384 iterations, 1m 30s, 26.44 iterations/s, 37.82
ms/iterations; 79.47% done, 23s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:56:09 [INFO] 2,642 iterations, 1m 40s, 26.37 iterations/s, 37.91
ms/iterations; 88.07% done, 13s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:56:19 [INFO] 2,900 iterations, 1m 50s, 26.32 iterations/s, 38.00
ms/iterations; 96.67% done, 3s to end; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
12:56:23 [INFO] Completed.
12:56:23 [INFO] Elapsed: 1m 54s [3,000 iterations, 26.29 iterations/s, 38.04
ms/iterations]; res/vir/avail/free/total mem
55.23MB/73.79MB/25.87GB/10.40GB/33.44GB
```
This is a demo of the change, just to make sure this is something we want.
In particular, it breaks `arrow::arrow_writer::tests::*_bloom_filter`,
because those tests expect to read the Bloom Filters from memory at the end
of the write, and the filters are no longer kept there.
So if this looks good to you, I'll add a field in `WriterProperties` to
switch between the old behavior (all Bloom Filters at the end) and this one
(interleaved Bloom Filters). What should I call it?
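One possible shape for that switch is sketched below, purely as a starting point for the naming discussion. Every name here (`BloomFilterLayout`, `SketchWriterProperties`, `set_bloom_filter_layout`) is a placeholder invented for this sketch, not an existing parquet API, and the stand-in structs model only the new field. Defaulting to the old behavior is also just an assumption, chosen so that existing tests would keep passing.

```rust
/// Hypothetical setting; the name is a placeholder, not an existing API.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum BloomFilterLayout {
    /// Old behavior: all Bloom Filters at the end of the file.
    #[default]
    AtEnd,
    /// New behavior: Bloom Filters interleaved with the row groups.
    Interleaved,
}

/// Stand-in for `WriterProperties`; only the new field is modelled.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
struct SketchWriterProperties {
    bloom_filter_layout: BloomFilterLayout,
}

/// Stand-in for the properties builder.
#[derive(Default)]
struct SketchWriterPropertiesBuilder {
    bloom_filter_layout: BloomFilterLayout,
}

impl SketchWriterPropertiesBuilder {
    /// Selects where Bloom Filters are written.
    fn set_bloom_filter_layout(mut self, layout: BloomFilterLayout) -> Self {
        self.bloom_filter_layout = layout;
        self
    }

    /// Finalizes the properties.
    fn build(self) -> SketchWriterProperties {
        SketchWriterProperties {
            bloom_filter_layout: self.bloom_filter_layout,
        }
    }
}
```

With this shape, callers who want the lower memory usage would opt in with `set_bloom_filter_layout(BloomFilterLayout::Interleaved)`, and everyone else would keep the current file layout.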
# Are there any user-facing changes?
The layout of output files changes significantly. This may have a negative
performance effect on readers expecting data locality, as Bloom Filters are now
scattered across the file.