felipecrv commented on PR #35:
URL: https://github.com/apache/arrow-experiments/pull/35#issuecomment-2339451687
We can make Brotli less impressive by feeding it more random data:
```
output2.arrows 945M
output2.arrows.br 264M (brotli still a winner)
output2.arrows.gz 344M (gzip now beats zstd on size)
output2.arrows.zstd 371M (Zstd is good)
```
(generated batches with random values instead of simply slicing from a big
array)
```diff
--- a/http/get_compressed/python/server/server.py
+++ b/http/get_compressed/python/server/server.py
@@ -72,13 +72,14 @@ def example_batches(tickers):
     total_records = 42_000_000
     batch_len = 6 * 1024
     # all the batches sent are random slices of the larger base batch
-    base_batch = example_batch(tickers, length=8 * batch_len)
+    # base_batch = example_batch(tickers, length=8 * batch_len)
     batches = []
     records = 0
     while records < total_records:
         length = min(batch_len, total_records - records)
-        offset = randint(0, base_batch.num_rows - length - 1)
-        batch = base_batch.slice(offset, length)
+        # offset = randint(0, base_batch.num_rows - length - 1)
+        # batch = base_batch.slice(offset, length)
+        batch = example_batch(tickers, length)
         batches.append(batch)
         records += length
     return batches
```
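To see why fresh random batches blunt every codec, here is a minimal stdlib-only sketch comparing the two generation strategies from the diff. It rests on assumptions: raw random bytes and `zlib` stand in for Arrow record batches and the HTTP codecs, the helper names (`random_bytes`, `sliced_payload`, `fresh_payload`, `ratio`) are hypothetical, and the sizes are shrunk so the base buffer fits inside DEFLATE's 32 KB window.

```python
import random
import zlib


def random_bytes(rng, n):
    # n pseudo-random bytes from a seeded generator (reproducible runs)
    return rng.getrandbits(8 * n).to_bytes(n, "little")


def sliced_payload(rng, chunk_len, n_chunks):
    # Old strategy: every chunk is a random slice of one shared base buffer,
    # so the same bytes keep reappearing in the stream.
    base = random_bytes(rng, 8 * chunk_len)
    chunks = []
    for _ in range(n_chunks):
        offset = rng.randint(0, len(base) - chunk_len)
        chunks.append(base[offset:offset + chunk_len])
    return b"".join(chunks)


def fresh_payload(rng, chunk_len, n_chunks):
    # New strategy: every chunk is generated from scratch.
    return b"".join(random_bytes(rng, chunk_len) for _ in range(n_chunks))


def ratio(payload):
    # Compressed size as a fraction of the original size (lower is better).
    return len(zlib.compress(payload, 6)) / len(payload)


rng = random.Random(0)
sliced = sliced_payload(rng, chunk_len=1024, n_chunks=256)
fresh = fresh_payload(rng, chunk_len=1024, n_chunks=256)
print(f"sliced from base: {ratio(sliced):.3f}")
print(f"fresh random:     {ratio(fresh):.3f}")
```

The sliced payload compresses well because the codec keeps finding back-references to earlier slices, while the fresh payload stays close to its original size. Real ticker, price, and volume columns are not uniformly random bytes, so the gap in the actual benchmark is smaller than in this toy setup.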
What is the CPU overhead though? All requests are over `127.0.0.1`, so CPU
cost should dominate. And `zstd` wins big: it finishes in about 2.6 seconds,
versus roughly 48 for gzip and 52 for brotli, even though it doesn't produce
the smallest response the way brotli does.
```
$ python client.py
[identity]: Requesting data from http://127.0.0.1:8008 with `identity` encoding.
[identity]: Schema received in 0.008 seconds. schema=(ticker, price, volume).
[identity]: First batch of 6836 received and processed in 0.009 seconds
[identity]: Processing of all batches completed in 0.238 seconds.
[zstd]: Requesting data from http://127.0.0.1:8008 with `zstd` encoding.
[zstd]: Schema received in 0.004 seconds. schema=(ticker, price, volume).
[zstd]: First batch of 6836 received and processed in 0.004 seconds
[zstd]: Processing of all batches completed in 2.613 seconds.
[br]: Requesting data from http://127.0.0.1:8008 with `br` encoding.
[br]: Schema received in 0.512 seconds. schema=(ticker, price, volume).
[br]: First batch of 6836 received and processed in 0.512 seconds
[br]: Processing of all batches completed in 52.460 seconds.
[gzip]: Requesting data from http://127.0.0.1:8008 with `gzip` encoding.
[gzip]: Schema received in 0.044 seconds. schema=(ticker, price, volume).
[gzip]: First batch of 6836 received and processed in 0.044 seconds
[gzip]: Processing of all batches completed in 47.742 seconds.
```
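The size-versus-CPU trade-off can also be measured in isolation, without the HTTP round trip. The sketch below is an illustration under assumptions, not the benchmark used for the numbers above: `benchmark` is a hypothetical helper, the payload is synthetic, and only stdlib `zlib` at two levels is timed. Brotli and Zstandard need third-party packages, but any `bytes -> bytes` compress function could be added to the same dictionary.

```python
import random
import time
import zlib


def benchmark(codecs, payload):
    # Map each codec name to (compressed size in bytes, CPU seconds spent).
    results = {}
    for name, compress in codecs.items():
        start = time.process_time()  # CPU time, not wall-clock time
        compressed = compress(payload)
        elapsed = time.process_time() - start
        results[name] = (len(compressed), elapsed)
    return results


# Synthetic, moderately compressible payload: random symbols from a small
# alphabet (an assumption; real Arrow batches will behave differently).
rng = random.Random(0)
payload = bytes(rng.choices(b"ABCDEFGH", k=500_000))

codecs = {
    "identity": lambda data: data,
    "zlib-1": lambda data: zlib.compress(data, 1),
    "zlib-9": lambda data: zlib.compress(data, 9),
}

for name, (size, seconds) in benchmark(codecs, payload).items():
    fraction = size / len(payload)
    print(f"[{name}]: {fraction:.3f} of original size, {seconds:.3f} s CPU")
```

Over loopback the transfer itself is nearly free, so the CPU column is the one that should track the end-to-end timings reported by the client.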