felipecrv commented on PR #35:
URL: https://github.com/apache/arrow-experiments/pull/35#issuecomment-2339451687

   We can make Brotli less impressive by feeding it more random data:
   
   ```
   output2.arrows        945M
   output2.arrows.br     264M (brotli still a winner)
   output2.arrows.gz     344M (gzip now beats zstd on size)
   output2.arrows.zstd   371M (Zstd is good)
   ```
   
   (Each batch is now generated with random values instead of being sliced
   from one big array.)
   
   ```diff
   --- a/http/get_compressed/python/server/server.py
   +++ b/http/get_compressed/python/server/server.py
   @@ -72,13 +72,14 @@ def example_batches(tickers):
        total_records = 42_000_000
        batch_len = 6 * 1024
        # all the batches sent are random slices of the larger base batch
   -    base_batch = example_batch(tickers, length=8 * batch_len)
   +    # base_batch = example_batch(tickers, length=8 * batch_len)
        batches = []
        records = 0
        while records < total_records:
            length = min(batch_len, total_records - records)
   -        offset = randint(0, base_batch.num_rows - length - 1)
   -        batch = base_batch.slice(offset, length)
   +        # offset = randint(0, base_batch.num_rows - length - 1)
   +        # batch = base_batch.slice(offset, length)
   +        batch = example_batch(tickers, length)
            batches.append(batch)
            records += length
        return batches
   ```
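
To illustrate why this change hurts the compression ratio, here is a minimal sketch that compares a stream built from slices of one small base buffer against a stream of freshly generated random data. It is not the server code: it works on raw bytes rather than Arrow record batches, uses the standard-library `lzma` module as a stand-in for a compressor with a large window, and the sizes and the `build_streams` helper are hypothetical.

```python
import lzma
import random


def build_streams(slice_len=8 * 1024, n_slices=200, seed=0):
    """Return (sliced_stream, fresh_stream) of equal length.

    sliced_stream repeats random slices of one base buffer (like the
    original server); fresh_stream generates new random bytes for every
    slice (like the patched server). All sizes here are illustrative.
    """
    rng = random.Random(seed)
    base = rng.randbytes(8 * slice_len)  # requires Python 3.9+
    sliced, fresh = [], []
    for _ in range(n_slices):
        offset = rng.randint(0, len(base) - slice_len)
        sliced.append(base[offset:offset + slice_len])
        fresh.append(rng.randbytes(slice_len))
    return b"".join(sliced), b"".join(fresh)


sliced_stream, fresh_stream = build_streams()
sliced_size = len(lzma.compress(sliced_stream))
fresh_size = len(lzma.compress(fresh_stream))
print(f"raw:    {len(sliced_stream)} bytes each")
print(f"sliced: {sliced_size} bytes compressed")
print(f"fresh:  {fresh_size} bytes compressed")
```

The sliced stream keeps repeating content the compressor has already seen, so it shrinks to roughly the size of the base buffer, while the fresh stream stays close to its raw size.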
   
   What about the CPU overhead, though? All requests go over `127.0.0.1`, so
   CPU cost should dominate the timings. `zstd` wins big here: it finishes in
   2.6 s versus 47.7 s for gzip and 52.5 s for brotli, even though it doesn't
   produce the smallest response (brotli does).
   
   ```
   $ python client.py
   [identity]: Requesting data from http://127.0.0.1:8008 with `identity` encoding.
   [identity]: Schema received in 0.008 seconds. schema=(ticker, price, volume).
   [identity]: First batch of 6836 received and processed in 0.009 seconds
   [identity]: Processing of all batches completed in 0.238 seconds.
       [zstd]: Requesting data from http://127.0.0.1:8008 with `zstd` encoding.
       [zstd]: Schema received in 0.004 seconds. schema=(ticker, price, volume).
       [zstd]: First batch of 6836 received and processed in 0.004 seconds
       [zstd]: Processing of all batches completed in 2.613 seconds.
         [br]: Requesting data from http://127.0.0.1:8008 with `br` encoding.
         [br]: Schema received in 0.512 seconds. schema=(ticker, price, volume).
         [br]: First batch of 6836 received and processed in 0.512 seconds
         [br]: Processing of all batches completed in 52.460 seconds.
       [gzip]: Requesting data from http://127.0.0.1:8008 with `gzip` encoding.
       [gzip]: Schema received in 0.044 seconds. schema=(ticker, price, volume).
       [gzip]: First batch of 6836 received and processed in 0.044 seconds
       [gzip]: Processing of all batches completed in 47.742 seconds.
   ```
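
For a rough way to reproduce this kind of size-versus-time comparison without the HTTP server, the sketch below times each codec on one in-memory payload. It is an assumption-laden stand-in for the client above: the `measure` helper and the synthetic payload are hypothetical, only `gzip` comes from the standard library, and `zstd` and `br` are added only if the third-party `zstandard` and `brotli` packages happen to be installed. Default compression levels are used, which may differ from whatever the server is configured with.

```python
import gzip
import random
import time


def measure(payload, codecs):
    """Return {name: (compressed_size, seconds)} for each codec callable."""
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(payload)
        elapsed = time.perf_counter() - start
        results[name] = (len(compressed), elapsed)
    return results


# identity is the uncompressed baseline; gzip is from the standard library.
codecs = {
    "identity": lambda data: data,
    "gzip": lambda data: gzip.compress(data, compresslevel=6),
}

# Optional third-party bindings, used only when they are installed.
try:
    import zstandard
    codecs["zstd"] = zstandard.ZstdCompressor().compress
except ImportError:
    pass
try:
    import brotli
    codecs["br"] = brotli.compress
except ImportError:
    pass

# Synthetic payload: random bytes drawn from only 16 distinct values,
# so it is partly compressible (an illustrative choice, not Arrow data).
rng = random.Random(0)
payload = bytes(rng.randrange(16) for _ in range(1 << 18))

results = measure(payload, codecs)
for name, (size, seconds) in results.items():
    print(f"{name:>8}: {size:>8} bytes in {seconds:.3f} s")
```

Timings from a sketch like this only cover compression on one machine; the client log above also includes decompression and HTTP transfer over loopback.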

