Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Antoine Pitrou Tue, 03 Mar 2020 04:15:53 -0800


If we want to use an HTTP header, it would be more of an Accept-Encoding
header, no?


In any case, we would have to put non-standard values there (e.g. lz4),
so I'm not sure how desirable it is to repurpose HTTP headers for that,
rather than add some dedicated field to the Flight messages.

Regards

Antoine.
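
To make the dedicated-field alternative concrete, here is a minimal sketch of
how a server might pick a codec from the list a client advertises. It is not
part of Flight or gRPC: `SERVER_CODECS` and `choose_codec` are hypothetical
names introduced only for illustration, and only the codec names (lz4, zstd)
come from this thread.

```python
from typing import Optional, Sequence

# Hypothetical: codecs this server can produce, in its own order of preference.
SERVER_CODECS = ("zstd", "lz4")


def choose_codec(client_codecs: Sequence[str]) -> Optional[str]:
    """Return the first server-preferred codec that the client also supports.

    A result of None means "send uncompressed", which also covers clients
    that advertise no codecs at all.
    """
    offered = {name.strip().lower() for name in client_codecs}
    for codec in SERVER_CODECS:
        if codec in offered:
            return codec
    return None
```

Falling back to uncompressed data when nothing matches keeps clients without
compression support working, which is the capability-negotiation concern
raised further down the thread.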


On 03/03/2020 at 12:52, David Li wrote:
> gRPC supports headers so for Flight, we could send essentially an Accept
> header and perhaps a Content-Type header.
> 
> David
> 
> On Mon, Mar 2, 2020, 23:15 Micah Kornfield <[email protected]> wrote:
> 
>> Hi Wes,
>> A few thoughts on this.  In general, I think it is a good idea.  But before
>> proceeding, I think the following points are worth discussing:
>> 1.  Does this actually improve throughput/latency for Flight? (I think you
>> mentioned you would follow-up with benchmarks).
>> 2.  I think we should limit the number of supported compression schemes to
>> only 1 or 2.  I think the criteria for selection should be speed and native
>> implementations available across the widest possible range of languages.
>> As far as I can tell, zstd only has bindings in Java via JNI, but my
>> understanding is it is probably the right type of compression for our
>> use cases.  So I think zstd + potentially 1 more.
>> 3.  Commitment from someone on the Java side to implement this.
>> 4.  This doesn't need to be coupled with this change per se, but for
>> something like Flight it would be good to have a standard mechanism for
>> negotiating server/client capabilities (e.g. client doesn't support
>> compression or only supports a subset).
>>
>>
>> Thanks,
>> Micah
>>
>> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <[email protected]> wrote:
>>
>>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <[email protected]> wrote:
>>>>
>>>>
>>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
>>>>> In the context of a "next version of the Feather format" ARROW-5510
>>>>> (which is consumed only by Python and R at the moment), I have been
>>>>> looking at compressing buffers using fast compressors like ZSTD when
>>>>> writing the RecordBatch bodies. This could be handled privately as an
>>>>> implementation detail of the Feather file, but since ZSTD compression
>>>>> could improve throughput in Flight, for example, I thought I would
>>>>> bring it up for discussion.
>>>>>
>>>>> I can see two simple compression strategies:
>>>>>
>>>>> * Compress the entire message body in one-shot, writing the result out
>>>>> with an 8-byte int64 prefix indicating the uncompressed size
>>>>> * Compress each non-zero-length constituent Buffer prior to writing to
>>>>> the body (and using the same uncompressed-length-prefix when writing
>>>>> the compressed buffer)
>>>>>
>>>>> The latter strategy is preferable for scenarios where we may project
>>>>> out only a few fields from a larger record batch (such as reading from
>>>>> a memory-mapped file).
>>>>
>>>> Agreed.  It may also allow using different compression strategies for
>>>> different kinds of buffers (for example a bytestream splitting strategy
>>>> for floats and doubles, or a delta encoding strategy for integers).
>>>
>>> If we wanted to allow for different compression to apply to different
>>> buffers, I think we will need a new Message type because this would
>>> inflate metadata sizes in a way that is not likely to be acceptable
>>> for the current uncompressed use case.
>>>
>>> Here is my strawman proposal
>>>
>>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
>>>
>>>>> Implementation could be accomplished by one of the following methods:
>>>>>
>>>>> * Setting a field in Message.custom_metadata
>>>>> * Adding a new field to Message
>>>>
>>>> I think it has to be a new field in Message.  Making it an ignorable
>>>> metadata field means non-supporting receivers will decode and interpret
>>>> the data wrongly.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>
>>
>
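
For readers following the quoted proposal, the per-buffer strategy (compress
each non-zero-length buffer and prefix it with its uncompressed size as an
8-byte int64) can be sketched roughly as below. This is an illustration under
stated assumptions, not the strawman implementation: zlib from the Python
standard library stands in for ZSTD, little-endian byte order for the prefix
is assumed, and `compress_buffer` / `decompress_buffer` are hypothetical
helper names.

```python
import struct
import zlib

# 8-byte signed int64 length prefix; little-endian is an assumption here.
PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer, prefixed with its uncompressed length."""
    if len(buf) == 0:
        # Zero-length buffers are written as-is, with no prefix.
        return b""
    # zlib is a stand-in for ZSTD so the sketch needs no third-party package.
    return PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(data: bytes) -> bytes:
    """Reverse compress_buffer and check the recorded length."""
    if len(data) == 0:
        return b""
    (size,) = PREFIX.unpack_from(data, 0)
    out = zlib.decompress(data[PREFIX.size:])
    if len(out) != size:
        raise ValueError(f"expected {size} bytes, got {len(out)}")
    return out
```

Because each buffer carries its own prefix and is compressed independently, a
reader that projects only a few fields can decompress just those buffers,
which is the advantage claimed for the second strategy.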
