(I meant to say SHA256 instead of SHA1)
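
For the benchmark suggested in the quoted message below, a rough first comparison could look like the following. This is only a sketch using the Python standard library: it times CRC32 (zlib) against SHA256 (hashlib) on a single thread. xxhash is not in the standard library, so it is left out here and would have to be added through a separate binding. The helper names (crc32_digest, sha256_digest, time_digest, compare) are made up for illustration and are not part of Arrow or Plasma.

```python
import hashlib
import time
import zlib


def crc32_digest(data: bytes) -> bytes:
    """CRC32 of data as 4 big-endian bytes."""
    return zlib.crc32(data).to_bytes(4, "big")


def sha256_digest(data: bytes) -> bytes:
    """SHA256 of data as 32 raw bytes."""
    return hashlib.sha256(data).digest()


def time_digest(fn, data: bytes, repeats: int = 5) -> float:
    """Best-of-N wall-clock seconds for one pass of fn over data."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on coarse timers.
    return max(best, 1e-9)


def compare(size: int = 16 * 1024 * 1024) -> dict:
    """Return throughput in MB/s for each digest on a zero-filled buffer."""
    data = bytes(size)
    candidates = {"crc32": crc32_digest, "sha256": sha256_digest}
    return {
        name: size / time_digest(fn, data) / 1e6
        for name, fn in candidates.items()
    }


if __name__ == "__main__":
    for name, mb_per_s in compare().items():
        print(f"{name}: {mb_per_s:.0f} MB/s")
```

The numbers depend heavily on the machine and on how zlib and OpenSSL were built, so they are only a starting point; a real comparison would need to use the C++ implementations and the multi-threaded path.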

On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmor...@gmail.com> wrote:

> Hey Micah,
>
> In Plasma, we use xxhash to compute a hash/checksum [1] (it is
> computed in parallel using multiple threads) and have had good experience with
> it -- all data in Ray is checksummed this way. Initially there were
> problems with uninitialized bits in the arrow representation, but that has
> been resolved a while back, so there should be no blocker for this. It
> would also be great to benchmark xxhash against CRC32 and see how they
> compare performance-wise. Initially we used SHA1 but there was non-trivial
> overhead. Maybe there is a better implementation out there (we used
> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
> ).
>
> Best,
> Philipp.
>
> [1]
> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
>
> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Arrow Dev,
>> As we expand the use-cases for Arrow to move it more across system
>> boundaries (Flight) and make it live longer (e.g. in the file format), it
>> seems to make sense to build in a mechanism for data integrity verification
>> (e.g. a checksum like CRC32 or in some cases a cryptographic hash like
>> SHA1).
>>
>> This can be done in a backwards-compatible manner for the actual data buffers
>> by adding metadata to the headers (this could be a use-case for custom
>> metadata, but I would prefer to make it explicit). However, to make sure we
>> have full coverage, we would need to augment the stream [1] to be something
>> like:
>>
>> <metadata_size: int32>
>> <metadata_flatbuffer: bytes>
>> <signature_size: int16>
>> <metadata signature>
>> <padding>
>> <message body>
>>
>> I don't think we should require implementations to actually use this
>> functionality but we should make it a possibility (signature size could be
>> zero meaning no checksum/hash is provided) and have it be standardized if
>> possible.
>>
>> Thoughts?
>>
>> Sorry if this has already been discussed but I couldn't find anything from
>> searching JIRA or the mailing list archive, and it doesn't look like it is
>> in the format spec.
>>
>> Thanks,
>> Micah
>>
>> [1] https://arrow.apache.org/docs/ipc.html
>>
>
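
To make the framing proposed in the quoted message concrete (metadata_size, metadata_flatbuffer, signature_size, metadata signature, padding, message body), here is a minimal sketch of writing and reading one such message. Several things in it are assumptions for illustration only and are not part of the proposal or the Arrow format: CRC32 as the signature, little-endian sizes, padding to an 8-byte boundary, metadata treated as opaque bytes, and the function names frame_message and parse_message. A signature_size of zero means no checksum is provided, as suggested above.

```python
import struct
import zlib
from typing import Tuple

ALIGNMENT = 8  # assumed padding boundary for this sketch


def frame_message(metadata: bytes, body: bytes, with_checksum: bool = True) -> bytes:
    """Lay out: metadata_size, metadata, signature_size, signature, padding, body."""
    # CRC32 over the metadata bytes; an empty signature means "no checksum".
    signature = struct.pack("<I", zlib.crc32(metadata)) if with_checksum else b""
    header = (
        struct.pack("<i", len(metadata))     # <metadata_size: int32>
        + metadata                           # <metadata_flatbuffer: bytes>
        + struct.pack("<h", len(signature))  # <signature_size: int16>
        + signature                          # <metadata signature>
    )
    padding = b"\x00" * (-len(header) % ALIGNMENT)  # <padding>
    return header + padding + body                  # <message body>


def parse_message(frame: bytes) -> Tuple[bytes, bytes]:
    """Return (metadata, body); raise ValueError if the checksum does not match."""
    (metadata_size,) = struct.unpack_from("<i", frame, 0)
    offset = 4
    metadata = frame[offset:offset + metadata_size]
    offset += metadata_size

    (signature_size,) = struct.unpack_from("<h", frame, offset)
    offset += 2
    signature = frame[offset:offset + signature_size]
    offset += signature_size

    # Verify only when a signature is present (signature_size == 0 skips it).
    if signature_size and signature != struct.pack("<I", zlib.crc32(metadata)):
        raise ValueError("metadata checksum mismatch")

    offset += -offset % ALIGNMENT  # skip padding up to the next boundary
    # Simplification: everything after the padding is treated as the body.
    return metadata, frame[offset:]
```

In this sketch the signature only covers the metadata; checksums for the data buffers themselves would live inside the metadata, as the quoted message describes.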
