Thanks Philipp,

Yeah, I probably shouldn't have said SHA1 either :)  I'm not too
concerned with a particular hash/checksum implementation.  It would be good
to have at least one or two well-supported ones, and a migration path to
support more if necessary without breaking backwards compatibility of the
file/streaming formats.

Best,
-Micah

On Tue, Mar 5, 2019 at 9:47 PM Philipp Moritz <pcmor...@gmail.com> wrote:

> (I meant to say SHA256 instead of SHA1)
>
> On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmor...@gmail.com> wrote:
>
>> Hey Micah,
>>
>> In Plasma, we are using xxhash to compute a hash/checksum [1] (it is
>> computed in parallel using multiple threads) and have had good experience
>> with it -- all data in Ray is checksummed this way. Initially there were
>> problems with uninitialized bits in the Arrow representation, but that was
>> resolved a while back, so there should be no blocker for this. It
>> would also be great to benchmark xxhash against CRC32 and see how they
>> compare performance-wise. Initially we used SHA1 but there was non-trivial
>> overhead. Maybe there is a better implementation out there (we used
>> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
>> ).
>>
>> Best,
>> Philipp.
>>
>> [1]
>> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
>>
>> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Hi Arrow Dev,
>>> As we expand the use-cases for Arrow to move it more across system
>>> boundaries (Flight) and make it live longer (e.g. in the file format), it
>>> seems to make sense to build in a mechanism for data integrity
>>> verification (e.g. a checksum like CRC32 or in some cases a cryptographic
>>> hash like SHA1).
>>>
>>> This can be done in a backwards-compatible manner for the actual data
>>> buffers by adding metadata to the headers (this could be a use-case for
>>> custom metadata but I would prefer to make it explicit).  However, to
>>> make sure we have full coverage, we would need to augment the stream [1]
>>> to be something like:
>>>
>>> <metadata_size: int32>
>>> <metadata_flatbuffer: bytes>
>>> <signature_size: int16>
>>> <metadata signature>
>>> <padding>
>>> <message body>
>>>
>>> I don't think we should require implementations to actually use this
>>> functionality but we should make it a possibility (signature size could
>>> be zero, meaning no checksum/hash is provided) and have it be
>>> standardized if possible.
>>>
>>> Thoughts?
>>>
>>> Sorry if this has already been discussed but I couldn't find anything
>>> from searching JIRA or the mailing list archive, and it doesn't look like
>>> it is in the format spec.
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://arrow.apache.org/docs/ipc.html
>>>
>>
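For illustration, the framing proposed in the quoted message
(metadata_size, metadata_flatbuffer, signature_size, metadata signature,
padding, message body) could be sketched roughly as below. This is only a
sketch under several assumptions that are not in the thread: CRC32 (via
Python's zlib) stands in for the signature, integers are little-endian,
padding aligns the body to 8 bytes, and the body runs to the end of the
frame. The names encode_message, decode_message and ALIGNMENT are
hypothetical, and none of this is an existing Arrow API.

```python
import struct
import zlib
from typing import Tuple

ALIGNMENT = 8  # assumed body alignment; the thread does not specify one


def encode_message(metadata: bytes, body: bytes, with_checksum: bool = True) -> bytes:
    """Frame a message using the layout proposed in the thread."""
    # Assumption: the signature is a little-endian CRC32 of the metadata bytes.
    signature = struct.pack("<I", zlib.crc32(metadata)) if with_checksum else b""
    header = (
        struct.pack("<i", len(metadata))     # <metadata_size: int32>
        + metadata                           # <metadata_flatbuffer: bytes>
        + struct.pack("<h", len(signature))  # <signature_size: int16>
        + signature                          # <metadata signature>
    )
    padding = b"\x00" * (-len(header) % ALIGNMENT)  # <padding>
    return header + padding + body                  # <message body>


def decode_message(frame: bytes) -> Tuple[bytes, bytes]:
    """Parse a frame and verify the metadata checksum if one is present."""
    (metadata_size,) = struct.unpack_from("<i", frame, 0)
    offset = 4
    metadata = frame[offset:offset + metadata_size]
    offset += metadata_size
    (signature_size,) = struct.unpack_from("<h", frame, offset)
    offset += 2
    signature = frame[offset:offset + signature_size]
    offset += signature_size
    if signature_size not in (0, 4):
        raise ValueError("this sketch only understands a 4-byte CRC32 signature")
    if signature_size == 4:
        (expected,) = struct.unpack("<I", signature)
        if zlib.crc32(metadata) != expected:
            raise ValueError("metadata checksum mismatch")
    # A signature_size of zero means no checksum/hash was provided.
    offset += -offset % ALIGNMENT  # skip padding up to the assumed alignment
    return metadata, frame[offset:]
```

In this sketch the signature covers only the metadata; checksums for the
data buffers themselves would live in the metadata headers, as the quoted
proposal suggests. A reader that does not care about integrity can skip
signature_size bytes without understanding the algorithm.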
