(I meant to say SHA256 instead of SHA1)

On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmor...@gmail.com> wrote:
> Hey Micah,
>
> in plasma, we are using xxhash to compute a hash/checksum [1] (it is
> computed in parallel using multiple threads) and have had good experience
> with it -- all data in Ray is checksummed this way. Initially there were
> problems with uninitialized bits in the arrow representation, but that was
> resolved a while back, so there should be no blocker for this. It
> would also be great to benchmark xxhash against CRC32 and see how they
> compare performance-wise. Initially we used SHA1 but there was non-trivial
> overhead. Maybe there is a better implementation out there (we used
> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
> ).
>
> Best,
> Philipp.
>
> [1]
> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
>
> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Arrow Dev,
>> As we expand the use-cases for Arrow to move it more across system
>> boundaries (Flight) and make it live longer (e.g. in the file format), it
>> seems to make sense to build in a mechanism for data integrity
>> verification (e.g. a checksum like CRC32 or in some cases a
>> cryptographic hash like SHA1).
>>
>> This can be done in a backwards-compatible manner for the actual data
>> buffers by adding metadata to the headers (this could be a use-case for
>> custom metadata but I would prefer to make it explicit). However, to
>> make sure we have full coverage, we would need to augment the stream [1]
>> to be something like:
>>
>> <metadata_size: int32>
>> <metadata_flatbuffer: bytes>
>> <signature_size: int16>
>> <metadata signature>
>> <padding>
>> <message body>
>>
>> I don't think we should require implementations to actually use this
>> functionality but we should make it a possibility (signature size could be
>> zero, meaning no checksum/hash is provided) and have it be standardized if
>> possible.
>>
>> Thoughts?
>>
>> Sorry if this has already been discussed but I couldn't find anything
>> from searching JIRA or the mailing list archive, and it doesn't look like
>> it is in the format spec.
>>
>> Thanks,
>> Micah
>>
>> [1] https://arrow.apache.org/docs/ipc.html
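
To make the framing proposed in the quoted message concrete, here is a minimal Python sketch of writing and reading one message in that field order. It is only an illustration, not Arrow code: the names frame_message and parse_message are made up, and the little-endian encoding, the 8-byte padding boundary, the choice of CRC32 over the metadata bytes, and treating everything after the padding as the message body are all assumptions that the proposal does not pin down.

```python
import struct
import zlib

ALIGNMENT = 8  # assumed padding boundary; the proposal does not fix one


def frame_message(metadata, body, with_checksum=True):
    """Lay out one message in the proposed field order (hypothetical helper)."""
    # A signature size of zero means no checksum/hash is provided.
    signature = struct.pack("<I", zlib.crc32(metadata)) if with_checksum else b""
    header = (
        struct.pack("<i", len(metadata))     # <metadata_size: int32>
        + metadata                           # <metadata_flatbuffer: bytes>
        + struct.pack("<h", len(signature))  # <signature_size: int16>
        + signature                          # <metadata signature>
    )
    padding = b"\x00" * (-len(header) % ALIGNMENT)  # <padding>
    return header + padding + body                  # <message body>


def parse_message(frame):
    """Split a frame into (metadata, body), verifying the checksum if present."""
    (metadata_size,) = struct.unpack_from("<i", frame, 0)
    offset = 4
    metadata = frame[offset:offset + metadata_size]
    offset += metadata_size
    (signature_size,) = struct.unpack_from("<h", frame, offset)
    offset += 2
    signature = frame[offset:offset + signature_size]
    offset += signature_size
    # Only verify when a signature was actually written.
    if signature_size and signature != struct.pack("<I", zlib.crc32(metadata)):
        raise ValueError("metadata checksum mismatch")
    offset += -offset % ALIGNMENT  # skip padding up to the assumed boundary
    return metadata, frame[offset:]  # assumption: the rest of the frame is the body
```

A reader that does not care about integrity can skip signature_size bytes without interpreting them, which is what keeps the checksum optional for implementations.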