Hey Micah,

in Plasma we use xxhash to compute a hash/checksum [1] (it is computed in parallel using multiple threads) and have had good experience with it; all data in Ray is checksummed this way. Initially there were problems with uninitialized bits in the Arrow representation, but that was resolved a while back, so there should be no blocker for this.

It would also be great to benchmark xxhash against CRC32 and see how they compare performance-wise. Initially we used SHA1, but it had non-trivial overhead. Maybe there is a better implementation out there (we used https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c).
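
To make the "computed in parallel using multiple threads" part concrete, here is a minimal sketch of one way to do it: split the buffer into fixed-size chunks, checksum each chunk in a worker thread, then checksum the ordered list of per-chunk results. This is not the Plasma code. `zlib.crc32` stands in for xxhash (which is not in the Python standard library), and `chunked_checksum`, `CHUNK_SIZE` and the combining step are hypothetical choices made for illustration.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chunk size for this sketch; not the value Plasma uses.
CHUNK_SIZE = 1 << 20


def chunked_checksum(data, chunk_size=CHUNK_SIZE, workers=4):
    """Checksum fixed-size chunks in worker threads, then checksum the results.

    zlib.crc32 is a stand-in for xxhash here (assumption for illustration).
    """
    view = memoryview(data)  # slice without copying the buffer
    chunks = [view[i:i + chunk_size] for i in range(0, len(view), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the result does not depend
        # on which thread finishes first.
        partials = list(pool.map(zlib.crc32, chunks))
    # Combine by checksumming the little-endian per-chunk values in order.
    packed = b"".join(struct.pack("<I", p) for p in partials)
    return zlib.crc32(packed)
```

The result depends on the chunk size but not on the number of threads, so a writer and a reader only have to agree on the chunk size. The same harness could be timed with different per-chunk functions to compare candidates such as xxhash and CRC32.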
Best,
Philipp.

[1] https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684

On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Arrow Dev,
>
> As we expand the use-cases for Arrow to move it more across system
> boundaries (Flight) and make it live longer (e.g. in the file format), it
> seems to make sense to build in a mechanism for data integrity verification
> (e.g. a checksum like CRC32 or in some cases a cryptographic hash like
> SHA1).
>
> This can be done in a backwards-compatible manner for the actual data buffers
> by adding metadata to the headers (this could be a use-case for custom
> metadata but I would prefer to make it explicit). However, to make sure we
> have full coverage, we would need to augment the stream [1] to be something
> like:
>
> <metadata_size: int32>
> <metadata_flatbuffer: bytes>
> <signature_size: int16>
> <metadata signature>
> <padding>
> <message body>
>
> I don't think we should require implementations to actually use this
> functionality but we should make it a possibility (signature size could be
> zero meaning no checksum/hash is provided) and have it be standardized if
> possible.
>
> Thoughts?
>
> Sorry if this has already been discussed but I couldn't find anything from
> searching JIRA or the mailing list archive, and it doesn't look like it is
> in the format spec.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/ipc.html
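
For reference, the proposed message layout could be sketched roughly as follows. This is only an illustration of the framing in the quoted proposal, not an agreed format. The little-endian byte order, the 8-byte padding, the choice of CRC32 as the signature, and the names `encode_message` and `decode_message` are all assumptions made for this sketch, and the metadata is treated as opaque bytes rather than a real flatbuffer.

```python
import struct
import zlib

# Assumed alignment for the message body in this sketch.
ALIGNMENT = 8


def encode_message(metadata: bytes, body: bytes, with_checksum: bool = True) -> bytes:
    """Frame a message as size, metadata, signature size, signature, padding, body."""
    # Assumption: the signature is a CRC32 of the metadata bytes only.
    signature = struct.pack("<I", zlib.crc32(metadata)) if with_checksum else b""
    header = (
        struct.pack("<i", len(metadata))     # <metadata_size: int32>
        + metadata                           # <metadata_flatbuffer: bytes>
        + struct.pack("<h", len(signature))  # <signature_size: int16>
        + signature                          # <metadata signature>
    )
    padding = b"\x00" * (-len(header) % ALIGNMENT)  # <padding>
    return header + padding + body                   # <message body>


def decode_message(buf: bytes):
    """Parse a framed message; verify the signature if one is present."""
    (meta_size,) = struct.unpack_from("<i", buf, 0)
    offset = 4
    metadata = buf[offset:offset + meta_size]
    offset += meta_size
    (sig_size,) = struct.unpack_from("<h", buf, offset)
    offset += 2
    signature = buf[offset:offset + sig_size]
    offset += sig_size
    # A signature size of zero means no checksum/hash was provided.
    if sig_size and signature != struct.pack("<I", zlib.crc32(metadata)):
        raise ValueError("metadata checksum mismatch")
    offset += -offset % ALIGNMENT  # skip padding up to the aligned body
    return metadata, buf[offset:]
```

A reader that does not care about verification can still skip the signature using `signature_size` alone, which is what keeps the checksum optional.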