Thanks Philipp,

Yeah, I probably shouldn't have said SHA1 either :) I'm not too concerned with a particular hash/checksum implementation. It would be good to have at least one or two well-supported ones, and a migration path to support more if necessary without breaking backwards compatibility of the file/streaming formats.

Best,
-Micah

On Tue, Mar 5, 2019 at 9:47 PM Philipp Moritz <pcmor...@gmail.com> wrote:

> (I meant to say SHA256 instead of SHA1)
>
> On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmor...@gmail.com> wrote:
>
>> Hey Micah,
>>
>> in plasma, we are using xxhash to compute a hash/checksum [1] (it is
>> computed in parallel using multiple threads) and have good experience with
>> it -- all data in Ray is checksummed this way. Initially there were
>> problems with uninitialized bits in the arrow representation, but that has
>> been resolved a while back, so there should be no blocker for this. It
>> would also be great to benchmark xxhash against CRC32 and see how they
>> compare performance-wise. Initially we used SHA1 but there was non-trivial
>> overhead. Maybe there is a better implementation out there (we used
>> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
>> ).
>>
>> Best,
>> Philipp.
>>
>> [1]
>> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
>>
>> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Hi Arrow Dev,
>>> As we expand the use-cases for Arrow to move it more across system
>>> boundaries (Flight) and make it live longer (e.g. in the file format), it
>>> seems to make sense to build in a mechanism for data integrity
>>> verification (e.g. a checksum like CRC32 or in some cases a
>>> cryptographic hash like SHA1).
>>>
>>> This can be done in a backwards-compatible manner for the actual data
>>> buffers by adding metadata to the headers (this could be a use-case for
>>> custom metadata but I would prefer to make it explicit). However, to make
>>> sure we have full coverage, we would need to augment the stream [1] to be
>>> something like:
>>>
>>> <metadata_size: int32>
>>> <metadata_flatbuffer: bytes>
>>> <signature_size: int16>
>>> <metadata signature>
>>> <padding>
>>> <message body>
>>>
>>> I don't think we should require implementations to actually use this
>>> functionality but we should make it a possibility (signature size could
>>> be zero, meaning no checksum/hash is provided) and have it be
>>> standardized if possible.
>>>
>>> Thoughts?
>>>
>>> Sorry if this has already been discussed but I couldn't find anything
>>> from searching JIRA or the mailing list archive, and it doesn't look like
>>> it is in the format spec.
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://arrow.apache.org/docs/ipc.html
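
To make the augmented stream layout from the original proposal concrete, here is a minimal Python sketch of writing and reading one framed message. It is not Arrow's implementation. The function names (`frame_message`, `read_message`), the 8-byte padding boundary, little-endian field encoding, and the choice of CRC32 over the metadata bytes as the signature are all assumptions made for illustration; the thread leaves the algorithm and the exact padding rule open.

```python
import struct
import zlib
from typing import Tuple

ALIGNMENT = 8  # assumed padding boundary; the thread does not specify one


def _padding(length: int) -> bytes:
    """Zero bytes needed to round `length` up to a multiple of ALIGNMENT."""
    return b"\x00" * (-length % ALIGNMENT)


def frame_message(metadata: bytes, body: bytes, with_checksum: bool = True) -> bytes:
    """Build <metadata_size><metadata><signature_size><signature><padding><body>."""
    # Assumed signature: CRC32 of the metadata bytes, stored as 4 bytes.
    # An empty signature (size 0) means no checksum is provided.
    signature = struct.pack("<I", zlib.crc32(metadata)) if with_checksum else b""
    header = (
        struct.pack("<i", len(metadata))     # metadata_size: int32
        + metadata                           # metadata_flatbuffer: bytes
        + struct.pack("<h", len(signature))  # signature_size: int16
        + signature                          # metadata signature
    )
    return header + _padding(len(header)) + body


def read_message(frame: bytes) -> Tuple[bytes, bytes]:
    """Parse a frame, verifying the metadata checksum when one is present."""
    (metadata_size,) = struct.unpack_from("<i", frame, 0)
    offset = 4
    metadata = frame[offset:offset + metadata_size]
    offset += metadata_size
    (signature_size,) = struct.unpack_from("<h", frame, offset)
    offset += 2
    signature = frame[offset:offset + signature_size]
    offset += signature_size
    if signature_size and signature != struct.pack("<I", zlib.crc32(metadata)):
        raise ValueError("metadata checksum mismatch")
    offset += -offset % ALIGNMENT  # skip padding before the message body
    return metadata, frame[offset:]
```

For example, `read_message(frame_message(b"meta", b"body"))` returns the original metadata and body, and passing `with_checksum=False` writes a zero `signature_size`, so the reader skips verification. Supporting more than one algorithm would additionally need an algorithm identifier somewhere in the metadata, which this sketch leaves out.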