Hi Wes, Thanks for the response. I was thinking being able to checksum everything. I agree it should be off by default. I'll put this on the back burner for now. If I can find some spare time (which won't likely be any time soon), I'll submit a PR for further discussion.
Cheers, Micah On Wed, Mar 6, 2019 at 7:07 AM Wes McKinney <wesmck...@gmail.com> wrote: > hi Micah, > > It seems like the checksum could be included in the Message flatbuffer > table instead of having to add things to the protocol > > https://github.com/apache/arrow/blob/master/format/Message.fbs#L94 > > Am I correct that computing a checksum on the message body is what is > mainly of interest? Beyond that I feel like we're getting into the > realm of truly esoteric issues > > Note that checksums are part of the Parquet file specification but > IIUC seldom used, so our mileage may vary on how useful this is. I > would probably want this turned off by default > > - Wes > > On Wed, Mar 6, 2019 at 12:07 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > > Doing some light research it looks xxhash has better cross-platform > support > > as is faster then a vanilla implementation of crc32 [1]. However, crc32c > > (a slightly different crc32 algorithm) is hardware accelerated on newer > > (circa 2016) Intel CPUs [2] and is potentially faster. > > > > [1] https://cyan4973.github.io/xxHash/ > > [2] https://github.com/Cyan4973/xxHash/issues/62 > > > > On Tue, Mar 5, 2019 at 9:55 PM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > > > Thanks Philipp, > > > > > > Yeah, I probably shouldn't have said SHA1 either :) I'm not too > > > concerned with a particular hash/checksum implementation. It would be > good > > > to have at least 1 or 2 well supported ones, and a migration path to > > > support more if necessary without breaking file/streaming formats for > > > backwards compatibility. > > > > > > Best, > > > -Micah > > > > > > On Tue, Mar 5, 2019 at 9:47 PM Philipp Moritz <pcmor...@gmail.com> > wrote: > > > > > >> (I meant to say SHA256 instead of SHA1) > > >> > > >> On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmor...@gmail.com> > wrote: > > >> > > >>> Hey Micah, > > >>> > > >>> in plasma, we are using xxhash to compute a hash/checksum [1] (it is > > >>> computed in parallel using multiple threads) and have good > experience with > > >>> it -- all data in Ray is checksummed this way. Initially there were > > >>> problems with uninitialized bits in the arrow representation, but > that has > > >>> been resolved a while back, so there should be no blocker for this. > It > > >>> would also be great to benchmark xxhash against CRC32 and see how > they > > >>> compare performance wise. Initially we used SHA1 but there was > non-trivial > > >>> overhead. Maybe there is a better implementation out there (we used > > >>> > https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c > > >>> ). > > >>> > > >>> Best, > > >>> Philipp. > > >>> > > >>> [1] > > >>> > https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684 > > >>> > > >>> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield < > emkornfi...@gmail.com> > > >>> wrote: > > >>> > > >>>> Hi Arrow Dev, > > >>>> As we expand the use-cases for Arrow to move it more across system > > >>>> boundaries (Flight) and make it live longer (e.g. in the file > format), > > >>>> it > > >>>> seems to make sense to build in a mechanism for data integrity > > >>>> verification > > >>>> (e.g. a checksum like CRC32 or in some cases a cryptographic hash > like > > >>>> SHA1). > > >>>> > > >>>> This can be done a backwards compatible manner for the actual data > > >>>> buffers > > >>>> by adding metadata to the headers (this could be a use-case for > custom > > >>>> metadata but I would prefer to make it explicit). However, to make > > >>>> sure we > > >>>> have full coverage, we would need to augment the stream [1] to be > > >>>> something > > >>>> like: > > >>>> > > >>>> <metadata_size: int32> > > >>>> <metadata_flatbuffer: bytes> > > >>>> <signature_size: int16> > > >>>> <metadata signature> > > >>>> <padding> > > >>>> <message body> > > >>>> > > >>>> I don't think we should require implementations to actual use this > > >>>> functionality but we should make it a possibility (signature size > could > > >>>> be > > >>>> zero meaning no checksum/hash is provided) and have it be > standardized > > >>>> if > > >>>> possible. > > >>>> > > >>>> Thoughts? > > >>>> > > >>>> Sorry if this has already been discussed but I could find anything > from > > >>>> searching JIRA or the mailing list archive, and it doesn't look > like it > > >>>> is > > >>>> in the format spec. > > >>>> > > >>>> Thanks, > > >>>> Micah > > >>>> > > >>>> [1] https://arrow.apache.org/docs/ipc.html > > >>>> > > >>> >