XXH3 (by the xxhash author) was recently presented, though it's still experimental for now: https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html
It is claimed to be significantly faster than xxhash, on all message sizes.

Regards

Antoine.


On 06/03/2019 at 07:06, Micah Kornfield wrote:
> Doing some light research, it looks like xxhash has better cross-platform
> support and is faster than a vanilla implementation of crc32 [1]. However,
> crc32c (a slightly different crc32 algorithm) is hardware accelerated on
> newer (circa 2016) Intel CPUs [2] and is potentially faster.
>
> [1] https://cyan4973.github.io/xxHash/
> [2] https://github.com/Cyan4973/xxHash/issues/62
>
> On Tue, Mar 5, 2019 at 9:55 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Thanks Philipp,
>>
>> Yeah, I probably shouldn't have said SHA1 either :) I'm not too
>> concerned with a particular hash/checksum implementation. It would be good
>> to have at least 1 or 2 well-supported ones, and a migration path to
>> support more if necessary without breaking file/streaming formats for
>> backwards compatibility.
>>
>> Best,
>> -Micah
>>
>> On Tue, Mar 5, 2019 at 9:47 PM Philipp Moritz <pcmor...@gmail.com> wrote:
>>
>>> (I meant to say SHA256 instead of SHA1)
>>>
>>> On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmor...@gmail.com> wrote:
>>>
>>>> Hey Micah,
>>>>
>>>> in plasma, we are using xxhash to compute a hash/checksum [1] (it is
>>>> computed in parallel using multiple threads) and have good experience with
>>>> it -- all data in Ray is checksummed this way. Initially there were
>>>> problems with uninitialized bits in the arrow representation, but that has
>>>> been resolved a while back, so there should be no blocker for this. It
>>>> would also be great to benchmark xxhash against CRC32 and see how they
>>>> compare performance-wise. Initially we used SHA1 but there was non-trivial
>>>> overhead. Maybe there is a better implementation out there (we used
>>>> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
>>>> ).
>>>>
>>>> Best,
>>>> Philipp.
>>>>
>>>> [1]
>>>> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
>>>>
>>>> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Arrow Dev,
>>>>> As we expand the use-cases for Arrow to move it more across system
>>>>> boundaries (Flight) and make it live longer (e.g. in the file format),
>>>>> it seems to make sense to build in a mechanism for data integrity
>>>>> verification (e.g. a checksum like CRC32 or in some cases a
>>>>> cryptographic hash like SHA1).
>>>>>
>>>>> This can be done in a backwards-compatible manner for the actual data
>>>>> buffers by adding metadata to the headers (this could be a use-case for
>>>>> custom metadata but I would prefer to make it explicit). However, to
>>>>> make sure we have full coverage, we would need to augment the stream [1]
>>>>> to be something like:
>>>>>
>>>>> <metadata_size: int32>
>>>>> <metadata_flatbuffer: bytes>
>>>>> <signature_size: int16>
>>>>> <metadata signature>
>>>>> <padding>
>>>>> <message body>
>>>>>
>>>>> I don't think we should require implementations to actually use this
>>>>> functionality but we should make it a possibility (signature size could
>>>>> be zero, meaning no checksum/hash is provided) and have it be
>>>>> standardized if possible.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Sorry if this has already been discussed but I couldn't find anything
>>>>> from searching JIRA or the mailing list archive, and it doesn't look
>>>>> like it is in the format spec.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> [1] https://arrow.apache.org/docs/ipc.html
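
For illustration, the stream layout proposed in the quoted message could be sketched roughly as below. This is a minimal sketch and not an agreed format: the little-endian encoding, the 8-byte alignment of the body, the choice of CRC32 (via Python's zlib.crc32) as the metadata signature, and the names frame_message and read_message are all assumptions made for the example.

```python
import struct
import zlib

ALIGNMENT = 8  # assumption: pad so the message body starts on an 8-byte boundary


def frame_message(metadata: bytes, body: bytes, with_checksum: bool = True) -> bytes:
    """Lay out one message using the field order proposed in the thread."""
    # Assumption: the signature is a CRC32 of the metadata, stored in 4 bytes.
    signature = struct.pack("<I", zlib.crc32(metadata)) if with_checksum else b""
    header = (
        struct.pack("<i", len(metadata))     # <metadata_size: int32>
        + metadata                           # <metadata_flatbuffer: bytes>
        + struct.pack("<h", len(signature))  # <signature_size: int16>
        + signature                          # <metadata signature>
    )
    padding = b"\x00" * (-len(header) % ALIGNMENT)  # <padding>
    return header + padding + body                  # <message body>


def read_message(frame: bytes):
    """Parse a frame and return (metadata, body); raise ValueError on mismatch."""
    (metadata_size,) = struct.unpack_from("<i", frame, 0)
    offset = 4
    metadata = frame[offset:offset + metadata_size]
    offset += metadata_size
    (signature_size,) = struct.unpack_from("<h", frame, offset)
    offset += 2
    signature = frame[offset:offset + signature_size]
    offset += signature_size
    # A signature size of zero means no checksum was provided, so skip the check.
    if signature_size and signature != struct.pack("<I", zlib.crc32(metadata)):
        raise ValueError("metadata checksum mismatch")
    offset += -offset % ALIGNMENT  # skip the padding before the body
    return metadata, frame[offset:]
```

A reader that sees a signature size of zero simply skips verification, which matches the suggestion that implementations should not be required to provide a checksum.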