Hi Wes,
Thanks for the response.  I was thinking being able to checksum
everything.  I agree it should be off by default.  I'll put this on the
back burner for now.   If I can find some spare time (which won't likely be
any time soon), I'll submit a PR for further discussion.

Cheers,
Micah

On Wed, Mar 6, 2019 at 7:07 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah,
>
> It seems like the checksum could be included in the Message flatbuffer
> table instead of having to add things to the protocol
>
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L94
>
> Am I correct that computing a checksum on the message body is what is
> mainly of interest? Beyond that I feel like we're getting into the
> realm of truly esoteric issues
>
> Note that checksums are part of the Parquet file specification but
> IIUC seldom used, so our mileage may vary on how useful this is. I
> would probably want this turned off by default
>
> - Wes
>
> On Wed, Mar 6, 2019 at 12:07 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Doing some light research it looks xxhash has better cross-platform
> support
> > as is faster then a vanilla implementation of crc32 [1].  However, crc32c
> > (a slightly different crc32 algorithm) is hardware accelerated on newer
> > (circa 2016) Intel CPUs [2] and is potentially faster.
> >
> > [1] https://cyan4973.github.io/xxHash/
> > [2] https://github.com/Cyan4973/xxHash/issues/62
> >
> > On Tue, Mar 5, 2019 at 9:55 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Thanks Philipp,
> > >
> > > Yeah, I probably shouldn't have said SHA1 either :)    I'm not too
> > > concerned with a particular hash/checksum implementation.  It would be
> good
> > > to have at least 1 or 2 well supported ones, and a migration path to
> > > support more if necessary without breaking file/streaming formats for
> > > backwards compatibility.
> > >
> > > Best,
> > > -Micah
> > >
> > > On Tue, Mar 5, 2019 at 9:47 PM Philipp Moritz <pcmor...@gmail.com>
> wrote:
> > >
> > >> (I meant to say SHA256 instead of SHA1)
> > >>
> > >> On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmor...@gmail.com>
> wrote:
> > >>
> > >>> Hey Micah,
> > >>>
> > >>> in plasma, we are using xxhash to compute a hash/checksum [1] (it is
> > >>> computed in parallel using multiple threads) and have good
> experience with
> > >>> it -- all data in Ray is checksummed this way. Initially there were
> > >>> problems with uninitialized bits in the arrow representation, but
> that has
> > >>> been resolved a while back, so there should be no blocker for this.
> It
> > >>> would also be great to benchmark xxhash against CRC32 and see how
> they
> > >>> compare performance wise. Initially we used SHA1 but there was
> non-trivial
> > >>> overhead. Maybe there is a better implementation out there (we used
> > >>>
> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
> > >>> ).
> > >>>
> > >>> Best,
> > >>> Philipp.
> > >>>
> > >>> [1]
> > >>>
> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
> > >>>
> > >>> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi Arrow Dev,
> > >>>> As we expand the use-cases for Arrow to move it more across system
> > >>>> boundaries (Flight) and make it live longer (e.g. in the file
> format),
> > >>>> it
> > >>>> seems to make sense to build in a mechanism for data integrity
> > >>>> verification
> > >>>> (e.g. a checksum like CRC32 or in some cases a cryptographic hash
> like
> > >>>> SHA1).
> > >>>>
> > >>>> This can be done a backwards compatible manner for the actual data
> > >>>> buffers
> > >>>> by adding metadata to the headers (this could be a use-case for
> custom
> > >>>> metadata but I would prefer to make it explicit).  However, to make
> > >>>> sure we
> > >>>> have full coverage, we would need to augment the stream [1] to be
> > >>>> something
> > >>>> like:
> > >>>>
> > >>>> <metadata_size: int32>
> > >>>> <metadata_flatbuffer: bytes>
> > >>>> <signature_size: int16>
> > >>>> <metadata signature>
> > >>>> <padding>
> > >>>> <message body>
> > >>>>
> > >>>> I don't think we should require implementations to actual use this
> > >>>> functionality but we should make it a possibility (signature size
> could
> > >>>> be
> > >>>> zero meaning no checksum/hash is provided) and have it be
> standardized
> > >>>> if
> > >>>> possible.
> > >>>>
> > >>>> Thoughts?
> > >>>>
> > >>>> Sorry if this has already been discussed but I could find anything
> from
> > >>>> searching JIRA or the mailing list archive, and it doesn't look
> like it
> > >>>> is
> > >>>> in the format spec.
> > >>>>
> > >>>> Thanks,
> > >>>> Micah
> > >>>>
> > >>>> [1] https://arrow.apache.org/docs/ipc.html
> > >>>>
> > >>>
>

Reply via email to