If we adopt the position (as we already do in practice, I think) that
the encapsulated IPC message format is the main way that we expose
data from one process to another, then having digests at the message
level seems like the simplest and most useful thing.
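To make the message-level idea concrete, here is a minimal sketch (plain Python, not the actual Arrow IPC framing; the framing layout and helper names are assumptions for illustration): a digest is computed over the entire encapsulated message and appended, and a reader verifies it before decoding anything.

```python
import struct
import zlib

# Hypothetical framing: append a little-endian 4-byte CRC32 of the whole
# encapsulated message (metadata + body) after the message bytes.
def frame_message(payload: bytes) -> bytes:
    digest = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack("<I", digest)

def verify_message(framed: bytes) -> bytes:
    payload = framed[:-4]
    stored = struct.unpack("<I", framed[-4:])[0]
    if (zlib.crc32(payload) & 0xFFFFFFFF) != stored:
        raise ValueError("message digest mismatch: possible corruption")
    return payload

# Stand-in for serialized flatbuffer metadata plus body bytes.
msg = b"\xff\xff\xff\xff...flatbuffer metadata + body..."
assert verify_message(frame_message(msg)) == msg
```

Because the digest covers the whole message, one code path handles every message type — but, as noted below, a single flipped bit anywhere invalidates the entire record batch.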

FWIW, the Parquet format technically provides for CRC checksums, but
they have never been widely implemented, so there is a certain YAGNI
feeling to doing anything complex here.

On Fri, Jul 12, 2019 at 4:30 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
>
> Le 12/07/2019 à 09:56, Micah Kornfield a écrit :
> > Per Antoine's recommendation.  I'm splitting off the discussion about data
> > integrity from the previous e-mail thread about the format additions [1].
> > To re-cap I made a proposal including data integrity [2] by adding a new
> > message type to the
> >
> > From the previous thread the main question was at what level to apply
> > digests to Arrow data (Message level, array, buffer or potentially some
> > hybrid).
> >
> > Some trade-offs I've thought of for each approach:
> > * Message level
> > + Simplest implementation; can be applied across all messages with
> > pretty much the same code.
> > + Smallest amount of additional data (each digest will likely be 8-64 bytes)
> > - It lacks the granularity to recover partial data from a record batch if
> > there is corruption.
>
> Also:
> - Will only apply to transmission errors using the IPC mechanism, not
> other kinds of errors that may occur
>
> > Array level:
> > + Allows for reading non-corrupted columns
> > + Allows for potentially more complicated use cases, like having different
> > compute engines "collaborate" and sign each array they computed to
> > establish a "chain-of-trust"
> > - Adds some implementation complexity. Will need different schemes for
> > message types other than RecordBatch and for message metadata.  We also
> > need to determine digest boundaries (would a complex column be consumed
> > entirely or would child arrays be separate).
>
> Also:
> - Need to compute a new checksum when slicing an array?
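A rough shape for the array-level scheme might look like the following (an illustrative stdlib-only sketch, not Arrow code; per-column serialized bytes stand in for real array buffers, and all names are hypothetical). The point is that each column carries its own digest, so uncorrupted columns remain readable even when a sibling column fails verification:

```python
import hashlib

def column_digests(columns: dict) -> dict:
    """Map column name -> hex digest of that column's serialized bytes."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in columns.items()}

def readable_columns(columns: dict, digests: dict) -> list:
    """Return names of columns whose stored digests still verify."""
    return [name for name, data in columns.items()
            if hashlib.sha256(data).hexdigest() == digests.get(name)]

batch = {"a": b"\x01\x02\x03", "b": b"\x04\x05\x06"}
digests = column_digests(batch)
batch["b"] = b"\x04\x05\xff"             # simulate corruption of one column
print(readable_columns(batch, digests))  # only "a" still verifies
```

This also hints at the slicing problem raised above: a digest over a column's bytes is invalidated by any re-slicing of the underlying data, so sliced arrays would need their checksums recomputed.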
>
> > Buffer level:
> > More or less the same issues as the array level, but with the following
> > other factors:
> > - The most amount of additional data
>
> It's not clear that's much of a problem (currently?), especially if
> checksumming is optional.  Arrow isn't well-suited for use cases with
> many tiny buffers...
>
> > - It's not clear whether there is any benefit to detecting that a single
> > buffer is corrupted if it means we can't accurately decode the array.
>
> Also:
> + decorrelated from logical interpretation of buffer, e.g. slicing
>
> I think the possibility of a hybrid scheme should be discussed as well.
>  For example, compute physical checksums at the buffer level, then
> devise a lightweight formula for the checksum of an array based on those
> physical checksums.  And a formula for an IPC message's checksum based
> on its type (schema, record batch, dictionary...).
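One way the hybrid scheme could be sketched (stdlib only; the combining formula here — hashing the concatenation of per-buffer CRCs — is just one assumed possibility, not a proposal): physical CRC32 checksums are computed per buffer, then a logical array digest is derived purely from those buffer checksums, so the array-level check never has to touch the buffer bytes again.

```python
import hashlib
import struct
import zlib

def buffer_crc(buf: bytes) -> int:
    """Physical checksum of one buffer."""
    return zlib.crc32(buf) & 0xFFFFFFFF

def array_digest(buffer_crcs) -> str:
    """Logical array digest derived only from the per-buffer checksums."""
    h = hashlib.sha256()
    for crc in buffer_crcs:
        h.update(struct.pack("<I", crc))
    return h.hexdigest()

# e.g. an int32 array's buffers: validity bitmap + values
validity = bytes([0b00001111])
values = struct.pack("<4i", 1, 2, 3, 4)
crcs = [buffer_crc(validity), buffer_crc(values)]
print(array_digest(crcs))
```

A message-level checksum could then be built the same way from its constituent array digests, giving per-buffer error localization without checksumming any bytes more than once.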
>
> Regards
>
> Antoine.
