Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Antoine Pitrou Sat, 06 Jul 2019 11:42:11 -0700


Hi Micah,


On 05/07/2019 at 20:53, Micah Kornfield wrote:
> 
> Going into more details on the specific features in the PR:
> 
>    1.
> 
>    Sparse encodings for arrays and buffers.  The guiding principles behind
>    the suggested encodings are to support encodings that can be exploited by
>    compute engines for more efficient computation (I don’t think parquet style
>    bit-packing belongs in Arrow).

How does "more efficient computation" play out for operations such as
hash or join?

>    2.
> 
>    Data compression.  Similar to encodings but compression is solely for
>    reduction of data at rest/on the wire.  The proposal is to allow
>    compression of individual buffers. Right now zstd is proposed, but I don’t
>    feel strongly on the specific technologies here.

Is it useful at the Arrow format level? Any transmission layer can add
its own compression, especially a general-purpose one such as zstd or lz4.
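
To make the point concrete, here is a minimal sketch of compression applied
purely at the transmission layer. Everything in it is an assumption for
illustration: `send_compressed` and `receive_compressed` are hypothetical
helpers, not Arrow APIs, and zlib stands in for zstd or lz4 only because it
ships with the Python standard library.

```python
import zlib


def send_compressed(payload: bytes) -> bytes:
    """Hypothetical transport-side helper: compress an opaque payload."""
    return zlib.compress(payload)


def receive_compressed(wire_bytes: bytes) -> bytes:
    """Hypothetical receiving side: restore the original payload."""
    return zlib.decompress(wire_bytes)


# Stand-in for a serialized message; the transport never looks inside it.
message = bytes(1000) + b"arrow-payload"

wire = send_compressed(message)
restored = receive_compressed(wire)
```

The payload format needs no knowledge of the codec here, which is the
argument for leaving compression out of the format. What this approach
cannot do is compress individual buffers selectively, as the proposal
suggests.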

>    4.
> 
>    Data Integrity.  While the arrow file format isn’t meant for archiving
>    data, I think it is important to allow for optional native data integrity
>    checks in the format.  To this end, I proposed a new “Digest” message type
>    that can be added after other messages to record a digest/hash of the
>    preceding data. I suggested xxhash, but I don’t have a strong opinion here,
>    as long as there is some minimal support that can potentially be expanded
>    later.

This sounds potentially useful, though one question is whether this
occurs at the table level, column level, sequential array level, etc.
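
As a rough illustration of why the granularity matters, the sketch below
compares one digest per buffer with a single digest over all buffers. It is
only a sketch under stated assumptions: the buffer names and contents are
made up, and blake2b from the standard library stands in for xxhash, which
is not part of the Python standard library.

```python
import hashlib


def digest(data: bytes) -> str:
    # blake2b is a stand-in for xxhash (assumption for this sketch).
    return hashlib.blake2b(data, digest_size=8).hexdigest()


# Hypothetical buffers of a single column.
buffers = {"validity": b"\xff\x0f", "values": bytes(range(16))}

# Fine-grained: one digest per buffer.
per_buffer = {name: digest(buf) for name, buf in buffers.items()}

# Coarse-grained: one digest over everything that precedes it.
whole_message = digest(b"".join(buffers.values()))

# Simulate corruption of one buffer.
corrupted = dict(buffers, values=b"\x00" * 16)

# Per-buffer digests identify which buffer is damaged.
damaged = [name for name, buf in corrupted.items()
           if digest(buf) != per_buffer[name]]

# The coarse digest only reports that something changed.
message_intact = digest(b"".join(corrupted.values())) == whole_message
```

Finer granularity localizes the damage at the cost of storing and checking
more digests; a coarser digest is cheaper but only says that something in
the preceding data changed.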

> As a practical matter the proposal represents a lot of work to get an MVP
> working in time for 1.0.0 release (provided they are accepted by the
> community), so I'd greatly appreciate if anyone wants to collaborate on
> this.

I don't think this is workable for 1.0.0.  The plan currently is for
1.0.0 to come out reasonably "quickly" after 0.14.0, i.e. perhaps in 6-8
weeks?

Regards

Antoine.
