Hi Micah,

On 05/07/2019 at 20:53, Micah Kornfield wrote:
>
> Going into more details on the specific features in the PR:
>
>    1. Sparse encodings for arrays and buffers. The guiding principles behind
>    the suggested encodings are to support encodings that can be exploited by
>    compute engines for more efficient computation (I don’t think parquet style
>    bit-packing belongs in Arrow).

How does "more efficient computation" play out for operations such as
hash or join?

>    2. Data compression. Similar to encodings but compression is solely for
>    reduction of data at rest/on the wire. The proposal is to allow
>    compression of individual buffers. Right now zstd is proposed, but I don’t
>    feel strongly on the specific technologies here.

Is it useful at the Arrow format level? Any transmission layer can add
its own compression, especially a general-purpose one such as zstd or lz4.

>    4. Data Integrity. While the arrow file format isn’t meant for archiving
>    data, I think it is important to allow for optional native data integrity
>    checks in the format. To this end, I proposed a new “Digest” message type
>    that can be added after other messages to record a digest/hash of the
>    preceding data. I suggested xxhash, but I don’t have a strong opinion here,
>    as long as there is some minimal support that can potentially be expanded
>    later.

This sounds potentially useful, though one question is whether this
occurs at the table level, column level, sequential array level, etc.

> As a practical matter the proposal represents a lot of work to get an MVP
> working in time for 1.0.0 release (provided they are accepted by the
> community), so I'd greatly appreciate if anyone wants to collaborate on
> this.

I don't think this is workable for 1.0.0. The plan currently is for
1.0.0 to come out reasonably "quickly" after 0.14.0, i.e. perhaps in 6-8
weeks?

Regards

Antoine.