Agreed, I think it would be useful to make sure the "compute" interfaces have the right hooks to support alternate encodings.
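
To make the idea of "hooks" concrete, here is a minimal sketch in plain Python (not Arrow code) of the RLE example from the quoted message below. It assumes a hypothetical RunLengthColumn type. Its total() method stands in for a compute function that works directly on the runs, and decode() stands in for the conversion to non-RLE form before writing an IPC message. All names here are illustrative assumptions and do not correspond to any existing Arrow API.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RunLengthColumn:
    """Hypothetical in-memory RLE column: values[i] repeats run_lengths[i] times."""

    values: List[int]
    run_lengths: List[int]

    @classmethod
    def encode(cls, data: Sequence[int]) -> "RunLengthColumn":
        values: List[int] = []
        run_lengths: List[int] = []
        for item in data:
            if values and values[-1] == item:
                run_lengths[-1] += 1  # extend the current run
            else:
                values.append(item)  # start a new run
                run_lengths.append(1)
        return cls(values, run_lengths)

    def decode(self) -> List[int]:
        """Expand to a plain list, e.g. just before serializing for transport."""
        out: List[int] = []
        for value, count in zip(self.values, self.run_lengths):
            out.extend([value] * count)
        return out

    def total(self) -> int:
        """Example compute hook: sum the column without decoding it."""
        return sum(v * n for v, n in zip(self.values, self.run_lengths))


column = RunLengthColumn.encode([7, 7, 7, 2, 2, 9])
print(column.values, column.run_lengths)
print(column.total() == sum(column.decode()))
```

The point of the sketch is only that a kernel such as a sum can consume the encoded form directly, while transport code can fall back to decode() until the format specifies how encoded data is represented on the wire.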

On Sunday, August 30, 2020, Wes McKinney <wesmck...@gmail.com> wrote:

> That said, there is nothing preventing the development of programming
> interfaces for compressed / encoded data right now. When it comes to
> transporting such data, that's when we will have to decide on what to
> support and what new metadata structures are required.
>
> For example, we could add RLE to C++ in prototype form and then
> convert to non-RLE when writing to IPC messages.
>
> On Sat, Aug 29, 2020 at 7:34 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Hi Mark,
> > See the most recent previous discussion about alternate encodings [1].
> > This is something that in the long run should be added, but I'd
> > personally prefer to start with simpler encodings.
> >
> > I don't think we should add anything more with regard to
> > compression/encoding until at least 3 languages support the current
> > compression methods that are in the specification. C++ has it
> > implemented, there is some work in Java, and I think we should have
> > at least one more.
> >
> > -Micah
> >
> > [1]
> > https://lists.apache.org/thread.html/r1d9d707c481c53c13534f7c72d75c7a90dc7b2b9966c6c0772d0e416%40%3Cdev.arrow.apache.org%3E
> >
> > On Sat, Aug 29, 2020 at 4:04 PM <m...@markfarnan.com> wrote:
> > >
> > > I was looking at compression in Arrow and had a couple of questions.
> > >
> > > As I understand compression currently, it is only used 'in flight'
> > > in either IPC or Arrow Flight, using a block compression, but the
> > > data is still decoded into RAM at the destination in full array
> > > form. Is this correct?
> > >
> > > Given that Arrow is a columnar format, has any thought been given
> > > to an option to have the data compressed both in memory and in
> > > flight, using some of the columnar techniques?
> > >
> > > As I deal primarily with time-series numerical data, I was thinking
> > > that some of the algorithms from the Gorilla paper [1] for floats
> > > and timestamps (delta-of-delta) or similar might be appropriate.
> > >
> > > The interface functions could still iterate over the data and
> > > produce raw values, so this is transparent to users of the data,
> > > but the data blocks/arrays in memory are actually compressed.
> > >
> > > With this method, blocks could come out of a database/source,
> > > through the data service, across the wire (Flight) and land in the
> > > consuming application's memory without ever being decompressed or
> > > processed until final use.
> > >
> > > Crazy thought?
> > >
> > > Regards
> > >
> > > Mark.
> > >
> > > [1]: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
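
For reference, the delta-of-delta idea mentioned in the original question can be sketched as follows. This is a simplified illustration in plain Python covering only the integer differencing step. The Gorilla paper additionally packs the resulting small values into variable-length bit fields, which is omitted here. The function names are hypothetical and are not part of Arrow.

```python
from typing import List


def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    """Keep the first timestamp, then the first delta, then each change in delta."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0  # so the second output element is the first delta itself
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)
        prev_delta = delta
    return encoded


def delta_of_delta_decode(encoded: List[int]) -> List[int]:
    """Invert delta_of_delta_encode by accumulating deltas, then timestamps."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for change in encoded[1:]:
        delta += change
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Roughly regular 60-second samples with one late arrival.
samples = [1000, 1060, 1120, 1181, 1241]
encoded = delta_of_delta_encode(samples)
print(encoded)
print(delta_of_delta_decode(encoded) == samples)
```

For regularly spaced timestamps most encoded values are zero or close to it, which is what makes the subsequent bit-packing step effective.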