All, 

Micah: it appears my google-fu wasn't strong enough to find the previous thread, 
so thanks for pointing it out. 

There is definitely a tradeoff between processing speed and compression; 
however, I feel there is a use case for a 'small in-memory footprint' independent 
of 'high-speed processing'. 
That said, I appreciate the Arrow team may not want to address that, given the 
focus on processing speed (can't be all things to everyone).

Personally, I think adding programming interfaces to handle compressed 
in-memory arrays would be a good thing, in addition to the 'in flight' ones. 
A rough sketch of the shape I have in mind is below.
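
To make that concrete, here is a minimal sketch in Go of what such a read-only 
interface might look like. To be clear, none of these names (CompressedFloat64, 
Value, Range) exist in the Arrow Go library; this is purely the API shape I have 
in mind, where decompression happens per block, on access:

    // Hypothetical read-only interface over a column whose backing
    // buffers stay compressed in memory. None of this exists in the
    // Arrow Go library today; it is just the shape of API I have in mind.
    package sketch

    // CompressedFloat64 exposes plain float64 values while the underlying
    // buffers remain block-compressed; decompression happens per block,
    // on access, so only the blocks actually touched are ever expanded.
    type CompressedFloat64 interface {
        Len() int
        // Value decodes at most one block, for random access (tooltips).
        Value(i int) float64
        // Range decodes only the blocks covering [start, start+n) and
        // streams raw values to fn (forward read, for drawing).
        Range(start, n int, fn func(v float64))
    }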


For reference, my specific use case is handing large datasets [1] of varying 
types [2] to the browser for plotting, including scrolling over them, using WASM 
(currently in Go). 
Both network bandwidth to browsers and browser memory are always problematic, 
especially on mobile devices, hence the desire to compress the data, keep it 
compressed on arrival, and minimize the number of in-memory copies needed.

Access to the data is either: 
 A: Forward read from a certain point, for a range, to draw (that point and 
range change with scroll and zoom).
 B: Random access for tooltips (the value of 'n' columns at index 'y'). 
   Both can potentially be efficient enough, given a suitable choice of block 
sizes (or other internal boundaries) and search method; see the sketch below.
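
As a rough illustration (the block layout and names here are hypothetical, not 
anything in Arrow today), both patterns reduce to a binary search over per-block 
starting row indexes, followed by decoding only the blocks touched:

    // Hypothetical block index: binary-search per-block starting row
    // indexes, then decode only the blocks actually needed.
    package sketch

    import "sort"

    // blockStarts[b] is the row index of the first value in block b,
    // e.g. {0, 4096, 8192, ...} for fixed 4k-row blocks.
    type blockIndex struct {
        blockStarts []int
    }

    // blockFor returns the block containing row i: one binary search
    // plus one block decode covers pattern B (tooltips).
    func (ix *blockIndex) blockFor(i int) int {
        // smallest block whose start is past i, minus one
        return sort.Search(len(ix.blockStarts), func(b int) bool {
            return ix.blockStarts[b] > i
        }) - 1
    }

    // blocksFor returns the block range covering rows [start, start+n):
    // pattern A (drawing) decodes only the blocks in view.
    func (ix *blockIndex) blocksFor(start, n int) (first, last int) {
        return ix.blockFor(start), ix.blockFor(start + n - 1)
    }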


Note: compression potentially makes my 'other' problem even harder: the best 
method for appending inbound realtime sensor data into the in-memory model. 
Still thinking about that one, though one observation is below.
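
For what it's worth, the Gorilla-style delta-of-delta encoding I mention in my 
original mail below is naturally append-only, which may make the realtime-append 
problem more tractable: new samples always land at the tail of the open block. 
A minimal sketch of just the arithmetic (the paper bit-packs these small 
integers on top of this; that part is omitted here):

    // Gorilla-style delta-of-delta for timestamps: each append stores
    // only the change in the delta, so new samples always go at the
    // tail of the open block. Real Gorilla bit-packs these; omitted.
    package sketch

    type dodEncoder struct {
        prevTS    int64
        prevDelta int64
        n         int
        out       []int64 // variable-width bits in a real encoder
    }

    // Append encodes one timestamp. For regularly sampled sensor data
    // the delta-of-delta is almost always zero, which compresses well.
    func (e *dodEncoder) Append(ts int64) {
        switch e.n {
        case 0:
            e.out = append(e.out, ts) // first timestamp stored raw
        case 1:
            e.prevDelta = ts - e.prevTS
            e.out = append(e.out, e.prevDelta)
        default:
            delta := ts - e.prevTS
            e.out = append(e.out, delta-e.prevDelta)
            e.prevDelta = delta
        }
        e.prevTS = ts
        e.n++
    }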

Regards

Mark. 


[1]  'Large' is obviously relative: in this case, a single plot may have 20-50 
separate time series, each with between 20k and 10 million points.

[2]  The data is often index: time, value: float, OR index: float (a length 
measure), value: float. But not always: the value could be one of 
int(8/16/32/64), float(32/64), string, vector(float32/64), etc. This is why 
I like Arrow as the standard 'format' for this data, as all of these can be 
safely encoded within it.



-----Original Message-----
From: Micah Kornfield <emkornfi...@gmail.com> 
Sent: Sunday, August 30, 2020 6:20 PM
To: Wes McKinney <wesmck...@gmail.com>
Cc: dev <dev@arrow.apache.org>
Subject: Re: Compression in Arrow - Question

Agreed, I think it would be useful to make sure the "compute" interfaces have 
the right hooks to support alternate encodings.

On Sunday, August 30, 2020, Wes McKinney <wesmck...@gmail.com> wrote:

> That said, there is nothing preventing the development of programming 
> interfaces for compressed / encoded data right now. When it comes to 
> transporting such data, that's when we will have to decide on what to 
> support and what new metadata structures are required.
>
> For example, we could add RLE to C++ in prototype form and then 
> convert to non-RLE when writing to IPC messages.
>
> On Sat, Aug 29, 2020 at 7:34 PM Micah Kornfield 
> <emkornfi...@gmail.com>
> wrote:
> >
> > Hi Mark,
> > See the most recent previous discussion about alternate encodings [1].
> > This is something that should be added in the long run; I'd
> > personally prefer to start with simpler encodings.
> >
> > I don't think we should add anything more with regard to
> > compression/encoding until at least 3 languages support the current
> > compression methods that are in the specification. C++ has it
> > implemented, there is some work in Java, and I think we should have
> > at least one more.
> >
> > -Micah
> >
> > [1]
> > https://lists.apache.org/thread.html/r1d9d707c481c53c13534f7c72d75c7a90dc7b2b9966c6c0772d0e416%40%3Cdev.arrow.apache.org%3E
> >
> > On Sat, Aug 29, 2020 at 4:04 PM <m...@markfarnan.com> wrote:
> >
> > >
> > > I was looking at compression in Arrow and had a couple of questions.
> > >
> > > If I've understood compression correctly, it is currently only used
> > > 'in flight', in either IPC or Arrow Flight, using block compression,
> > > but still decoded into RAM at the destination in full array form.
> > > Is this correct?
> > >
> > >
> > > Given that Arrow is a columnar format, has any thought been given
> > > to an option to have the data compressed both in memory and in
> > > flight, using some of the columnar techniques?
> > > As I deal primarily with time-series numerical data, I was thinking
> > > some of the algorithms from the Gorilla paper [1] for floats and
> > > timestamps (delta-of-delta), or similar, might be appropriate.
> > >
> > > The interface functions could still iterate over the data and
> > > produce raw values, so this is transparent to users of the data,
> > > while the data blocks/arrays in memory are actually compressed.
> > >
> > > With this method, blocks could come out of a database/source,
> > > through the data service, across the wire (Flight), and land in the
> > > consuming application's memory without ever being decompressed or
> > > processed until final use.
> > >
> > >
> > > Crazy thought?
> > >
> > >
> > > Regards
> > >
> > > Mark.
> > >
> > >
> > > [1]: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
> > >
> > >
>
