>
> I'm currently working on adding Run-Length encoding to arrow.

Nice


> What are the intended use cases for this:
> - external engines want to provide run-length encoded data to work on
> using arrow?
>
It is more than just external engines. Many popular file formats support
RLE encoding. Being able to natively transfer this into Arrow memory
without exploding it saves CPU time. Similarly, for things like Flight it
can dramatically decrease the amount of data on the wire.


> - more efficient ACERO (and possibly somewhat simplified, since we can
> now use RLE arrays to replace Scalar Datums)


It can certainly improve efficiency in some cases, and if we don't have
optimizations for RLE then there is less benefit to surfacing it from file
formats, etc.

> What data type should we use for the run-length values? int32 would
> save memory, but force us to encode very long arrays as multiple
> arrays, compared to int64. Or should we support different types?

I think we should target one type for now (int32 seems pretty reasonable
to me), but have a model that allows flexibility in case we need to add
more types or other schemes (e.g. delta encoding) later.
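
To make that concrete, here is a rough standalone sketch (plain C++, not
the Arrow API; the accumulated run-ends layout is just my reading of the
proposal) of what int32 run lengths would look like:

// Sketch of the accumulated-run-ends representation (plain C++, not
// Arrow code; names and layout are assumptions, not the actual patch).
#include <cstdint>
#include <iostream>
#include <limits>
#include <vector>

struct RleSketch {
  std::vector<int32_t> values;    // child array with duplicates removed
  std::vector<int32_t> run_ends;  // accumulated run lengths (end offsets)
};

RleSketch Encode(const std::vector<int32_t>& input) {
  RleSketch out;
  for (size_t i = 0; i < input.size(); ++i) {
    if (out.values.empty() || input[i] != out.values.back()) {
      out.values.push_back(input[i]);
      out.run_ends.push_back(static_cast<int32_t>(i + 1));
    } else {
      out.run_ends.back() = static_cast<int32_t>(i + 1);
    }
  }
  return out;
}

int main() {
  // [7, 7, 7, 4, 4, 9]  ->  values [7, 4, 9], run ends [3, 5, 6]
  RleSketch rle = Encode({7, 7, 7, 4, 4, 9});
  for (size_t i = 0; i < rle.values.size(); ++i) {
    std::cout << rle.values[i] << " ends at " << rle.run_ends[i] << "\n";
  }
  // With int32 run ends the logical length of one array is capped at
  // INT32_MAX (~2.1 billion values); longer data would need chunking.
  std::cout << std::numeric_limits<int32_t>::max() << "\n";
  return 0;
}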


> - Should we allow multiple runs of the same value following each other?
> Otherwise we would either need a pass to correct this after a lot of
> operations, or make RLE-aware versions of their kernels.

Is there any benefit you see in disallowing it?
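
Just to make sure we are talking about the same situation, here is a rough
sketch (plain C++, not Arrow code, hypothetical names) of how adjacent
equal runs can appear after an element-wise kernel and how a normalization
pass would merge them:

// Sketch: adjacent equal runs appearing after an element-wise operation,
// plus a pass that merges them (plain C++, not Arrow code).
#include <cstdint>
#include <vector>

struct Rle {
  std::vector<int32_t> values;
  std::vector<int32_t> run_ends;  // accumulated run lengths
};

// Merge neighbouring runs that hold the same value.
Rle Normalize(const Rle& in) {
  Rle out;
  for (size_t i = 0; i < in.values.size(); ++i) {
    if (!out.values.empty() && out.values.back() == in.values[i]) {
      out.run_ends.back() = in.run_ends[i];  // extend the previous run
    } else {
      out.values.push_back(in.values[i]);
      out.run_ends.push_back(in.run_ends[i]);
    }
  }
  return out;
}

int main() {
  // values [2, 3], run ends [4, 6]; a "clamp to at least 3" kernel maps
  // both runs to value 3, leaving two adjacent equal runs:
  Rle clamped{{3, 3}, {4, 6}};
  Rle merged = Normalize(clamped);  // values [3], run ends [6]
  return merged.values.size() == 1 ? 0 : 1;
}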


> > - Should RLE really be a data type? Dictionary encoding set an example
> > for this, but just like it, RLE is actually a different encoding for
> > arrays of the same data type.


> Well, Arrow C++ does not have a notion of encoding distinct from the
> data type. Adding such a notion would risk breaking compatibility for
> all existing software that hasn't been upgraded to dispatch based on
> encoding.


When I thought about this [1], making a separate message type in the
format and modelling encoding separately from type made sense to me.


[1] https://github.com/apache/arrow/pull/4815





On Tue, May 31, 2022 at 12:13 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi,
>
> Le 31/05/2022 à 20:24, Tobias Zagorni a écrit :
> > Hi, I'm currently working on adding Run-Length encoding to arrow. I
> > created a function to dictionary-encode arrays here (currently only for
> > fixed length types):
> > https://github.com/apache/arrow/compare/master...zagto:rle?expand=1
> >
> > The general idea is that RLE data will be a nested data type, with a
> > single child holding a regular ArrayData of the type of the values, but
> > with the duplicate values removed. The parent contains a single buffer
> > of uint64 values representing the run lengths: each entry holds the
> > accumulated length of all runs from the first up to the current one.
>
> That's an interesting approach. How do nulls work? Is the null bitmap
> stored in the child only?
>
> > What are the intended use cases for this:
> > - external engines want to provide run-length encoded data to work on
> > using arrow?
> > - more efficient ACERO (and possibly somewhat simplified, since we can
> > now use RLE arrays to replace Scalar Datums)
>
> I don't think replacing Scalar compute paths with dedicated paths for
> RLE-encoded data would ever be a simplification. Also, when a kernel
> hasn't been upgraded with a native path for RLE data, former Scalar
> Datums would now be expanded to the full RLE-decoded version before
> running the kernel...
>
> > Automatic kernel dispatch:
> > - Scalar kernels would likely just work on RLE data
>
> Unary scalar kernels may indeed just work, but n-ary kernels (where n >
> 1) would not, except in the happy case where run lengths are the same in
> all arguments.
>
> > - How should it be implemented? The current place for logic like that
> > seems to be the DispatchBest methods. These only allow replacing the
> > input types, but for this RLE scheme we would need to make the kernel
> > work on a child array of the input. This mechanism would likely need to
> > be extended a lot.
> > - For kernels which don't work on RLE data, should we automatically
> > call Encode and Decode kernels?
>
> So, as a data point, most kernels currently don't have a native path for
> dictionary-encoded data; instead, they decode the data before working on
> it.
>
> For the same reason (lack of manpower), chances are that few kernels
> would ever get a native path for RLE-encoded data (or it would take a
> long time before the most common kernels are upgraded with such a native
> path).
>
> > - Should RLE really be a data type? Dictionary encoding set an example
> > for this, but just like it, RLE is actually a different encoding for
> > arrays of the same data type.
>
> Well, Arrow C++ does not have a notion of encoding distinct from the
> data type. Adding such a notion would risk breaking compatibility for
> all existing software that hasn't been upgraded to dispatch based on
> encoding.
>
> Regards
>
> Antoine.
>
