Great John, I'd be interested to hear about progress.

Also, IMO we should only be focusing on encodings that have the
potential to be exploited for computational benefits (not just
compressibility).  I think this is what distinguishes Arrow from other
formats like Parquet. I think this echoes the sentiments of some others
on this thread.

Cheers,
Micah

On Fri, Jan 24, 2020 at 8:28 AM John Muehlhausen <[email protected]> wrote:

> Thanks Micah, I will see if I can find some time to explore this further.
>
> On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi John,
>> Not Wes, but my thoughts on this are as follows:
>>
>> 1. Alternate bit/byte arrangements can also be useful for processing [1]
>> in addition to compression.
>> 2. I think they are quite a bit more complicated than the existing schemes
>> proposed in [2], so I think it would be more expedient to get the
>> integration hooks necessary to work with simpler encodings before going
>> with something more complex.  I believe the proposal is generic enough to
>> support this type of encoding.
>> 3. For prototyping, this seems like a potential use of the ExtensionType
>> [3] type mechanism already in the specification.
>> 4. I don't think these should be new types or part of the basic Array data
>> structure.  I think having a different container format in the form of
>> "SparseRecordBatch" (or perhaps it should be renamed to
>> EncodedRecordBatch)
>> and keeping the existing types with alternate encodings is a better
>> option.
>>
>> That being said, if you have bandwidth to get this working for C++ and
>> Java we can potentially set up a separate development branch to see how it
>> evolves.  Personally, I've not brought my proposal up for discussion
>> again, because I haven't had bandwidth to work on it, but I still think
>> introducing some level of alternate encodings is a good idea.
>>
>> Cheers,
>> Micah
>>
>>
>> [1]
>> https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
>> [2] https://github.com/apache/arrow/pull/4815
>> [3]
>> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types
>>
>> On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen <[email protected]> wrote:
>>
>> > Wes, what do you think about Arrow supporting a new suite of
>> > fixed-length data types that unshuffle on column->Value(i) calls?  This
>> > would allow memory/swap compressors and memory maps backed by compressing
>> > filesystems (ZFS) or block devices (VDO) to operate more efficiently.
>> >
>> > By doing it with new datatypes there is no separate flag to check?
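>> >
>> > To make that concrete, a minimal sketch of such an accessor (plain C++
>> > over a raw shuffled buffer; the class name and layout here are only
>> > illustrative, not an existing Arrow API):
>> >
>> >   #include <cstdint>
>> >   #include <cstring>
>> >
>> >   // Hypothetical reader over a byte-shuffled buffer of 32-bit values:
>> >   // the buffer stores byte 0 of every value, then byte 1 of every
>> >   // value, and so on (the HDF5 shuffle layout).
>> >   class ShuffledInt32Reader {
>> >    public:
>> >     ShuffledInt32Reader(const uint8_t* data, int64_t length)
>> >         : data_(data), length_(length) {}
>> >
>> >     // Stitch value i back together at call time, like Value(i) would.
>> >     int32_t Value(int64_t i) const {
>> >       uint8_t bytes[4];
>> >       for (int b = 0; b < 4; ++b) {
>> >         bytes[b] = data_[b * length_ + i];  // byte-plane b, element i
>> >       }
>> >       int32_t out;
>> >       std::memcpy(&out, bytes, sizeof(out));
>> >       return out;
>> >     }
>> >
>> >    private:
>> >     const uint8_t* data_;  // shuffled buffer, 4 * length_ bytes
>> >     int64_t length_;       // number of values
>> >   };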
>> >
>> > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <[email protected]> wrote:
>> >
>> > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <[email protected]> wrote:
>> > > >
>> > > > Again, I know very little about Parquet, so your patience is
>> > > > appreciated.
>> > > >
>> > > > At the moment I can Arrow/mmap a file without having anywhere nearly
>> > > > as much available memory as the file size.  I can visit random places
>> > > > in the file (such as a binary search if it is ordered) and only the
>> > > > locations visited by column->Value(i) are paged in.  Paging them out
>> > > > happens without my awareness, if necessary.
>> > > >
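>> > > > For reference, the access pattern I mean, sketched against the current
>> > > > Result-based C++ API (the file name, column type, and index below are
>> > > > only placeholder assumptions):
>> > > >
>> > > >   #include <arrow/array.h>
>> > > >   #include <arrow/io/file.h>
>> > > >   #include <arrow/ipc/reader.h>
>> > > >   #include <arrow/result.h>
>> > > >   #include <memory>
>> > > >   #include <string>
>> > > >
>> > > >   // Map an Arrow IPC file and touch only the values we visit; the
>> > > >   // kernel pages data in (and out) behind the scenes.
>> > > >   arrow::Status VisitSome(const std::string& path) {
>> > > >     ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::MemoryMappedFile::Open(
>> > > >                                          path, arrow::io::FileMode::READ));
>> > > >     ARROW_ASSIGN_OR_RAISE(auto reader,
>> > > >                           arrow::ipc::RecordBatchFileReader::Open(file));
>> > > >     ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(0));
>> > > >     auto col = std::static_pointer_cast<arrow::Int64Array>(batch->column(0));
>> > > >     int64_t v = col->Value(col->length() / 2);  // faults in only that page
>> > > >     (void)v;
>> > > >     return arrow::Status::OK();
>> > > >   }
>> > > >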
>> > > > Does Parquet cover this use-case with the same elegance and at least
>> > > > equal efficiency, or are there more copies/conversions?  Perhaps it
>> > > > requires the entire file to be transformed into Arrow memory at the
>> > > > beginning?  Or on a batch/block basis?  Or, to get this, do I need to
>> > > > use a non-Arrow API for data element access?  Etc.
>> > >
>> > > Data has to be materialized / deserialized from the Parquet file on a
>> > > batch-wise per-column basis. The APIs we provide allow batches of
>> > > values to be read for a given subset of columns.
>> > >
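>> > > For example, reading one row group of a couple of columns with the C++
>> > > API looks roughly like this (the file name and column indices are
>> > > placeholders):
>> > >
>> > >   #include <arrow/io/file.h>
>> > >   #include <arrow/memory_pool.h>
>> > >   #include <arrow/table.h>
>> > >   #include <memory>
>> > >   #include <parquet/arrow/reader.h>
>> > >   #include <parquet/exception.h>
>> > >
>> > >   void ReadSubset() {
>> > >     std::shared_ptr<arrow::io::ReadableFile> infile;
>> > >     PARQUET_ASSIGN_OR_THROW(infile,
>> > >                             arrow::io::ReadableFile::Open("data.parquet"));
>> > >
>> > >     std::unique_ptr<parquet::arrow::FileReader> reader;
>> > >     PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
>> > >         infile, arrow::default_memory_pool(), &reader));
>> > >
>> > >     // Decodes (decompresses and materializes) only these columns of
>> > >     // this row group into Arrow memory.
>> > >     std::shared_ptr<arrow::Table> table;
>> > >     PARQUET_THROW_NOT_OK(reader->ReadRowGroup(0, {0, 2}, &table));
>> > >   }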
>> > > >
>> > > > IFF it covers the above use-case, which does not mention compression
>> > > > or encoding, then I could consider whether it is interesting on those
>> > > > points.
>> > >
>> > > My point really has to do with Parquet's design which is about
>> > > reducing file size. In the following blog post
>> > >
>> > > https://ursalabs.org/blog/2019-10-columnar-perf/
>> > >
>> > > I examined a dataset which is about 4GB as raw Arrow stream/file but
>> > > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
>> > > if you are working with filesystems that yield < 500MB/s (which
>> > > includes pretty much all cloud filesystems AFAIK). In clickstream
>> > > analytics this kind of compression ratio is not unusual.
>> > >
>> > > >
>> > > > -John
>> > > >
>> > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <[email protected]> wrote:
>> > > >
>> > > > > What's the point of having zero copy if the OS is doing the
>> > > > > decompression in the kernel (which defeats the zero-copy argument)?
>> > > > > You might as well just use Parquet without filesystem compression. I
>> > > > > would prefer a compression algorithm that the columnar engine can
>> > > > > benefit from [1] rather than marginally improving a filesystem- or
>> > > > > OS-specific feature.
>> > > > >
>> > > > > François
>> > > > >
>> > > > > [1] Section 4.3, http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <[email protected]>
>> > wrote:
>> > > > > >
>> > > > > > This could also have utility in memory via things like zram/zswap,
>> > > > > > right?  Mac also has a memory compressor?
>> > > > > >
>> > > > > > I don't think Parquet is an option for me unless the integration
>> > > > > > with Arrow is tighter than I imagine (i.e. zero-copy).  That said,
>> > > > > > I confess I know next to nothing about Parquet.
>> > > > > >
>> > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <[email protected]> wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > > On 23/01/2020 at 18:16, John Muehlhausen wrote:
>> > > > > > > > Perhaps related to this thread, are there any current or
>> > > > > > > > proposed tools to transform columns for fixed-length data types
>> > > > > > > > according to a "shuffle"?  For precedent see the implementation
>> > > > > > > > of the shuffle filter in hdf5.
>> > > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
>> > > > > > > >
>> > > > > > > > For example, the column (length 3) would store bytes
>> > > > > > > > 00 00 00 00 00 00 00 00 00 01 02 03 to represent the three
>> > > > > > > > 32-bit numbers 00 00 00 01 00 00 00 02 00 00 00 03 (I'm writing
>> > > > > > > > big-endian even if that is not actually the case).
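>> > > > > > > >
>> > > > > > > > A rough sketch of that forward transform (plain C++, nothing
>> > > > > > > > Arrow-specific; the function name is only illustrative):
>> > > > > > > >
>> > > > > > > >   #include <cstdint>
>> > > > > > > >   #include <vector>
>> > > > > > > >
>> > > > > > > >   // Write byte 0 of every value, then byte 1 of every value,
>> > > > > > > >   // and so on (the HDF5 shuffle layout).
>> > > > > > > >   std::vector<uint8_t> ShuffleInt32(const std::vector<int32_t>& values) {
>> > > > > > > >     const size_t n = values.size();
>> > > > > > > >     std::vector<uint8_t> out(n * sizeof(int32_t));
>> > > > > > > >     const auto* in = reinterpret_cast<const uint8_t*>(values.data());
>> > > > > > > >     for (size_t i = 0; i < n; ++i) {
>> > > > > > > >       for (size_t b = 0; b < sizeof(int32_t); ++b) {
>> > > > > > > >         out[b * n + i] = in[i * sizeof(int32_t) + b];
>> > > > > > > >       }
>> > > > > > > >     }
>> > > > > > > >     return out;
>> > > > > > > >   }
>> > > > > > > >
>> > > > > > > > For {1, 2, 3} this groups the nine zero bytes together, which is
>> > > > > > > > what produces the long runs a filesystem compressor can exploit.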
>> > > > > > > >
>> > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata
>> > > > > > > > flag that the column is shuffled, stitching the bytes back
>> > > > > > > > together at call time.
>> > > > > > > >
>> > > > > > > > Thus if the column pages were backed by a memory map to
>> > > > > > > > something like zfs/gzip-9 (my actual use-case), one would expect
>> > > > > > > > approx 30% savings in underlying disk usage due to better run
>> > > > > > > > lengths.
>> > > > > > > >
>> > > > > > > > It would enable a space/time tradeoff that could be useful?  The
>> > > > > > > > filesystem itself cannot easily do this particular compression
>> > > > > > > > transform since it benefits from knowing the shape of the data.
>> > > > > > >
>> > > > > > > For the record, there's a pull request adding this encoding to
>> > > > > > > the Parquet C++ specification.
>> > > > > > >
>> > > > > > > Regards
>> > > > > > >
>> > > > > > > Antoine.
>> > > > >
>> > >
>> >
>>
>
