Re: [DISCUSS] Format additions for encoding/compression
Great, John, I'd be interested to hear about progress. Also, IMO we should only be focusing on encodings that have the potential to be exploited for computational benefits (not just compressibility). I think this is what distinguishes Arrow from other formats like Parquet. I think this echoes some others' sentiments on this thread. Cheers, Micah On Fri, Jan 24, 2020 at 8:28 AM John Muehlhausen wrote: > Thanks Micah, I will see if I can find some time to explore this further. > > On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield > wrote: > >> Hi John, >> Not Wes, but my thoughts on this are as follows: >> >> 1. Alternate bit/byte arrangements can also be useful for processing [1] >> in >> addition to compression. >> 2. I think they are quite a bit more complicated then the existing schemes >> proposed in [2], so I think it would be more expedient to get the >> integration hooks necessary to work with simpler encodings before going >> with something more complex. I believe the proposal is generic enough to >> support this type of encoding. >> 3. For prototyping, this seems like a potential use of the ExtensionType >> [3] type mechanism already in the specification. >> 4. I don't think these should be new types or part of the basic Array data >> structure. I think having a different container format in the form of >> "SparseRecordBatch" (or perhaps it should be renamed to >> EncodedRecordBatch) >> and keeping the existing types with alternate encodings is a better >> option. >> >> That being said if you have bandwidth to get this working for C++ and Java >> we can potentially setup a separate development branch to see how it >> evolves. Personally, I've not brought my proposal up for discussion >> again, >> because I haven't had bandwidth to work on it, but I still think >> introducing some level of alternate encodings is a good idea. >> >> Cheers, >> Micah >> >> >> [1] >> >> https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf >> [2] https://github.com/apache/arrow/pull/4815 >> [3] >> >> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types >> >> On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen wrote: >> >> > Wes, what do you think about Arrow supporting a new suite of >> fixed-length >> > data types that unshuffle on column->Value(i) calls? This would allow >> > memory/swap compressors and memory maps backed by compressing >> > filesystems (ZFS) or block devices (VDO) to operate more efficiently. >> > >> > By doing it with new datatypes there is no separate flag to check? >> > >> > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney >> wrote: >> > >> > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen >> wrote: >> > > > >> > > > Again, I know very little about Parquet, so your patience is >> > appreciated. >> > > > >> > > > At the moment I can Arrow/mmap a file without having anywhere >> nearly as >> > > > much available memory as the file size. I can visit random place in >> > the >> > > > file (such as a binary search if it is ordered) and only the >> locations >> > > > visited by column->Value(i) are paged in. Paging them out happens >> > > without >> > > > my awareness, if necessary. >> > > > >> > > > Does Parquet cover this use-case with the same elegance and at least >> > > equal >> > > > efficiency, or are there more copies/conversions? Perhaps it >> requires >> > > the >> > > > entire file to be transformed into Arrow memory at the beginning? Or >> > on a >> > > > batch/block basis?
Or to get this I need to use a non-Arrow API for >> > data >> > > > element access? Etc. >> > > >> > > Data has to be materialized / deserialized from the Parquet file on a >> > > batch-wise per-column basis. The APIs we provide allow batches of >> > > values to be read for a given subset of columns >> > > >> > > > >> > > > IFF it covers the above use-case, which does not mention >> compression or >> > > > encoding, then I could consider whether it is interesting on those >> > > points. >> > > >> > > My point really has to do with Parquet's design which is about >> > > reducing file size. In the following blog post >> > > >> > > https://ursalabs.org/blog/2019-10-columnar-perf/ >> > > >> > > I examined a dataset which is about 4GB as raw Arrow stream/file but >> > > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal >> > > if you are working with filesystems that yield < 500MB/s (which >> > > includes pretty much all cloud filesystems AFAIK). In clickstream >> > > analytics this kind of compression ratio is not unusual. >> > > >> > > > >> > > > -John >> > > > >> > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < >> > > > fsaintjacq...@gmail.com> wrote: >> > > > >> > > > > What's the point of having zero copy if the OS is doing the >> > > > > decompression in kernel (which trumps the zero-copy argument)? You >> > > > > might as well just use parquet without filesystem compression. I >> > > > > prefer to
Re: [DISCUSS] Format additions for encoding/compression
Thanks Micah, I will see if I can find some time to explore this further. On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield wrote: > Hi John, > Not Wes, but my thoughts on this are as follows: > > 1. Alternate bit/byte arrangements can also be useful for processing [1] in > addition to compression. > 2. I think they are quite a bit more complicated then the existing schemes > proposed in [2], so I think it would be more expedient to get the > integration hooks necessary to work with simpler encodings before going > with something more complex. I believe the proposal is generic enough to > support this type of encoding. > 3. For prototyping, this seems like a potential use of the ExtensionType > [3] type mechanism already in the specification. > 4. I don't think these should be new types or part of the basic Array data > structure. I think having a different container format in the form of > "SparseRecordBatch" (or perhaps it should be renamed to EncodedRecordBatch) > and keeping the existing types with alternate encodings is a better option. > > That being said if you have bandwidth to get this working for C++ and Java > we can potentially setup a separate development branch to see how it > evolves. Personally, I've not brought my proposal up for discussion again, > because I haven't had bandwidth to work on it, but I still think > introducing some level of alternate encodings is a good idea. > > Cheers, > Micah > > > [1] > > https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf > [2] https://github.com/apache/arrow/pull/4815 > [3] > > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types > > On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen wrote: > > > Wes, what do you think about Arrow supporting a new suite of fixed-length > > data types that unshuffle on column->Value(i) calls? This would allow > > memory/swap compressors and memory maps backed by compressing > > filesystems (ZFS) or block devices (VDO) to operate more efficiently. > > > > By doing it with new datatypes there is no separate flag to check? > > > > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney > wrote: > > > > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > > > > > > > Again, I know very little about Parquet, so your patience is > > appreciated. > > > > > > > > At the moment I can Arrow/mmap a file without having anywhere nearly > as > > > > much available memory as the file size. I can visit random place in > > the > > > > file (such as a binary search if it is ordered) and only the > locations > > > > visited by column->Value(i) are paged in. Paging them out happens > > > without > > > > my awareness, if necessary. > > > > > > > > Does Parquet cover this use-case with the same elegance and at least > > > equal > > > > efficiency, or are there more copies/conversions? Perhaps it > requires > > > the > > > > entire file to be transformed into Arrow memory at the beginning? Or > > on a > > > > batch/block basis? Or to get this I need to use a non-Arrow API for > > data > > > > element access? Etc. > > > > > > Data has to be materialized / deserialized from the Parquet file on a > > > batch-wise per-column basis. The APIs we provide allow batches of > > > values to be read for a given subset of columns > > > > > > > > > > > IFF it covers the above use-case, which does not mention compression > or > > > > encoding, then I could consider whether it is interesting on those > > > points. 
> > > > > > My point really has to do with Parquet's design which is about > > > reducing file size. In the following blog post > > > > > > https://ursalabs.org/blog/2019-10-columnar-perf/ > > > > > > I examined a dataset which is about 4GB as raw Arrow stream/file but > > > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal > > > if you are working with filesystems that yield < 500MB/s (which > > > includes pretty much all cloud filesystems AFAIK). In clickstream > > > analytics this kind of compression ratio is not unusual. > > > > > > > > > > > -John > > > > > > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < > > > > fsaintjacq...@gmail.com> wrote: > > > > > > > > > What's the point of having zero copy if the OS is doing the > > > > > decompression in kernel (which trumps the zero-copy argument)? You > > > > > might as well just use parquet without filesystem compression. I > > > > > prefer to have compression algorithm where the columnar engine can > > > > > benefit from it [1] than marginally improving a file-system-os > > > > > specific feature. > > > > > > > > > > François > > > > > > > > > > [1] Section 4.3 > http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen > > wrote: > > > > > > > > > > > > This could also have utility in memory via things like > zram/zswap, > > > right? > > > > > > Mac also has a memory compressor? > > >
Re: [DISCUSS] Format additions for encoding/compression
Hi John, Not Wes, but my thoughts on this are as follows: 1. Alternate bit/byte arrangements can also be useful for processing [1] in addition to compression. 2. I think they are quite a bit more complicated then the existing schemes proposed in [2], so I think it would be more expedient to get the integration hooks necessary to work with simpler encodings before going with something more complex. I believe the proposal is generic enough to support this type of encoding. 3. For prototyping, this seems like a potential use of the ExtensionType [3] type mechanism already in the specification. 4. I don't think these should be new types or part of the basic Array data structure. I think having a different container format in the form of "SparseRecordBatch" (or perhaps it should be renamed to EncodedRecordBatch) and keeping the existing types with alternate encodings is a better option. That being said if you have bandwidth to get this working for C++ and Java we can potentially setup a separate development branch to see how it evolves. Personally, I've not brought my proposal up for discussion again, because I haven't had bandwidth to work on it, but I still think introducing some level of alternate encodings is a good idea. Cheers, Micah [1] https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf [2] https://github.com/apache/arrow/pull/4815 [3] https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen wrote: > Wes, what do you think about Arrow supporting a new suite of fixed-length > data types that unshuffle on column->Value(i) calls? This would allow > memory/swap compressors and memory maps backed by compressing > filesystems (ZFS) or block devices (VDO) to operate more efficiently. > > By doing it with new datatypes there is no separate flag to check? > > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney wrote: > > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > > > > > Again, I know very little about Parquet, so your patience is > appreciated. > > > > > > At the moment I can Arrow/mmap a file without having anywhere nearly as > > > much available memory as the file size. I can visit random place in > the > > > file (such as a binary search if it is ordered) and only the locations > > > visited by column->Value(i) are paged in. Paging them out happens > > without > > > my awareness, if necessary. > > > > > > Does Parquet cover this use-case with the same elegance and at least > > equal > > > efficiency, or are there more copies/conversions? Perhaps it requires > > the > > > entire file to be transformed into Arrow memory at the beginning? Or > on a > > > batch/block basis? Or to get this I need to use a non-Arrow API for > data > > > element access? Etc. > > > > Data has to be materialized / deserialized from the Parquet file on a > > batch-wise per-column basis. The APIs we provide allow batches of > > values to be read for a given subset of columns > > > > > > > > IFF it covers the above use-case, which does not mention compression or > > > encoding, then I could consider whether it is interesting on those > > points. > > > > My point really has to do with Parquet's design which is about > > reducing file size. In the following blog post > > > > https://ursalabs.org/blog/2019-10-columnar-perf/ > > > > I examined a dataset which is about 4GB as raw Arrow stream/file but > > only 114 MB as a Parquet file. 
A 30+X compression ratio is a huge deal > > if you are working with filesystems that yield < 500MB/s (which > > includes pretty much all cloud filesystems AFAIK). In clickstream > > analytics this kind of compression ratio is not unusual. > > > > > > > > -John > > > > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < > > > fsaintjacq...@gmail.com> wrote: > > > > > > > What's the point of having zero copy if the OS is doing the > > > > decompression in kernel (which trumps the zero-copy argument)? You > > > > might as well just use parquet without filesystem compression. I > > > > prefer to have compression algorithm where the columnar engine can > > > > benefit from it [1] than marginally improving a file-system-os > > > > specific feature. > > > > > > > > François > > > > > > > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > > > > > > > > > > > > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen > wrote: > > > > > > > > > > This could also have utility in memory via things like zram/zswap, > > right? > > > > > Mac also has a memory compressor? > > > > > > > > > > I don't think Parquet is an option for me unless the integration > with > > > > Arrow > > > > > is tighter than I imagine (i.e. zero-copy). That said, I confess I > > know > > > > > next to nothing about Parquet. > > > > > > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou < > anto...@python.org> > > > > wrote: > > > > > > > > > > > > > > > > >
Re: [DISCUSS] Format additions for encoding/compression
Wes, what do you think about Arrow supporting a new suite of fixed-length data types that unshuffle on column->Value(i) calls? This would allow memory/swap compressors and memory maps backed by compressing filesystems (ZFS) or block devices (VDO) to operate more efficiently. By doing it with new datatypes there is no separate flag to check? On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney wrote: > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > > > Again, I know very little about Parquet, so your patience is appreciated. > > > > At the moment I can Arrow/mmap a file without having anywhere nearly as > > much available memory as the file size. I can visit random place in the > > file (such as a binary search if it is ordered) and only the locations > > visited by column->Value(i) are paged in. Paging them out happens > without > > my awareness, if necessary. > > > > Does Parquet cover this use-case with the same elegance and at least > equal > > efficiency, or are there more copies/conversions? Perhaps it requires > the > > entire file to be transformed into Arrow memory at the beginning? Or on a > > batch/block basis? Or to get this I need to use a non-Arrow API for data > > element access? Etc. > > Data has to be materialized / deserialized from the Parquet file on a > batch-wise per-column basis. The APIs we provide allow batches of > values to be read for a given subset of columns > > > > > IFF it covers the above use-case, which does not mention compression or > > encoding, then I could consider whether it is interesting on those > points. > > My point really has to do with Parquet's design which is about > reducing file size. In the following blog post > > https://ursalabs.org/blog/2019-10-columnar-perf/ > > I examined a dataset which is about 4GB as raw Arrow stream/file but > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal > if you are working with filesystems that yield < 500MB/s (which > includes pretty much all cloud filesystems AFAIK). In clickstream > analytics this kind of compression ratio is not unusual. > > > > > -John > > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < > > fsaintjacq...@gmail.com> wrote: > > > > > What's the point of having zero copy if the OS is doing the > > > decompression in kernel (which trumps the zero-copy argument)? You > > > might as well just use parquet without filesystem compression. I > > > prefer to have compression algorithm where the columnar engine can > > > benefit from it [1] than marginally improving a file-system-os > > > specific feature. > > > > > > François > > > > > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > > > > > > > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen wrote: > > > > > > > > This could also have utility in memory via things like zram/zswap, > right? > > > > Mac also has a memory compressor? > > > > > > > > I don't think Parquet is an option for me unless the integration with > > > Arrow > > > > is tighter than I imagine (i.e. zero-copy). That said, I confess I > know > > > > next to nothing about Parquet. > > > > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou > > > wrote: > > > > > > > > > > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > > > > > Perhaps related to this thread, are there any current or proposed > > > tools > > > > to > > > > > > transform columns for fixed-length data types according to a > > > "shuffle?" 
> > > > > > For precedent see the implementation of the shuffle filter in > hdf5. > > > > > > > > > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > > > > > > > > > For example, the column (length 3) would store bytes 00 00 00 00 > 00 > > > 00 > > > > 00 > > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 > 00 > > > 00 > > > > 00 > > > > > > 02 00 00 00 03 (I'm writing big-endian even if that is not > actually > > > the > > > > > > case). > > > > > > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata > flag > > > > that > > > > > > the column is shuffled, stitching the bytes back together at call > > > time. > > > > > > > > > > > > Thus if the column pages were backed by a memory map to something > > > like > > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% > savings > > > in > > > > > > underlying disk usage due to better run lengths. > > > > > > > > > > > > It would enable a space/time tradeoff that could be useful? The > > > > filesystem > > > > > > itself cannot easily do this particular compression transform > since > > > it > > > > > > benefits from knowing the shape of the data. > > > > > > > > > > For the record, there's a pull request adding this encoding to the > > > > > Parquet C++ specification. > > > > > > > > > > Regards > > > > > > > > > > Antoine. > > > >
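To make the shuffle idea quoted above concrete, here is a minimal sketch in plain C++ (the function names and the int32/little-endian choice are purely illustrative): the transform stores byte 0 of every value, then byte 1, and so on, and a single element can be stitched back together on demand without unshuffling the whole buffer.

#include <cstddef>
#include <cstdint>
#include <cstring>

// Shuffle a buffer of fixed-width values: all first bytes, then all second
// bytes, etc. (the same transform as the HDF5 shuffle filter cited above).
void ShuffleBytes(const uint8_t* src, uint8_t* dst,
                  size_t n_values, size_t width) {
  for (size_t i = 0; i < n_values; ++i)
    for (size_t b = 0; b < width; ++b)
      dst[b * n_values + i] = src[i * width + b];
}

// Reassemble element i on demand, as a hypothetical Value(i) could do after
// checking a "shuffled" flag (or seeing a shuffled datatype) in the metadata.
int32_t UnshuffledValue(const uint8_t* shuffled, size_t n_values, size_t i) {
  uint8_t bytes[sizeof(int32_t)];
  for (size_t b = 0; b < sizeof(int32_t); ++b)
    bytes[b] = shuffled[b * n_values + i];
  int32_t out;
  std::memcpy(&out, bytes, sizeof(out));
  return out;
}

The long runs of identical leading bytes produced by ShuffleBytes are what a compressing filesystem or memory compressor can exploit; the per-element cost of Value(i) grows from a single load to a small gather.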
Re: [DISCUSS] Format additions for encoding/compression
On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > Again, I know very little about Parquet, so your patience is appreciated. > > At the moment I can Arrow/mmap a file without having anywhere nearly as > much available memory as the file size. I can visit random place in the > file (such as a binary search if it is ordered) and only the locations > visited by column->Value(i) are paged in. Paging them out happens without > my awareness, if necessary. > > Does Parquet cover this use-case with the same elegance and at least equal > efficiency, or are there more copies/conversions? Perhaps it requires the > entire file to be transformed into Arrow memory at the beginning? Or on a > batch/block basis? Or to get this I need to use a non-Arrow API for data > element access? Etc. Data has to be materialized / deserialized from the Parquet file on a batch-wise per-column basis. The APIs we provide allow batches of values to be read for a given subset of columns > > IFF it covers the above use-case, which does not mention compression or > encoding, then I could consider whether it is interesting on those points. My point really has to do with Parquet's design which is about reducing file size. In the following blog post https://ursalabs.org/blog/2019-10-columnar-perf/ I examined a dataset which is about 4GB as raw Arrow stream/file but only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal if you are working with filesystems that yield < 500MB/s (which includes pretty much all cloud filesystems AFAIK). In clickstream analytics this kind of compression ratio is not unusual. > > -John > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < > fsaintjacq...@gmail.com> wrote: > > > What's the point of having zero copy if the OS is doing the > > decompression in kernel (which trumps the zero-copy argument)? You > > might as well just use parquet without filesystem compression. I > > prefer to have compression algorithm where the columnar engine can > > benefit from it [1] than marginally improving a file-system-os > > specific feature. > > > > François > > > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen wrote: > > > > > > This could also have utility in memory via things like zram/zswap, right? > > > Mac also has a memory compressor? > > > > > > I don't think Parquet is an option for me unless the integration with > > Arrow > > > is tighter than I imagine (i.e. zero-copy). That said, I confess I know > > > next to nothing about Parquet. > > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou > > wrote: > > > > > > > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > > > > Perhaps related to this thread, are there any current or proposed > > tools > > > to > > > > > transform columns for fixed-length data types according to a > > "shuffle?" > > > > > For precedent see the implementation of the shuffle filter in hdf5. > > > > > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > > > > > > > For example, the column (length 3) would store bytes 00 00 00 00 00 > > 00 > > > 00 > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 > > 00 > > > 00 > > > > > 02 00 00 00 03 (I'm writing big-endian even if that is not actually > > the > > > > > case). 
> > > > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag > > > that > > > > > the column is shuffled, stitching the bytes back together at call > > time. > > > > > > > > > > Thus if the column pages were backed by a memory map to something > > like > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings > > in > > > > > underlying disk usage due to better run lengths. > > > > > > > > > > It would enable a space/time tradeoff that could be useful? The > > > filesystem > > > > > itself cannot easily do this particular compression transform since > > it > > > > > benefits from knowing the shape of the data. > > > > > > > > For the record, there's a pull request adding this encoding to the > > > > Parquet C++ specification. > > > > > > > > Regards > > > > > > > > Antoine. > >
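For reference, a rough sketch of the batch-wise, per-column materialization described above, using the C++ Parquet-to-Arrow reader (the file path and column indices are placeholders, and the exact signatures may differ slightly between Arrow releases):

#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/table.h"
#include "parquet/arrow/reader.h"
#include "parquet/exception.h"

// Decode only columns 0 and 2 of a Parquet file into Arrow memory.
std::shared_ptr<arrow::Table> ReadColumnSubset(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable({0, 2}, &table));  // column subset
  return table;
}

Every value read this way is decompressed, decoded and copied into freshly allocated Arrow buffers, which is the conversion cost being weighed against the much smaller file.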
Re: [DISCUSS] Format additions for encoding/compression
Again, I know very little about Parquet, so your patience is appreciated. At the moment I can Arrow/mmap a file without having anywhere nearly as much available memory as the file size. I can visit random place in the file (such as a binary search if it is ordered) and only the locations visited by column->Value(i) are paged in. Paging them out happens without my awareness, if necessary. Does Parquet cover this use-case with the same elegance and at least equal efficiency, or are there more copies/conversions? Perhaps it requires the entire file to be transformed into Arrow memory at the beginning? Or on a batch/block basis? Or to get this I need to use a non-Arrow API for data element access? Etc. IFF it covers the above use-case, which does not mention compression or encoding, then I could consider whether it is interesting on those points. -John On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > What's the point of having zero copy if the OS is doing the > decompression in kernel (which trumps the zero-copy argument)? You > might as well just use parquet without filesystem compression. I > prefer to have compression algorithm where the columnar engine can > benefit from it [1] than marginally improving a file-system-os > specific feature. > > François > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen wrote: > > > > This could also have utility in memory via things like zram/zswap, right? > > Mac also has a memory compressor? > > > > I don't think Parquet is an option for me unless the integration with > Arrow > > is tighter than I imagine (i.e. zero-copy). That said, I confess I know > > next to nothing about Parquet. > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou > wrote: > > > > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > > > Perhaps related to this thread, are there any current or proposed > tools > > to > > > > transform columns for fixed-length data types according to a > "shuffle?" > > > > For precedent see the implementation of the shuffle filter in hdf5. > > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > > > > > For example, the column (length 3) would store bytes 00 00 00 00 00 > 00 > > 00 > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 > 00 > > 00 > > > > 02 00 00 00 03 (I'm writing big-endian even if that is not actually > the > > > > case). > > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag > > that > > > > the column is shuffled, stitching the bytes back together at call > time. > > > > > > > > Thus if the column pages were backed by a memory map to something > like > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings > in > > > > underlying disk usage due to better run lengths. > > > > > > > > It would enable a space/time tradeoff that could be useful? The > > filesystem > > > > itself cannot easily do this particular compression transform since > it > > > > benefits from knowing the shape of the data. > > > > > > For the record, there's a pull request adding this encoding to the > > > Parquet C++ specification. > > > > > > Regards > > > > > > Antoine. >
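By way of comparison, a sketch of the zero-copy access pattern described here, using the Result-returning C++ IPC APIs (the path, batch/column indices and Int64 type are placeholders; a real application would cache the reader rather than reopening the file per probe):

#include <memory>
#include <string>

#include "arrow/array.h"
#include "arrow/io/file.h"
#include "arrow/ipc/reader.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"

// Map an Arrow IPC file and probe a single value; only the pages actually
// touched by the metadata and by Value(i) get faulted in by the OS.
arrow::Result<int64_t> ProbeValue(const std::string& path,
                                  int batch_index, int column_index, int64_t i) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::MemoryMappedFile::Open(
                                       path, arrow::io::FileMode::READ));
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(file));
  ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(batch_index));
  auto column = std::static_pointer_cast<arrow::Int64Array>(
      batch->column(column_index));
  return column->Value(i);
}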
Re: [DISCUSS] Format additions for encoding/compression
Parquet is most relevant in scenarios where filesystem IO is constrained (spinning rust HDD, network FS, cloud storage / S3 / GCS). For those use cases, memory-mapped Arrow is not viable. Against local NVMe (> 2000 MB/s read throughput) your mileage may vary. On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques wrote: > > What's the point of having zero copy if the OS is doing the > decompression in kernel (which trumps the zero-copy argument)? You > might as well just use parquet without filesystem compression. I > prefer to have compression algorithm where the columnar engine can > benefit from it [1] than marginally improving a file-system-os > specific feature. > > François > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen wrote: > > > > This could also have utility in memory via things like zram/zswap, right? > > Mac also has a memory compressor? > > > > I don't think Parquet is an option for me unless the integration with Arrow > > is tighter than I imagine (i.e. zero-copy). That said, I confess I know > > next to nothing about Parquet. > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou wrote: > > > > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > > > Perhaps related to this thread, are there any current or proposed tools > > to > > > > transform columns for fixed-length data types according to a "shuffle?" > > > > For precedent see the implementation of the shuffle filter in hdf5. > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > > > > > For example, the column (length 3) would store bytes 00 00 00 00 00 00 > > 00 > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 > > 00 > > > > 02 00 00 00 03 (I'm writing big-endian even if that is not actually the > > > > case). > > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag > > that > > > > the column is shuffled, stitching the bytes back together at call time. > > > > > > > > Thus if the column pages were backed by a memory map to something like > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in > > > > underlying disk usage due to better run lengths. > > > > > > > > It would enable a space/time tradeoff that could be useful? The > > filesystem > > > > itself cannot easily do this particular compression transform since it > > > > benefits from knowing the shape of the data. > > > > > > For the record, there's a pull request adding this encoding to the > > > Parquet C++ specification. > > > > > > Regards > > > > > > Antoine.
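Rough, order-of-magnitude arithmetic for that trade-off, using the figures already mentioned in this thread and ignoring decode cost for the uncompressed file:

    4 GB raw Arrow / 500 MB/s (cloud FS)     ~ 8 s of pure IO
    114 MB Parquet / 500 MB/s                ~ 0.25 s of IO, plus decode time
    4 GB raw Arrow / 2000 MB/s (local NVMe)  ~ 2 s of pure IO

On slow storage the ~30x size reduction dominates even a fairly expensive decode; on fast local NVMe (or an already-warm page cache) the balance shifts back toward the uncompressed, mmap-able layout.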
Re: [DISCUSS] Format additions for encoding/compression
This could also have utility in memory via things like zram/zswap, right? Mac also has a memory compressor? I don't think Parquet is an option for me unless the integration with Arrow is tighter than I imagine (i.e. zero-copy). That said, I confess I know next to nothing about Parquet. On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou wrote: > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > Perhaps related to this thread, are there any current or proposed tools to > > transform columns for fixed-length data types according to a "shuffle?" > > For precedent see the implementation of the shuffle filter in hdf5. > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > For example, the column (length 3) would store bytes 00 00 00 00 00 00 00 > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00 > > 02 00 00 00 03 (I'm writing big-endian even if that is not actually the > > case). > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag that > > the column is shuffled, stitching the bytes back together at call time. > > > > Thus if the column pages were backed by a memory map to something like > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in > > underlying disk usage due to better run lengths. > > > > It would enable a space/time tradeoff that could be useful? The filesystem > > itself cannot easily do this particular compression transform since it > > benefits from knowing the shape of the data. > > For the record, there's a pull request adding this encoding to the > Parquet C++ specification. > > Regards > > Antoine.
Re: [DISCUSS] Format additions for encoding/compression
Forgot to give the URL: https://github.com/apache/arrow/pull/6005 Regards Antoine. Le 23/01/2020 à 18:23, Antoine Pitrou a écrit : > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : >> Perhaps related to this thread, are there any current or proposed tools to >> transform columns for fixed-length data types according to a "shuffle?" >> For precedent see the implementation of the shuffle filter in hdf5. >> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf >> >> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00 >> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00 >> 02 00 00 00 03 (I'm writing big-endian even if that is not actually the >> case). >> >> Value(1) would return 00 00 00 02 by referring to some metadata flag that >> the column is shuffled, stitching the bytes back together at call time. >> >> Thus if the column pages were backed by a memory map to something like >> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in >> underlying disk usage due to better run lengths. >> >> It would enable a space/time tradeoff that could be useful? The filesystem >> itself cannot easily do this particular compression transform since it >> benefits from knowing the shape of the data. > > For the record, there's a pull request adding this encoding to the > Parquet C++ specification. > > Regards > > Antoine. >
Re: [DISCUSS] Format additions for encoding/compression
Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > Perhaps related to this thread, are there any current or proposed tools to > transform columns for fixed-length data types according to a "shuffle?" > For precedent see the implementation of the shuffle filter in hdf5. > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > For example, the column (length 3) would store bytes 00 00 00 00 00 00 00 > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00 > 02 00 00 00 03 (I'm writing big-endian even if that is not actually the > case). > > Value(1) would return 00 00 00 02 by referring to some metadata flag that > the column is shuffled, stitching the bytes back together at call time. > > Thus if the column pages were backed by a memory map to something like > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in > underlying disk usage due to better run lengths. > > It would enable a space/time tradeoff that could be useful? The filesystem > itself cannot easily do this particular compression transform since it > benefits from knowing the shape of the data. For the record, there's a pull request adding this encoding to the Parquet C++ specification. Regards Antoine.
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
Perhaps related to this thread, are there any current or proposed tools to transform columns for fixed-length data types according to a "shuffle?" For precedent see the implementation of the shuffle filter in hdf5. https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf For example, the column (length 3) would store bytes 00 00 00 00 00 00 00 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00 02 00 00 00 03 (I'm writing big-endian even if that is not actually the case). Value(1) would return 00 00 00 02 by referring to some metadata flag that the column is shuffled, stitching the bytes back together at call time. Thus if the column pages were backed by a memory map to something like zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in underlying disk usage due to better run lengths. It would enable a space/time tradeoff that could be useful? The filesystem itself cannot easily do this particular compression transform since it benefits from knowing the shape of the data. -John On Sun, Aug 25, 2019 at 10:30 PM Micah Kornfield wrote: > > Hi Ippokratis, > Thank you for the feedback, I have some questions based on the links you > provided. > > > > I think that lightweight encodings (like the FrameOfReference Micah > > suggests) do make a lot of sense for Arrow. There are a few implementations > > of those in commercial systems. One related paper in the literature is > > http://www.cs.columbia.edu/~orestis/damon15.pdf > > > This paper seems to suggest more complex encodings I was imagining for the > the first implementation. Specifically, I proposed using only codes that > are 2^N bits (8, 16, 32, and 64). Do you think it is is critical to have > the dense bit-packing in an initial version? > > > > > I would actually also look into some order-preserving dictionary encodings > > for strings that also allow vectorized processing (predicates, joins, ..) > > on encoded data, e.g. see > > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf > > . > > The IPC spec [1] already has some metadata about the ordering of > dictionaries, but this might not be sufficient. The paper linked here > seems to recommend two things: > 1. Treating dictionaries as explicit mappings between value and integer > code today is is implicit because the dictionaries are lists indexed by > code. It seems like for forward-compatibility we should add a type enum to > the Dictionary Encoding metadata. > 2. Adding indexes to the dictionaries. For this, did you imagine the > indexes would be transferred or something built up on receiving batches? > > Arrow can be used as during shuffles for distributed joins/aggs and being > > able to operate on encoded data yields benefits (e.g. > > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf). > > The main take-away I got after skimming this paper, as it relates to > encodings, is that encodings (including dictionary) should be dynamic per > batch. The other interesting question it raises with respect to Arrow is > one of the techniques used is delta-encoding. I believe delta encoding > requires linear time access. The dense representations in Arrow was > designed to have constant time access to elements. One open question on how > far we want to relax this requirement for encoded columns. My proposal > uses a form of RLE that provide O(Log(N)) access). 
> > Cheers, > Micah > > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285 > > On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis > wrote: > > > I think that lightweight encodings (like the FrameOfReference Micah > > suggests) do make a lot of sense for Arrow. There are a few implementations > > of those in commercial systems. One related paper in the literature is > > http://www.cs.columbia.edu/~orestis/damon15.pdf > > > > I would actually also look into some order-preserving dictionary encodings > > for strings that also allow vectorized processing (predicates, joins, ..) > > on encoded data, e.g. see > > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf > > . > > > > Arrow can be used as during shuffles for distributed joins/aggs and being > > able to operate on encoded data yields benefits (e.g. > > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf). > > > > Thanks, > > -Ippokratis. > > > > > > On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield > > wrote: > > > >> > > >> > It's not just computation libraries, it's any library peeking inside > >> > Arrow data. Currently, the Arrow data types are simple, which makes it > >> > easy and non-intimidating to build data processing utilities around > >> > them. If we start adding sophisticated encodings, we also raise the > >> > cost of supporting Arrow for third-party libraries. > >> > >> > >> This is another legitimate concern about complexity. > >> > >> To try to limit complexity. I simplified the proposal PR [1] to only have >
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
Hi Ippokratis, Thank you for the feedback, I have some questions based on the links you provided. > I think that lightweight encodings (like the FrameOfReference Micah > suggests) do make a lot of sense for Arrow. There are a few implementations > of those in commercial systems. One related paper in the literature is > http://www.cs.columbia.edu/~orestis/damon15.pdf This paper seems to suggest more complex encodings than I was imagining for the first implementation. Specifically, I proposed using only codes that are 2^N bits (8, 16, 32, and 64). Do you think it is critical to have the dense bit-packing in an initial version? > > I would actually also look into some order-preserving dictionary encodings > for strings that also allow vectorized processing (predicates, joins, ..) > on encoded data, e.g. see > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf > . The IPC spec [1] already has some metadata about the ordering of dictionaries, but this might not be sufficient. The paper linked here seems to recommend two things: 1. Treating dictionaries as explicit mappings between value and integer code; today this is implicit because the dictionaries are lists indexed by code. It seems like for forward-compatibility we should add a type enum to the Dictionary Encoding metadata. 2. Adding indexes to the dictionaries. For this, did you imagine the indexes would be transferred or something built up on receiving batches? Arrow can be used as during shuffles for distributed joins/aggs and being > able to operate on encoded data yields benefits (e.g. > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf). The main take-away I got after skimming this paper, as it relates to encodings, is that encodings (including dictionary) should be dynamic per batch. The other interesting question it raises with respect to Arrow is that one of the techniques used is delta encoding. I believe delta encoding requires linear time access. The dense representations in Arrow were designed to have constant-time access to elements. One open question is how far we want to relax this requirement for encoded columns. My proposal uses a form of RLE that provides O(log(N)) access. Cheers, Micah [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285 On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis wrote: > I think that lightweight encodings (like the FrameOfReference Micah > suggests) do make a lot of sense for Arrow. There are a few implementations > of those in commercial systems. One related paper in the literature is > http://www.cs.columbia.edu/~orestis/damon15.pdf > > I would actually also look into some order-preserving dictionary encodings > for strings that also allow vectorized processing (predicates, joins, ..) > on encoded data, e.g. see > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf > . > > Arrow can be used as during shuffles for distributed joins/aggs and being > able to operate on encoded data yields benefits (e.g. > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf). > > Thanks, > -Ippokratis. > > > On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield > wrote: > >> > >> > It's not just computation libraries, it's any library peeking inside >> > Arrow data. Currently, the Arrow data types are simple, which makes it >> > easy and non-intimidating to build data processing utilities around >> > them. If we start adding sophisticated encodings, we also raise the >> > cost of supporting Arrow for third-party libraries.
>> >> >> This is another legitimate concern about complexity. >> >> To try to limit complexity. I simplified the proposal PR [1] to only have >> 1 >> buffer encoding (FrameOfReferenceIntEncoding) scheme and 1 array encoding >> scheme (RLE) that I think will have the most benefit if exploited >> properly. Compression is removed. >> >> I'd like to get closure on the proposal one way or another. I think now >> the question to be answered is if we are willing to introduce the >> additional complexity for the performance improvements they can yield? Is >> there more data that people would like to see that would influence their >> decision? >> >> Thanks, >> Micah >> >> [1] https://github.com/apache/arrow/pull/4815 >> >> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou >> wrote: >> >> > On Mon, 22 Jul 2019 08:40:08 -0700 >> > Brian Hulette wrote: >> > > To me, the most important aspect of this proposal is the addition of >> > sparse >> > > encodings, and I'm curious if there are any more objections to that >> > > specifically. So far I believe the only one is that it will make >> > > computation libraries more complicated. This is absolutely true, but I >> > > think it's worth that cost. >> > >> > It's not just computation libraries, it's any library peeking inside >> > Arrow data. Currently, the Arrow data types are simple, which makes it >> > easy and non-intimidating to build data processing utilities around >> > them. If we
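To make the O(log(N)) access point above concrete, a minimal sketch of one possible run-length layout (run values plus cumulative run end offsets; this is an illustration, not the exact buffers in the proposal):

#include <algorithm>
#include <cstdint>
#include <vector>

// Run-length encoded column: run_values[k] is the value of run k and
// run_ends[k] is the logical index one past the end of run k (cumulative,
// strictly increasing; back() equals the logical length of the column).
struct RleColumn {
  std::vector<int64_t> run_values;
  std::vector<int64_t> run_ends;

  // O(log N) random access: binary-search for the run containing index i.
  int64_t Value(int64_t i) const {
    auto it = std::upper_bound(run_ends.begin(), run_ends.end(), i);
    return run_values[it - run_ends.begin()];
  }
};

Sequential scans remain linear in the number of runs rather than the number of rows, which is where the computational win over the dense representation comes from.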
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
I think that lightweight encodings (like the FrameOfReference Micah suggests) do make a lot of sense for Arrow. There are a few implementations of those in commercial systems. One related paper in the literature is http://www.cs.columbia.edu/~orestis/damon15.pdf I would actually also look into some order-preserving dictionary encodings for strings that also allow vectorized processing (predicates, joins, ..) on encoded data, e.g. see https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf . Arrow can be used as during shuffles for distributed joins/aggs and being able to operate on encoded data yields benefits (e.g. http://www.vldb.org/pvldb/vol7/p1355-lee.pdf). Thanks, -Ippokratis. On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield wrote: > > > > It's not just computation libraries, it's any library peeking inside > > Arrow data. Currently, the Arrow data types are simple, which makes it > > easy and non-intimidating to build data processing utilities around > > them. If we start adding sophisticated encodings, we also raise the > > cost of supporting Arrow for third-party libraries. > > > This is another legitimate concern about complexity. > > To try to limit complexity. I simplified the proposal PR [1] to only have 1 > buffer encoding (FrameOfReferenceIntEncoding) scheme and 1 array encoding > scheme (RLE) that I think will have the most benefit if exploited > properly. Compression is removed. > > I'd like to get closure on the proposal one way or another. I think now > the question to be answered is if we are willing to introduce the > additional complexity for the performance improvements they can yield? Is > there more data that people would like to see that would influence their > decision? > > Thanks, > Micah > > [1] https://github.com/apache/arrow/pull/4815 > > On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou > wrote: > > > On Mon, 22 Jul 2019 08:40:08 -0700 > > Brian Hulette wrote: > > > To me, the most important aspect of this proposal is the addition of > > sparse > > > encodings, and I'm curious if there are any more objections to that > > > specifically. So far I believe the only one is that it will make > > > computation libraries more complicated. This is absolutely true, but I > > > think it's worth that cost. > > > > It's not just computation libraries, it's any library peeking inside > > Arrow data. Currently, the Arrow data types are simple, which makes it > > easy and non-intimidating to build data processing utilities around > > them. If we start adding sophisticated encodings, we also raise the > > cost of supporting Arrow for third-party libraries. > > > > Regards > > > > Antoine. > > > > > > > -- -Ippokratis.
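A small sketch of the order-preserving dictionary idea from the Binnig et al. paper referenced above (names are illustrative): codes are assigned in sorted order of the distinct values, so range predicates and joins can compare the integer codes directly instead of decoding the strings.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Build an order-preserving dictionary: code(a) < code(b)  <=>  a < b.
std::vector<std::string> BuildOrderedDictionary(std::vector<std::string> values) {
  std::sort(values.begin(), values.end());
  values.erase(std::unique(values.begin(), values.end()), values.end());
  return values;  // the index into this vector is the code
}

int32_t Encode(const std::vector<std::string>& dict, const std::string& v) {
  return static_cast<int32_t>(
      std::lower_bound(dict.begin(), dict.end(), v) - dict.begin());
}

// Evaluate the predicate "value < bound" directly on an encoded column.
bool CodeLessThan(const std::vector<std::string>& dict,
                  int32_t code, const std::string& bound) {
  return code < Encode(dict, bound);
}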
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
> > It's not just computation libraries, it's any library peeking inside > Arrow data. Currently, the Arrow data types are simple, which makes it > easy and non-intimidating to build data processing utilities around > them. If we start adding sophisticated encodings, we also raise the > cost of supporting Arrow for third-party libraries. This is another legitimate concern about complexity. To try to limit complexity. I simplified the proposal PR [1] to only have 1 buffer encoding (FrameOfReferenceIntEncoding) scheme and 1 array encoding scheme (RLE) that I think will have the most benefit if exploited properly. Compression is removed. I'd like to get closure on the proposal one way or another. I think now the question to be answered is if we are willing to introduce the additional complexity for the performance improvements they can yield? Is there more data that people would like to see that would influence their decision? Thanks, Micah [1] https://github.com/apache/arrow/pull/4815 On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou wrote: > On Mon, 22 Jul 2019 08:40:08 -0700 > Brian Hulette wrote: > > To me, the most important aspect of this proposal is the addition of > sparse > > encodings, and I'm curious if there are any more objections to that > > specifically. So far I believe the only one is that it will make > > computation libraries more complicated. This is absolutely true, but I > > think it's worth that cost. > > It's not just computation libraries, it's any library peeking inside > Arrow data. Currently, the Arrow data types are simple, which makes it > easy and non-intimidating to build data processing utilities around > them. If we start adding sophisticated encodings, we also raise the > cost of supporting Arrow for third-party libraries. > > Regards > > Antoine. > > >
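For reference, a sketch of what a byte-aligned frame-of-reference buffer encoding could look like (the struct and function names are illustrative; only the 8-bit width is shown, and the 16/32/64-bit widths would follow the same pattern):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Frame of reference: store the minimum once, then each value minus the
// minimum in the narrowest 2^N-bit width that covers the range.
struct For8Encoded {
  int64_t reference;            // the frame of reference (column minimum)
  std::vector<uint8_t> deltas;  // offsets from the reference
};

For8Encoded EncodeFor8(const std::vector<int64_t>& values) {
  // Assumes a non-empty input whose value range fits in 8 bits.
  For8Encoded out;
  out.reference = *std::min_element(values.begin(), values.end());
  out.deltas.reserve(values.size());
  for (int64_t v : values)
    out.deltas.push_back(static_cast<uint8_t>(v - out.reference));
  return out;
}

int64_t Decode(const For8Encoded& enc, size_t i) {
  return enc.reference + enc.deltas[i];  // constant-time random access is preserved
}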
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
On Mon, 22 Jul 2019 08:40:08 -0700 Brian Hulette wrote: > To me, the most important aspect of this proposal is the addition of sparse > encodings, and I'm curious if there are any more objections to that > specifically. So far I believe the only one is that it will make > computation libraries more complicated. This is absolutely true, but I > think it's worth that cost. It's not just computation libraries, it's any library peeking inside Arrow data. Currently, the Arrow data types are simple, which makes it easy and non-intimidating to build data processing utilities around them. If we start adding sophisticated encodings, we also raise the cost of supporting Arrow for third-party libraries. Regards Antoine.
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
To me, the most important aspect of this proposal is the addition of sparse encodings, and I'm curious if there are any more objections to that specifically. So far I believe the only one is that it will make computation libraries more complicated. This is absolutely true, but I think it's worth that cost. It's been suggested on this list and elsewhere [1] that sparse encodings that can be operated on without fully decompressing should be added to the Arrow format. The longer we continue to develop computation libraries without considering those schemes, the harder it will be to add them. [1] https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html On Sat, Jul 13, 2019 at 9:35 AM Wes McKinney wrote: > On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou > wrote: > > > > On Fri, 12 Jul 2019 20:37:15 -0700 > > Micah Kornfield wrote: > > > > > > If the latter, I wonder why Parquet cannot simply be used instead of > > > > reinventing something similar but different. > > > > > > This is a reasonable point. However there is continuum here between > file > > > size and read and write times. Parquet will likely always be the > smallest > > > with the largest times to convert to and from Arrow. An uncompressed > > > Feather/Arrow file will likely always take the most space but will much > > > faster conversion times. > > > > I'm curious whether the Parquet conversion times are inherent to the > > Parquet format or due to inefficiencies in the implementation. > > > > Parquet is fundamentally more complex to decode. Consider several > layers of logic that must happen for values to end up in the right > place > > * Data pages are usually compressed, and a column consists of many > data pages each having a Thrift header that must be deserialized > * Values are usually dictionary-encoded, dictionary indices are > encoded using hybrid bit-packed / RLE scheme > * Null/not-null is encoded in definition levels > * Only non-null values are stored, so when decoding to Arrow, values > have to be "moved into place" > > The current C++ implementation could certainly be made faster. One > consideration with Parquet is that the files are much smaller, so when > you are reading them over the network the effective end-to-end time > including IO and deserialization will frequently win. > > > Regards > > > > Antoine. > > > > >
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou wrote: > > On Fri, 12 Jul 2019 20:37:15 -0700 > Micah Kornfield wrote: > > > > If the latter, I wonder why Parquet cannot simply be used instead of > > > reinventing something similar but different. > > > > This is a reasonable point. However there is continuum here between file > > size and read and write times. Parquet will likely always be the smallest > > with the largest times to convert to and from Arrow. An uncompressed > > Feather/Arrow file will likely always take the most space but will much > > faster conversion times. > > I'm curious whether the Parquet conversion times are inherent to the > Parquet format or due to inefficiencies in the implementation. > Parquet is fundamentally more complex to decode. Consider several layers of logic that must happen for values to end up in the right place * Data pages are usually compressed, and a column consists of many data pages each having a Thrift header that must be deserialized * Values are usually dictionary-encoded, dictionary indices are encoded using hybrid bit-packed / RLE scheme * Null/not-null is encoded in definition levels * Only non-null values are stored, so when decoding to Arrow, values have to be "moved into place" The current C++ implementation could certainly be made faster. One consideration with Parquet is that the files are much smaller, so when you are reading them over the network the effective end-to-end time including IO and deserialization will frequently win. > Regards > > Antoine. > >
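To illustrate the definition-level and "moved into place" bullets above, a simplified sketch for a flat optional column (definition level 1 means non-null, 0 means null; Arrow's real validity buffer is a bitmap rather than one byte per row):

#include <cstddef>
#include <cstdint>
#include <vector>

// Expand Parquet-style definition levels plus densely packed non-null values
// into an Arrow-style layout with one slot per logical row.
void MoveIntoPlace(const std::vector<int16_t>& def_levels,
                   const std::vector<int32_t>& non_null_values,
                   std::vector<uint8_t>* validity,   // 1 byte per row for clarity
                   std::vector<int32_t>* values) {
  validity->assign(def_levels.size(), 0);
  values->assign(def_levels.size(), 0);
  size_t next = 0;  // cursor into the packed non-null values
  for (size_t row = 0; row < def_levels.size(); ++row) {
    if (def_levels[row] == 1) {
      (*validity)[row] = 1;
      (*values)[row] = non_null_values[next++];
    }
  }
}

The real decoder also has to strip the per-page Thrift headers, decompress each page and resolve dictionary indices before this step, which is where most of the extra decode cost comes from.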
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
On Fri, 12 Jul 2019 20:37:15 -0700 Micah Kornfield wrote: > > If the latter, I wonder why Parquet cannot simply be used instead of > > reinventing something similar but different. > > This is a reasonable point. However there is continuum here between file > size and read and write times. Parquet will likely always be the smallest > with the largest times to convert to and from Arrow. An uncompressed > Feather/Arrow file will likely always take the most space but will much > faster conversion times. I'm curious whether the Parquet conversion times are inherent to the Parquet format or due to inefficiencies in the implementation. Regards Antoine.
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
Hi Antoine, I think Liya Fan raised some good points in his reply but I'd like to answer your questions directly. > So the question is whether this really needs to be in the in-memory > format, i.e. is it desired to operate directly on this compressed > format, or is it solely for transport? I tried to separate the two concepts into Encodings (things Arrow can operate directly on) and Compression (solely for transport). While there is some overlap, I think the two features can be considered separately. For each encoding there is additional implementation complexity to properly exploit it. However, the benefit for some workloads can be large [1][2]. If the latter, I wonder why Parquet cannot simply be used instead of > reinventing something similar but different. This is a reasonable point. However, there is a continuum here between file size and read and write times. Parquet will likely always be the smallest with the largest times to convert to and from Arrow. An uncompressed Feather/Arrow file will likely always take the most space but will have much faster conversion times. The question is whether a buffer level or some other sub-file level compression scheme provides enough value compared with compressing the entire Feather file. This is somewhat hand-wavy, but if we feel we might want to investigate this further I can write some benchmarks to quantify the differences. Cheers, Micah [1] http://db.csail.mit.edu/projects/cstore/abadicidr07.pdf [2] http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf On Fri, Jul 12, 2019 at 2:24 AM Antoine Pitrou wrote: > > Le 12/07/2019 à 10:08, Micah Kornfield a écrit : > > OK, I've created a separate thread for data integrity/digests [1], and > > retitled this thread to continue the discussion on compression and > > encodings. As a reminder the PR for the format additions [2] suggested a > > new SparseRecordBatch that would allow for the following features: > > 1. Different data encodings at the Array (e.g. RLE) and Buffer levels > > (e.g. narrower bit-width integers) > > 2. Compression at the buffer level > > 3. Eliding all metadata and data for empty columns. > > So the question is whether this really needs to be in the in-memory > format, i.e. is it desired to operate directly on this compressed > format, or is it solely for transport? > > If the latter, I wonder why Parquet cannot simply be used instead of > reinventing something similar but different. > > Regards > > Antoine. >
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
@Antoine Pitrou, Good question. I think the answer depends on the concrete encoding scheme. For some encoding schemes, it is not a good idea to use them for in-memory data compression. For others, it is beneficial to operate directly on the compressed data. For example, it is beneficial to work directly on RLE data, with better locality and fewer cache misses. Best, Liya Fan On Fri, Jul 12, 2019 at 5:24 PM Antoine Pitrou wrote: > > Le 12/07/2019 à 10:08, Micah Kornfield a écrit : > > OK, I've created a separate thread for data integrity/digests [1], and > > retitled this thread to continue the discussion on compression and > > encodings. As a reminder the PR for the format additions [2] suggested a > > new SparseRecordBatch that would allow for the following features: > > 1. Different data encodings at the Array (e.g. RLE) and Buffer levels > > (e.g. narrower bit-width integers) > > 2. Compression at the buffer level > > 3. Eliding all metadata and data for empty columns. > > So the question is whether this really needs to be in the in-memory > format, i.e. is it desired to operate directly on this compressed > format, or is it solely for transport? > > If the latter, I wonder why Parquet cannot simply be used instead of > reinventing something similar but different. > > Regards > > Antoine. >
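As a toy illustration of operating directly on run-length-encoded data (plain Python, not the actual Arrow encoding under discussion), an aggregate can be computed from the runs without ever materializing the decoded values:

# (value, run length) pairs standing in for an RLE-encoded column
runs = [(5, 1000), (7, 250), (5, 4000)]

# One term per run instead of one term per element.
total = sum(value * length for value, length in runs)
count = sum(length for _, length in runs)
mean = total / count

# The same result computed over the fully decoded values touches every
# element instead of one entry per run.
decoded = [value for value, length in runs for _ in range(length)]
assert total == sum(decoded) and count == len(decoded)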
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
Le 12/07/2019 à 10:08, Micah Kornfield a écrit : > OK, I've created a separate thread for data integrity/digests [1], and > retitled this thread to continue the discussion on compression and > encodings. As a reminder the PR for the format additions [2] suggested a > new SparseRecordBatch that would allow for the following features: > 1. Different data encodings at the Array (e.g. RLE) and Buffer levels > (e.g. narrower bit-width integers) > 2. Compression at the buffer level > 3. Eliding all metadata and data for empty columns. So the question is whether this really needs to be in the in-memory format, i.e. is it desired to operate directly on this compressed format, or is it solely for transport? If the latter, I wonder why Parquet cannot simply be used instead of reinventing something similar but different. Regards Antoine.
[DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
OK, I've created a separate thread for data integrity/digests [1], and retitled this thread to continue the discussion on compression and encodings. As a reminder, the PR for the format additions [2] suggested a new SparseRecordBatch that would allow for the following features:
1. Different data encodings at the Array (e.g. RLE) and Buffer levels (e.g. narrower bit-width integers)
2. Compression at the buffer level
3. Eliding all metadata and data for empty columns.
To recap my understanding of the highlights of the discussion so far:
Encodings: There are some concerns over the efficiency of some of the encodings in different scenarios.
* Eliding null values makes many algorithms less efficient
* Joins might become harder with these encodings.
* The additional code complexity also came up on the Arrow sync call.
Compression:
- Buffer-level compression might be too small a granularity for data compression.
- General-purpose compression at this level might not add much value, so it might be better to keep it at the transport level.
Alternative designs:
* Put buffer-level compression in specific transports (e.g. Flight)
* Try to use the extension mechanism to support different encodings
Thanks, Micah
[1] https://lists.apache.org/thread.html/23c95508dcba432caa73253062520157346fad82fce9943ba6f681dd@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/4815
On Fri, Jul 12, 2019 at 12:15 AM Antoine Pitrou wrote: > > I think it would be worthwhile to split the discussion into two separate > threads. One thread for compression & encodings (which are related or > even the same topic), one thread for data integrity. > > Regards > > Antoine. > > > Le 08/07/2019 à 07:22, Micah Kornfield a écrit : > > > > - Compression: > >* Use parquet for random access to data elements. > >- This is one option, the main downside I see to this is > generally > > higher encoding/decoding costs. Per below, I think it is reasonable to > > wait until we have more data to add compression into the spec. > >* Have the transport layer do buffer specific compression: > > - I'm not a fan of this approach. One nice thing about the > current > > communication protocols is once you strip away "framing" data all the > byte > > streams are equivalent. I think the simplicity that follows in code from > > this is a nice feature. > > > > > > *Computational efficiency of array encodings:* > > > >> How does "more efficient computation" play out for operations such as > >> hash or join? > > > > You would still likely need to materialize rows in most > cases. In some > > "join" cases the sparse encoding of the null bitmap buffer could be a win > > because it serves as an index to non-null values. > > > > I think I should clarify that these encodings aren't always a win > depending > > on workload/data shape, but can have a large impact when used > appropriately > > (especially at the "Expression evaluation stage"). Also, any wins don't > > come for free; exploiting encodings properly will add some level of > > complication to existing computation code. > > > > On a packed sparse array representation: > > > >> This would be fine for simple SIMD aggregations like count/avg/mean, but > >> compacting null slots complicates more advanced parallel routines that > >> execute independently and rely on indices aligning with an element's > >> logical position. > > > > > > The main use-case I had in mind here was for scenarios like loading data > > directly from parquet (i.e.
nulls are already elided) doing some computation > and > > then potentially translating to a dense representation. Similarly it > > appears others have had an advantage in some contexts for saving time at > > shuffle [1]. In many cases there is an overlap with RLE, so I'd be open > to > > removing this from the proposal. > > > > *On buffer encodings:* > > To paraphrase, the main concern here seems to be that it is similar to > metadata > > that was already removed [2]. > > > > A few points on this: > > 1. There was a typo in the original e-mail on sparse-integer set > encoding > > where it said "all" values are either null or not null. This should have > > read "most" values. The elision of buffers is a separate feature. > > 2. I believe these are different from the previous metadata because this > > isn't repetitive information. It provides new information about the > > contents of buffers not available anywhere else. > > 3. The proposal is to create a new message type for this feature, so > it > > wouldn't be bringing back the old code and hopefully would have minimal > > impact on already existing IPC code. > > > > > > *On Compression:* > > So far my take is that the consensus is that this can probably be applied at > the > > transport level without being in the spec directly. There might be value > > in more specific types of compression at the buffer level, but we should > > benchmark them first. > > > > *Data Integrity/Digest:* > >>
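To ground the "compression at the buffer level" idea discussed above, here is a small sketch of compressing a single Arrow buffer on its own; it uses pyarrow's generic lz4 helpers purely as a stand-in and is not the mechanism proposed in the PR.

import pyarrow as pa

arr = pa.array(range(100_000), type=pa.int64())
values_buf = arr.buffers()[1]          # data buffer of the int64 array

# Compress the buffer independently; a transport could ship these bytes
# and rebuild an identical buffer on the receiving side.
compressed = pa.compress(values_buf, codec="lz4", asbytes=True)
restored = pa.decompress(compressed, decompressed_size=values_buf.size,
                         codec="lz4")

assert restored.to_pybytes() == values_buf.to_pybytes()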