BTW the closest thing there would be in the spec for achieving this is Dense integers, but unfortunately, those have a 5 byte overhead per slot. So you would need very few integers of 8 bytes and a lot of integers that fit into either 1 or 2 bytes to get any space savings.
On Wed, Mar 31, 2021 at 11:04 AM bobtins <bobti...@gmail.com> wrote: > I appreciate the feedback. I realize it's a tricky nut to crack; there's > always going to be a desire to use compression to improve scaling, and I > was trying to identify a connecting thread between various requests for > compression enhancements on this list and my own experience. I'll look at > the spec again and put it on my back burner. > > On 2021/03/31 04:03:07, Micah Kornfield <emkornfi...@gmail.com> wrote: > > Hi Bob, > > > > > > > I can observe that in a project like Arrow, there is always a tension > > > between compatibility and extensibility, and it makes me wonder if it > would > > > be helpful to add capabilities without changing the spec. An extension > type > > > can be defined in terms of one of the built-in layouts, but it would > define > > > semantics (such as compression) that would be used to interpret that > layout. > > > > > > I'm not sure if this is referring to existing extension types [1] but I > > believe they are insufficient for this purpose. The compression > > techniques being discussed don't work well, because compression violates > > the fundamental assumptions of the existing protocols. Each array is > > expected to have an equal number of slots. So an array compressed as a > > struct, would cause misalignment with non-encoded arrays. > > > > For example, integers are stored in blocks of 4096 values, with each > block > > > the minimum size to hold all the values. You access the value n with > the > > > expression "block[n >> 12][n % 4096]". > > > Take an example with 1M int32 values. Value 0 is 1e9 but all the others > > > are 0 to 9. > > > Normally you would use 4M bytes to store these, but you could instead > have > > > 1 block of int32 (16k) and 255 blocks of int8 (1020k) plus 1K storage > for > > > block offsets, so a savings of almost 75%. If you could have uint4 > blocks > > > you could save about 87%. > > > > > > This is difficult with the existing RecordBatch stream approach since > > schema is fixed ahead of time. One could theoretically standardize a > > notion of schema replacement in communications. The I linked takes a > > different approach and allows for adjusting encodings on a per message > > basis. Both are potentially viable. > > > > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types > > > > On Tue, Mar 30, 2021 at 5:09 PM bobtins <bobti...@gmail.com> wrote: > > > > > From your response, I'm inferring that in order to introduce this kind > of > > > compression, support in the spec is needed, similar to how compression > > > types and parameters are enumerated in > > > https://github.com/apache/arrow/blob/master/format/Message.fbs. Any > > > change in the spec is a Big Deal (and it should be). > > > > > > I can observe that in a project like Arrow, there is always a tension > > > between compatibility and extensibility, and it makes me wonder if it > would > > > be helpful to add capabilities without changing the spec. An extension > type > > > can be defined in terms of one of the built-in layouts, but it would > define > > > semantics (such as compression) that would be used to interpret that > > > layout. > > > > > > > > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão < > > > > > > jorgecarlei...@gmail.com> wrote: > > > > > > > > > > > > > Would it be an option to use a StructArray for that? One array > > > with the > > > > > > > values, and one with the repetitions: > > > > > > > > > > > > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) -> > > > > > > > > > > > > > > StructArray([ > > > > > > > "values": Int32([1, 2, 3, 1, 2]), > > > > > > > "repetitions": UInt32([1, 3, 5, 1, 1]), > > > > > > > ]) > > > > > > > > > > > > > > It does not have the same API, but I think that the physical > > > operations > > > > > > > would be different, anyways: ("array + 2" would only operate on > > > > > > "values"). > > > > > > > I think that a small struct / object with some operator > overloading > > > > > would > > > > > > > address this, and writing something on the metadata would allow > > > others > > > > > to > > > > > > > consume it, a-la extension type? > > > > > > > > > > > > > > On a related note, such encoding would address DataFusion's > issue > > > of > > > > > > > representing scalars / constant arrays: a constant array would > be > > > > > > > represented as a repetition. Currently we just unpack (i.e. > > > allocate) a > > > > > > > constant array when we want to transfer through a RecordBatch. > > > > > > > > > > > > > I just reread the whole thread, and realized Jorge was saying a similar > > > thing, that this new type could be built on an existing layout. But I > guess > > > I'm also imagining some more general capabilities: > > > > > > * Define an extension type in terms of an existing layout > > > * Register an extension type implementation > > > * Enumerate available extension types > > > > > > It would make life more complicated, for sure, but it would allow > things > > > like compression to evolve more quickly. I'm thinking about the > in-memory > > > implementation that I built, where I did various things to save memory. > > > > > > For example, integers are stored in blocks of 4096 values, with each > block > > > the minimum size to hold all the values. You access the value n with > the > > > expression "block[n >> 12][n % 4096]". > > > Take an example with 1M int32 values. Value 0 is 1e9 but all the others > > > are 0 to 9. > > > Normally you would use 4M bytes to store these, but you could instead > have > > > 1 block of int32 (16k) and 255 blocks of int8 (1020k) plus 1K storage > for > > > block offsets, so a savings of almost 75%. If you could have uint4 > blocks > > > you could save about 87%. > > > > > > On the other hand, I had the luxury of total control over the > > > implementation, so it was easy for me to try something and make it > work. > > > For Arrow, having the memory layout standardized makes it easy to > implement > > > various SIMD and GPU optimizations, which all go out the window if you > > > allow arbitrary new data access semantics. > > > > > > So if this idea is hopelessly naive and half-baked and would result in > a > > > train wreck, please feel free to enlighten me/rip me a new one--I'm > here to > > > learn. > > > > > > > > > > > > On 2021/03/27 17:35:32, Wes McKinney <wesmck...@gmail.com> wrote: > > > > I’ve also heard interest from folks in the academic database > community > > > > about adding compressed (sparse) in memory structures to the Arrow > > > format / > > > > specification, so I think it makes more sense to try to figure > things out > > > > at the specification / protocol level and then work on an > > > implementation. I > > > > agree this seems above and beyond what I would think an intern could > > > > accomplish in a 10-12 week period given the process that has been > > > involved > > > > with other significant additions to Arrow over the last several > years. > > > > > > > > On Sat, Mar 27, 2021 at 9:40 AM Kirill Lykov <lykov.kir...@gmail.com > > > > > wrote: > > > > > > > > > Thanks for the information and ideas, I need to check them out > > > (especially > > > > > one with structures). > > > > > PR proposal for RLE is very interesting since internally people > express > > > > > interest in this feature. > > > > > For intern, I thought to ask to work primarily on data structures > level > > > > > (like array adapter or something like that). > > > > > So I haven't thought about communication layer, but it is a useful > > > feature > > > > > per se. > > > > > However it might have limited value in terms of contribution to > Arrow > > > and, > > > > > hence, not that attractive for an intern. > > > > > > > > > > On Sat, Mar 27, 2021 at 12:50 AM Micah Kornfield < > > > emkornfi...@gmail.com> > > > > > wrote: > > > > > > > > > > > I made a proposal a while ago that covers a form of RLE encoding > > > [1]. I > > > > > > haven't had time to work on it, since it is a substantial effort > to > > > > > > implement. > > > > > > > > > > > > I wouldn't expect an intern to be able to complete the work > > > necessary to > > > > > > get this merged over the course of a normal 3 month internship. > > > > > > > > > > > > [1] https://github.com/apache/arrow/pull/4815/files > > > > > > > > > > > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão < > > > > > > jorgecarlei...@gmail.com> wrote: > > > > > > > > > > > > > Would it be an option to use a StructArray for that? One array > > > with the > > > > > > > values, and one with the repetitions: > > > > > > > > > > > > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) -> > > > > > > > > > > > > > > StructArray([ > > > > > > > "values": Int32([1, 2, 3, 1, 2]), > > > > > > > "repetitions": UInt32([1, 3, 5, 1, 1]), > > > > > > > ]) > > > > > > > > > > > > > > It does not have the same API, but I think that the physical > > > operations > > > > > > > would be different, anyways: ("array + 2" would only operate on > > > > > > "values"). > > > > > > > I think that a small struct / object with some operator > overloading > > > > > would > > > > > > > address this, and writing something on the metadata would allow > > > others > > > > > to > > > > > > > consume it, a-la extension type? > > > > > > > > > > > > > > On a related note, such encoding would address DataFusion's > issue > > > of > > > > > > > representing scalars / constant arrays: a constant array would > be > > > > > > > represented as a repetition. Currently we just unpack (i.e. > > > allocate) a > > > > > > > constant array when we want to transfer through a RecordBatch. > > > > > > > > > > > > > > Best, > > > > > > > Jorge > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Mar 25, 2021, 10:03 Kirill Lykov < > lykov.kir...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > Thanks for the answer. > > > > > > > > I asked about it because we need it and I was about writing a > > > summer > > > > > > > intern > > > > > > > > proposal for a student to work on it. > > > > > > > > Looks like it could work fine. > > > > > > > > > > > > > > > > On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney < > > > wesmck...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > The SparseTensor stuff is something else entirely (that's > > > matrices > > > > > > > > > where the entries are mostly 0) > > > > > > > > > > > > > > > > > > There isn't anything to help you right now aside from > > > dictionary > > > > > > > > > encoding — if your dictionary has 256 elements or less, you > > > can use > > > > > > > > > uint8 index type and thus have 1 byte per value. We've > > > discussed > > > > > > > > > implementing RLE in the project and so if we do that in the > > > future > > > > > > > > > then a random access data structure could be built on top > of > > > RLE > > > > > (in > > > > > > > > > principle) > > > > > > > > > > > > > > > > > > On Wed, Mar 24, 2021 at 8:53 AM Niranda Perera < > > > > > > > niranda.per...@gmail.com > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > Hi Lykov, > > > > > > > > > > > > > > > > > > > > I believe there's an arrow sparse tensor abstraction. > > > > > > > > > > > > > > > > > > > > On Wed, Mar 24, 2021, 05:05 Kirill Lykov < > > > lykov.kir...@gmail.com > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > I wonder if there is an existing way to store > floats/ints > > > with > > > > > > many > > > > > > > > > > > repetitions in some container (not sure about > terminology). > > > > > > > > > > > For example, I might have data like A=[1, 2, 2, 2, 3, > 3, > > > 3, 3, > > > > > 3, > > > > > > > 3, > > > > > > > > > 1, 2] > > > > > > > > > > > and i would like to store only B=[1, 2, 3, 1, 2] but > from > > > user > > > > > > > > > > > perspective it behaves like container A. I know I can > use > > > > > > > dictionary > > > > > > > > > but as > > > > > > > > > > > far I understand it will store internally indices of > the > > > chosen > > > > > > > > > elements so > > > > > > > > > > > it makes sense more for binary data or structures. > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Best regards, > > > > > > > > > > > Kirill Lykov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Best regards, > > > > > > > > Kirill Lykov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Best regards, > > > > > Kirill Lykov > > > > > > > > > > > > > > >