Hi Evan,
> Hope everyone is staying safe!

Thanks, you too.

> A fairly substantial amount of CPU is needed for translating from
> Parquet; main memory bandwidth becomes a factor. Thus, it seems speed
> and constraining factors vary widely by application

I agree performance is going to be application dependent. As I said, there
is definitely a mix of concerns, but at its core Arrow made some
assumptions about constraining factors, and without evidence I think we
should stick to them. If the column encodings you are alluding to are
really lightweight, I'm not sure why they would be compute intensive to
decode if they were included in Parquet or another file format stored in
memory.

> - and having more encodings might extend the use of Arrow to wider
> scopes :)

Maybe. It seems you have something in mind; it might pay to take a step
back and describe your use-case in a little more depth. There are still
concerns about keeping reference implementations up-to-date with each
other, which is why I'd like to err on the side of simplicity and
generality.

> Perhaps we can have a discussion about what that demonstration would
> entail?

In my mind it would be showing the benefit of the encodings on standard
public datasets by comparing them with some micro-benchmarks: for
instance, some of the operations in the nascent C++ compute component
[1], or something based on common operations in a standard benchmark
suite (e.g. TPC-*). If there are standards in the domain you are
targeting, I think they would be useful to consider as well. (A rough
sketch of the kind of comparison I have in mind is below.)

> Also, would the Arrow community be open to some kind of “third party
> encoding” capability?

I originally had something like this in my proposal; I removed it after
some feedback, as there are justified concerns about it balkanizing the
standard. Depending on the exact use-case, extension arrays/additional
metadata might be a way of experimenting without adding anything to the
standard specifically (there's a sketch of that route below as well).
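To make that concrete, here is a rough, untested sketch (in Python, just
because it's compact) of the shape of comparison I'd find convincing.
Dictionary encoding and the count kernel are stand-ins here, since the
encodings you're proposing don't exist in the libraries yet; the dataset
and kernel are placeholders, not a claim about results.

    import time

    import pyarrow as pa
    import pyarrow.compute as pc

    # The same logical column twice: plain strings vs. dictionary-encoded
    # (the latter standing in for whichever encoding is under discussion).
    plain = pa.array(["red", "green", "blue", "green"] * 250_000)
    encoded = plain.dictionary_encode()

    print("plain bytes:  ", plain.nbytes)
    print("encoded bytes:", encoded.nbytes)

    def bench(arr, reps=20):
        start = time.perf_counter()
        for _ in range(reps):
            pc.count(arr)  # substitute the kernel(s) you actually care about
        return (time.perf_counter() - start) / reps

    print("plain seconds/op:  ", bench(plain))
    print("encoded seconds/op:", bench(encoded))

The real version would run over public datasets (or a TPC-* derived
workload) and cover the kernels where the encoding is supposed to help,
but the shape would be the same: memory footprint plus per-kernel timings,
encoded vs. not.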
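And to illustrate the extension-type route, here is a sketch of wrapping a
hypothetical run-length encoding behind a Python ExtensionType. The
"vendor.run_length" name and the struct storage layout are made up purely
for illustration; nothing here is part of the spec, which is the point:
the encoded data travels as an ordinary storage array plus metadata, so
implementations that don't recognize it still work.

    import pyarrow as pa

    class RunLengthType(pa.ExtensionType):
        # Hypothetical run-length encoding stored as (value, run_length) pairs.
        def __init__(self):
            storage = pa.struct([("value", pa.int64()), ("run_length", pa.int32())])
            super().__init__(storage, "vendor.run_length")

        def __arrow_ext_serialize__(self):
            return b""  # no type parameters to carry

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return cls()

    pa.register_extension_type(RunLengthType())

    # Readers that don't know "vendor.run_length" just see a plain struct array.
    storage = pa.array(
        [{"value": 7, "run_length": 3}, {"value": 0, "run_length": 5}],
        type=RunLengthType().storage_type,
    )
    arr = pa.ExtensionArray.from_storage(RunLengthType(), storage)

Something along these lines (or key/value metadata on the field) would let
you experiment end-to-end, including over IPC/Flight, without any change
to the columnar format itself.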
[1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/operations

On Tue, Mar 24, 2020 at 2:54 PM Evan Chan <evan_c...@apple.com> wrote:

> Hi Micah,
>
> Hope everyone is staying safe!
>
> > On Mar 16, 2020, at 9:41 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > I feel a little uncomfortable in the fact that there isn't a more
> > clearly defined dividing line for what belongs in Arrow and what
> > doesn't. I suppose this is what discussions like these are about :)
> >
> > In my mind Arrow has two primary goals:
> > 1. Wide accessibility
> > 2. Speed, primarily with the assumption that CPU time is the
> > constraining resource. Some of the technologies in the Arrow ecosystem
> > (e.g. Feather, Flight) have blurred this line a little bit. In some
> > cases these technologies are dominated by other constraints, as seen
> > in Wes's compression proposal [1].
>
> If we only look at the narrow scope of iterating over a single Arrow
> array in memory, perhaps CPU is the dominant constraint (though even
> there, I would expect L1/L2 cache to be fairly important). Once we
> expand the scope wider and wider, for example to a large volume of
> data, or loading and translating data from Parquet and disk, the
> factors become much more complex. A fairly substantial amount of CPU is
> needed for translating from Parquet; main memory bandwidth becomes a
> factor. Thus, it seems speed and constraining factors vary widely by
> application - and having more encodings might extend the use of Arrow
> to wider scopes :)
>
> > Given these points, adding complexity and constant factors, at least
> > for an initial implementation, are something that needs to be looked
> > at very carefully. I would likely change my mind if there was a
> > demonstration that the complexity adds substantial value across a
> > variety of data sets/workloads beyond simpler versions.
>
> Perhaps we can have a discussion about what that demonstration would
> entail?
>
> Also, would the Arrow community be open to some kind of “third party
> encoding” capability? It would facilitate experimentation by others, in
> real world scenarios, and if those use cases and workloads prove to be
> useful to others, perhaps the community could then consider adopting
> them more widely?
>
> Cheers,
> Evan