Hi Evan,

> Hope everyone is staying safe!

Thanks, you too.

> A fairly substantial amount of CPU is needed for translating from Parquet;
> main memory bandwidth becomes a factor.  Thus, it seems speed and
> constraining factors varies widely by application

I agree performance is going to be application dependent.  As I said, there
is definitely a mix of concerns, but at its core Arrow made some assumptions
about the constraining factors.  Without evidence to the contrary, I think
we should stick to them.

If the column encodings you are alluding to are really lightweight, I'm not
sure why they would be compute-intensive to decode if they were included in
Parquet or another file format stored in memory.

> - and having more encodings might extend the use of Arrow to wider scopes
> :)

Maybe. It seems you have something in mind, so it might pay to take a step
back and describe your use case in a little more depth.  There are still
concerns about keeping the reference implementations in sync with each
other, which is why I'd like to err on the side of simplicity and
generality.

> Perhaps we can have a discussion about what that demonstration would entail?

In my mind it would entail showing the benefit of the encodings on standard
public datasets with some micro-benchmarks: for instance, some of the
operations in the nascent C++ compute component [1], or common operations
from a standard benchmark suite (e.g. TPC-*).  If there are established
benchmarks in the domain you are targeting, I think they would be worth
considering as well.
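
To make that concrete, here is a very rough sketch (in pyarrow, with made-up
column contents and sizes) of the kind of comparison I mean: the in-memory
footprint and a crude decode cost of a dictionary-encoded column versus its
plain equivalent.  A real evaluation would use the C++ benchmark harness and
whatever encodings you actually have in mind.

import time

import pyarrow as pa

# Hypothetical low-cardinality column, a stand-in for a TPC-H-style status flag.
values = ["SHIPPED", "PENDING", "RETURNED"] * 1_000_000
plain = pa.array(values, type=pa.string())
encoded = plain.dictionary_encode()

print("plain bytes:  ", plain.nbytes)
print("encoded bytes:", encoded.nbytes)

# Crude single-pass decode cost; assumes a pyarrow version that supports
# casting a dictionary array back to its value type.
start = time.perf_counter()
_ = encoded.cast(pa.string())
print("decode seconds:", time.perf_counter() - start)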

> Also, would the Arrow community be open to some kind of “third party
> encoding” capability?

I originally had something like this in my proposal, but I removed it after
some feedback.  There are justified concerns about this balkanizing the
standard.  Depending on the exact use case, extension arrays or additional
metadata might be a way of experimenting without adding anything to the
standard itself.
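
To illustrate the extension-type route, here is a sketch in pyarrow of a
hypothetical frame-of-reference encoding carried as an extension type over a
plain int32 storage array.  The encoding, the names, and the metadata layout
are all made up; only pa.ExtensionType itself is an existing Arrow API.  The
point is that implementations unaware of the extension still see valid Arrow
data, so nothing in the format specification has to change.

import pyarrow as pa

class FrameOfReferenceType(pa.ExtensionType):
    """int64 logical values stored as int32 deltas from a base value."""

    def __init__(self, base):
        self._base = base
        pa.ExtensionType.__init__(self, pa.int32(),
                                  "experimental.frame_of_reference")

    def __arrow_ext_serialize__(self):
        # Persist the base so the type survives IPC round-trips
        # (after registering it with pa.register_extension_type).
        return str(self._base).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(int(serialized.decode()))

base = 1_000_000_000
deltas = pa.array([0, 3, 7, 12], type=pa.int32())
arr = pa.ExtensionArray.from_storage(FrameOfReferenceType(base), deltas)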

[1]
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/operations

On Tue, Mar 24, 2020 at 2:54 PM Evan Chan <evan_c...@apple.com> wrote:

> Hi Micah,
>
> Hope everyone is staying safe!
>
> > On Mar 16, 2020, at 9:41 PM, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > I feel a little uncomfortable in the fact that there isn't a more
> clearly defined dividing line for what belongs in Arrow and what doesn't.
> I suppose this is what discussions like these are about :)
> >
> > In my mind Arrow has two primary goals:
> > 1.  Wide accessibility
> > 2.  Speed, primarily with the assumption that CPU time is the
> constraining resource.  Some of the technologies in the Arrow ecosystem
> (e.g. Feather, Flight) have blurred this line a little bit. In some cases
> these technologies are dominated by other constraints as seen by Wes's
> compression proposal [1].
>
> If we only look at the narrow scope of iterating over a single Arrow array
> in memory, perhaps CPU is the dominant constraint (though even there, I
> would expect L1/L2 cache to be fairly important).  Once we expand the scope
> wider and wider…. For example a large volume of data, loading and
> translating data from Parquet and disk, etc. etc., then the factors become
> much more complex.  A fairly substantial amount of CPU is needed for
> translating from Parquet; main memory bandwidth becomes a factor.  Thus, it
> seems speed and constraining factors varies widely by application - and
> having more encodings might extend the use of Arrow to wider scopes  :)
>
> >
> > Given these points, adding complexity and constant factors, at least for
> an initial implementation, are something that needs to be looked at very
> carefully.  I would likely change my mind if there was a demonstration that
> the complexity adds substantial value across a variety of data
> sets/workloads beyond simpler versions.
>
> Perhaps we can have a discussion about what that demonstration would
> entail?
> Also, would the Arrow community be open to some kind of “third party
> encoding” capability?  It would facilitate experimentation by others, in
> real world scenarios, and if those use cases and workloads prove to be
> useful to others, perhaps the community could then consider adopting them
> more widely?
>
> Cheers,
> Evan
