Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-08-21 Thread Felipe Oliveira Carvalho
I marked the C++ implementation PR ready for review today and will soon be
working on the Go implementation.

https://github.com/apache/arrow/pull/35345

Note that differently from Velox's ArrayVector, the Arrow implementation
(ListView) also features a 64-bit version (LargeListView) to be
symmetrical with the existing List and LargeList types.

--
Felipe

On Tue, Aug 22, 2023 at 12:09 AM Andrew Lamb  wrote:

> I am convinced that the benefits of the ArrayViews as discussed on this
> thread, despite the inconvenience of two similar formats (ListArray and
> ArrayView) and equivalent formats, are enough to add it to the Arrow spec.
>
> It is my opinion we should add ArrowView to the Arrow format, and really
> push forward the notion of composable data management. I think it is a
> compelling vision and one Arrow can very much help to create.
>
> Thank you for your well worded summary Pedro and for everyone else who has
> shared their opinion on this issue.
>
> Andrew
>
> On Thu, Aug 17, 2023 at 11:58 AM Pedro Eugenio Rocha Pedreira
>  wrote:
>
> > Hi all,
> >
> > Getting back to this thread as I realize there were a few unanswered
> > questions. Adding a bit more context on the rationale and usage of
> > ArrayViews in Velox, and the importance to standardize it:
> >
> > re: Why do we need it?
> >
> > We use ArrayViews for two main reasons. First, for efficient execution of
> > conditionals (IF/SWITCHs) that generate list types (array/maps). In a
> > vectorized engine, there are two ways to execute conditionals. One will
> > always first generate the condition bitmap that decides which branch each
> > row will take, otherwise you would pay the dispatch cost for each row.
> Once
> > the bitmap is generated, without ArrayView, one would write each branch
> to
> > its own buffer output (one containing the "then" rows, and one with the
> > "else" rows), then at the end conduct an additional step to merge these
> two
> > into a third output buffer, taking the condition bitmap into account.
> This
> > is analogous to a "scatter" operation.
> >
> > With ArrayViews one can write them to their final location with a single
> > pass over each branch, removing the need for the final step. This is
> always
> > faster as it removes the need for a last step. As raised before, the
> actual
> > perf improvement will depend on the data types, their average sizes, cost
> > of evaluating conditions, branches, buffer sizes and many other factors.
> > One can play with these factors and tweak a microbenchmark to make it
> more
> > of more or less evident (as convenient), but the fact is that it will
> > always be faster. We have found extensive empirical data in real
> workloads
> > to support these claims. We have found that it is common to find
> workloads
> > with large nested conditional statements rearranging maps for feature
> > engineering workloads. Often, they happen right after table scans, so the
> > additional copies and cache misses can really add up.
> >
> > The second reason is convenience for slicing array/maps. One can think of
> > array_slice(), array_trim(), and similar functions (and their map
> > counterparts). With ArrayViews as output they can be implemented
> zero-copy
> > by only adjusting the offsets and lengths, without ever touching (or even
> > having to decode) the base elements. This is also faster on every case.
> >
> > For these reasons, I would argue that the ArrayView format is the
> > state-of-the-art representation for any modern vectorized query engine
> that
> > really cares about performance (like Velox, DuckDB, and similar). If
> > there's no Arrow counterpart for it, they (we) will continue diverging
> from
> > Arrow, and we might end up missing the mark of standardization.
> >
> > re: Where do we need it?
> >
> > An interesting point raised is that this could be seen as an
> > implementation detail inside these engines, and that we might not need it
> > at the engine boundary. While this is a fair way of looking at it, we've
> > found two use cases where this presents a hurdle:
> >
> > First, in the Gluten project that enables Velox to be integrated within
> > Spark, for example, the communication between Velox and Spark java is
> done
> > via JNI. In this case, the very data coming from a table scan +
> projection,
> > which in many cases will be represented using an ArrayView in Velox,
> needs
> > to be efficiently shipped to the Java-land. Without ArrayView support,
> this
> > communication will not be done via Arrow as we would rather specialize
> than
> > incur in an extra copy. The same for other bridges between systems, like
> a
> > DuckDB<->Velox present in Velox, which for the same reason (in addition
> to
> > historical lack of StringView) is not built using Arrow today.
> >
> > The second, as mentioned by Will, is UDF support. For these use cases, it
> > would be very convenient to support UDF packages in different languages
> > written based on the Arrow format. However, that 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-08-21 Thread Andrew Lamb
I am convinced that the benefits of the ArrayViews as discussed on this
thread, despite the inconvenience of two similar formats (ListArray and
ArrayView) and equivalent formats, are enough to add it to the Arrow spec.

It is my opinion we should add ArrowView to the Arrow format, and really
push forward the notion of composable data management. I think it is a
compelling vision and one Arrow can very much help to create.

Thank you for your well worded summary Pedro and for everyone else who has
shared their opinion on this issue.

Andrew

On Thu, Aug 17, 2023 at 11:58 AM Pedro Eugenio Rocha Pedreira
 wrote:

> Hi all,
>
> Getting back to this thread as I realize there were a few unanswered
> questions. Adding a bit more context on the rationale and usage of
> ArrayViews in Velox, and the importance to standardize it:
>
> re: Why do we need it?
>
> We use ArrayViews for two main reasons. First, for efficient execution of
> conditionals (IF/SWITCHs) that generate list types (array/maps). In a
> vectorized engine, there are two ways to execute conditionals. One will
> always first generate the condition bitmap that decides which branch each
> row will take, otherwise you would pay the dispatch cost for each row. Once
> the bitmap is generated, without ArrayView, one would write each branch to
> its own buffer output (one containing the "then" rows, and one with the
> "else" rows), then at the end conduct an additional step to merge these two
> into a third output buffer, taking the condition bitmap into account. This
> is analogous to a "scatter" operation.
>
> With ArrayViews one can write them to their final location with a single
> pass over each branch, removing the need for the final step. This is always
> faster as it removes the need for a last step. As raised before, the actual
> perf improvement will depend on the data types, their average sizes, cost
> of evaluating conditions, branches, buffer sizes and many other factors.
> One can play with these factors and tweak a microbenchmark to make it more
> of more or less evident (as convenient), but the fact is that it will
> always be faster. We have found extensive empirical data in real workloads
> to support these claims. We have found that it is common to find workloads
> with large nested conditional statements rearranging maps for feature
> engineering workloads. Often, they happen right after table scans, so the
> additional copies and cache misses can really add up.
>
> The second reason is convenience for slicing array/maps. One can think of
> array_slice(), array_trim(), and similar functions (and their map
> counterparts). With ArrayViews as output they can be implemented zero-copy
> by only adjusting the offsets and lengths, without ever touching (or even
> having to decode) the base elements. This is also faster on every case.
>
> For these reasons, I would argue that the ArrayView format is the
> state-of-the-art representation for any modern vectorized query engine that
> really cares about performance (like Velox, DuckDB, and similar). If
> there's no Arrow counterpart for it, they (we) will continue diverging from
> Arrow, and we might end up missing the mark of standardization.
>
> re: Where do we need it?
>
> An interesting point raised is that this could be seen as an
> implementation detail inside these engines, and that we might not need it
> at the engine boundary. While this is a fair way of looking at it, we've
> found two use cases where this presents a hurdle:
>
> First, in the Gluten project that enables Velox to be integrated within
> Spark, for example, the communication between Velox and Spark java is done
> via JNI. In this case, the very data coming from a table scan + projection,
> which in many cases will be represented using an ArrayView in Velox, needs
> to be efficiently shipped to the Java-land. Without ArrayView support, this
> communication will not be done via Arrow as we would rather specialize than
> incur in an extra copy. The same for other bridges between systems, like a
> DuckDB<->Velox present in Velox, which for the same reason (in addition to
> historical lack of StringView) is not built using Arrow today.
>
> The second, as mentioned by Will, is UDF support. For these use cases, it
> would be very convenient to support UDF packages in different languages
> written based on the Arrow format. However, that means again that data
> coming from a Velox plan could be using the ArrayView format. Once more,
> requiring an extra copy on that step will end up pushing developers to
> build a more specialized API instead.
>
> In general, I recognize the need for stability and simplicity in the
> format, but wonder if inflexibility given the evolving state-of-the-art
> could lead to further divergence and obsolescence. Weston's proposal of
> canonical alternative layouts seems like a fine way to avoid this pitfall,
> IMHO.
>
> re: Which libraries need it?
>
> One last point to be made is that though it is 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-08-17 Thread Pedro Eugenio Rocha Pedreira
Hi all,

Getting back to this thread as I realize there were a few unanswered questions. 
Adding a bit more context on the rationale and usage of ArrayViews in Velox, 
and the importance to standardize it:

re: Why do we need it?

We use ArrayViews for two main reasons. First, for efficient execution of 
conditionals (IF/SWITCHs) that generate list types (array/maps). In a 
vectorized engine, there are two ways to execute conditionals. One will always 
first generate the condition bitmap that decides which branch each row will 
take, otherwise you would pay the dispatch cost for each row. Once the bitmap 
is generated, without ArrayView, one would write each branch to its own buffer 
output (one containing the "then" rows, and one with the "else" rows), then at 
the end conduct an additional step to merge these two into a third output 
buffer, taking the condition bitmap into account. This is analogous to a 
"scatter" operation.

With ArrayViews one can write them to their final location with a single pass 
over each branch, removing the need for the final step. This is always faster 
as it removes the need for a last step. As raised before, the actual perf 
improvement will depend on the data types, their average sizes, cost of 
evaluating conditions, branches, buffer sizes and many other factors. One can 
play with these factors and tweak a microbenchmark to make it more of more or 
less evident (as convenient), but the fact is that it will always be faster. We 
have found extensive empirical data in real workloads to support these claims. 
We have found that it is common to find workloads with large nested conditional 
statements rearranging maps for feature engineering workloads. Often, they 
happen right after table scans, so the additional copies and cache misses can 
really add up.

The second reason is convenience for slicing array/maps. One can think of 
array_slice(), array_trim(), and similar functions (and their map 
counterparts). With ArrayViews as output they can be implemented zero-copy by 
only adjusting the offsets and lengths, without ever touching (or even having 
to decode) the base elements. This is also faster on every case.

For these reasons, I would argue that the ArrayView format is the 
state-of-the-art representation for any modern vectorized query engine that 
really cares about performance (like Velox, DuckDB, and similar). If there's no 
Arrow counterpart for it, they (we) will continue diverging from Arrow, and we 
might end up missing the mark of standardization.

re: Where do we need it?

An interesting point raised is that this could be seen as an implementation 
detail inside these engines, and that we might not need it at the engine 
boundary. While this is a fair way of looking at it, we've found two use cases 
where this presents a hurdle:

First, in the Gluten project that enables Velox to be integrated within Spark, 
for example, the communication between Velox and Spark java is done via JNI. In 
this case, the very data coming from a table scan + projection, which in many 
cases will be represented using an ArrayView in Velox, needs to be efficiently 
shipped to the Java-land. Without ArrayView support, this communication will 
not be done via Arrow as we would rather specialize than incur in an extra 
copy. The same for other bridges between systems, like a DuckDB<->Velox present 
in Velox, which for the same reason (in addition to historical lack of 
StringView) is not built using Arrow today.

The second, as mentioned by Will, is UDF support. For these use cases, it would 
be very convenient to support UDF packages in different languages written based 
on the Arrow format. However, that means again that data coming from a Velox 
plan could be using the ArrayView format. Once more, requiring an extra copy on 
that step will end up pushing developers to build a more specialized API 
instead.

In general, I recognize the need for stability and simplicity in the format, 
but wonder if inflexibility given the evolving state-of-the-art could lead to 
further divergence and obsolescence. Weston's proposal of canonical alternative 
layouts seems like a fine way to avoid this pitfall, IMHO.

re: Which libraries need it?

One last point to be made is that though it is useful to have ArrayView support 
in the Arrow C++ library, most of the systems I described do not use it and 
rather have their own implementation. The most important piece, in our view, is 
the unified and agreed upon memory layout, and a way to push this through the 
stable C ABI.

re: why not dictionaries?

The suggested dictionary over variable-sized arrays approach can be used 
out-of-the-box and indeed gives you some flexibility, but there are two 
shortcomings: (a) like Felipe pointed out, it incurs in an extra indirection 
(cache miss) in order to fetch the array, (b) it still doesn't provide the same 
flexibility to slice the array/map (you can reorder arrays, but not slice them).

I hope this 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Jacob Wujciak-Jens
> Even if ListView is rarely used for interoperability (if it never gains
wide adoption), some of the arrow implementations could use ListView to
offer faster computation kernels, which I think has real value

This is an important point, thanks for the clear phrasing Andrew!

On Thu, Jun 15, 2023 at 5:32 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> On Wed, Jun 14, 2023 at 5:07 PM Raphael Taylor-Davies
>  wrote:
>
> > Even something relatively straightforward becomes a huge implementation
> > effort when multiplied by a large number of codebases, users and
> > datasets. Parquet is a great source of historical examples of the
> > challenges of incremental changes that don't meaningfully unlock new
> > use-cases. To take just one, Int96 was deprecated almost a decade ago,
> > in favour of some additional metadata over an existing physical layout,
> > and yet Int96 is still to the best of my knowledge used by Spark by
> > default.
> >
> > That's not to say that I think the arrow specification should ossify and
> > we should never change it, but I'm not hugely enthusiastic about adding
> > encodings that are only incremental improvements over existing encodings.
> >
> > I therefore wonder if there are some new use-cases I am missing that
> > would be unlocked by this change, and that wouldn't be supported by the
> > dictionary proposal? Perhaps you could elaborate here?
>
>
> The dict-encoded ListArray would add a memory indirection: the slot in the
> dictionary has to be resolved before the offset can be. With ListView, the
> CPU can access offset and size without waiting on that memory access. Both
> need two buffer accesses after the validity check, but ListView doesn't
> have that dependency.
>
>
> > Whilst I do agree
> > using dictionaries as proposed is perhaps a less elegant solution, I
> > don't see anything inherently wrong with it, and if it ain't broke we
> > really shouldn't be trying to fix it.
> >
> > Kind Regards,
> >
> > Raphael Taylor-Davies
> >
> > On 14 June 2023 17:52:52 BST, Felipe Oliveira Carvalho
> >  wrote:
> >
> > General approach to alternative formats aside, in the specific case
> > of ListView, I think the implementation complexity is being
> > overestimated in these discussions. The C++ Arrow implementation
> > shares a lot of code between List and LargeList. And with some
> > tweaks, I'm able to share that common infrastructure for ListView as
> > well. [1] ListView is similar to list: it doesn't require offsets to
> > be sorted and adds an extra buffer containing sizes. For symmetry
> > with the List and LargeList types (FixedSizeList not included), I'm
> > going to propose we add a LargeListView. That is not part of the
> > draft implementation yet, but seems like an obvious thing to have
> > now that I implemented the `if_else` specialization. [2] David Li
> > asked about this above and I can confirm now that 64-bit version of
> > ListView (LargeListView) is in the plans. Trying to avoid
> > re-implementing some kernels is not a good goal to chase, IMO,
> > because kernels need tweaks to take advantage of the format. [1]
> > https://github.com/apache/arrow/pull/35345 [2]
> > https://github.com/felipecrv/arrow/commits/list_view_backup --
> > Felipe On Wed, Jun 14, 2023 at 12:08 PM Weston Pace
> >  wrote:
> >
> > perhaps we could support this use-case as a canonical
> > extension type over dictionary encoded, variable-sized arrays
> >
> > I believe this suggestion is valid and could be used to solve
> > the if-else case. The algorithm, if I understand it, would be
> > roughly: ``` // Note: Simple pseudocode, vectorization left as
> > exercise for the reader auto indices_builder = ... auto
> > list_builder = ... indices_builder.resize(batch.length); Array
> > condition_mask = condition.EvaluateBatch(batch); for row_index
> > in selected_rows(condition_mask): indices_builder[row_index] =
> > list_builder.CurrentLength();
> > list_builder.Append(if_expr.EvaluateRow(batch, row_index)) for
> > row_index in unselected_rows(condition_mask):
> > indices_builder[row_index] = list_builder.CurrentLength();
> > list_builder.Append(else_expr.EvaluateRow(batch, row_index))
> > return DictionaryArray(indices_builder.Finish(),
> > list_builder.Finish()) ``` I also agree this is a slightly
> > awkward use of dictionaries (e.g. the dictionary would have the
> > same size as the # of indices) and perhaps not the most
> > intuitive way to solve the problem. My gut reaction is simply
> > "an improved if/else kernel is not, alone, enough justification
> > for a new layout" and yet... I think we are seeing the start (I
> > hope) of a trend where Arrow is not just used "between systems"
> > (e.g. to shuttle data from one 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Felipe Oliveira Carvalho
On Wed, Jun 14, 2023 at 5:07 PM Raphael Taylor-Davies
 wrote:

> Even something relatively straightforward becomes a huge implementation
> effort when multiplied by a large number of codebases, users and
> datasets. Parquet is a great source of historical examples of the
> challenges of incremental changes that don't meaningfully unlock new
> use-cases. To take just one, Int96 was deprecated almost a decade ago,
> in favour of some additional metadata over an existing physical layout,
> and yet Int96 is still to the best of my knowledge used by Spark by
> default.
>
> That's not to say that I think the arrow specification should ossify and
> we should never change it, but I'm not hugely enthusiastic about adding
> encodings that are only incremental improvements over existing encodings.
>
> I therefore wonder if there are some new use-cases I am missing that
> would be unlocked by this change, and that wouldn't be supported by the
> dictionary proposal? Perhaps you could elaborate here?


The dict-encoded ListArray would add a memory indirection: the slot in the
dictionary has to be resolved before the offset can be. With ListView, the
CPU can access offset and size without waiting on that memory access. Both
need two buffer accesses after the validity check, but ListView doesn't
have that dependency.


> Whilst I do agree
> using dictionaries as proposed is perhaps a less elegant solution, I
> don't see anything inherently wrong with it, and if it ain't broke we
> really shouldn't be trying to fix it.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> On 14 June 2023 17:52:52 BST, Felipe Oliveira Carvalho
>  wrote:
>
> General approach to alternative formats aside, in the specific case
> of ListView, I think the implementation complexity is being
> overestimated in these discussions. The C++ Arrow implementation
> shares a lot of code between List and LargeList. And with some
> tweaks, I'm able to share that common infrastructure for ListView as
> well. [1] ListView is similar to list: it doesn't require offsets to
> be sorted and adds an extra buffer containing sizes. For symmetry
> with the List and LargeList types (FixedSizeList not included), I'm
> going to propose we add a LargeListView. That is not part of the
> draft implementation yet, but seems like an obvious thing to have
> now that I implemented the `if_else` specialization. [2] David Li
> asked about this above and I can confirm now that 64-bit version of
> ListView (LargeListView) is in the plans. Trying to avoid
> re-implementing some kernels is not a good goal to chase, IMO,
> because kernels need tweaks to take advantage of the format. [1]
> https://github.com/apache/arrow/pull/35345 [2]
> https://github.com/felipecrv/arrow/commits/list_view_backup --
> Felipe On Wed, Jun 14, 2023 at 12:08 PM Weston Pace
>  wrote:
>
> perhaps we could support this use-case as a canonical
> extension type over dictionary encoded, variable-sized arrays
>
> I believe this suggestion is valid and could be used to solve
> the if-else case. The algorithm, if I understand it, would be
> roughly: ``` // Note: Simple pseudocode, vectorization left as
> exercise for the reader auto indices_builder = ... auto
> list_builder = ... indices_builder.resize(batch.length); Array
> condition_mask = condition.EvaluateBatch(batch); for row_index
> in selected_rows(condition_mask): indices_builder[row_index] =
> list_builder.CurrentLength();
> list_builder.Append(if_expr.EvaluateRow(batch, row_index)) for
> row_index in unselected_rows(condition_mask):
> indices_builder[row_index] = list_builder.CurrentLength();
> list_builder.Append(else_expr.EvaluateRow(batch, row_index))
> return DictionaryArray(indices_builder.Finish(),
> list_builder.Finish()) ``` I also agree this is a slightly
> awkward use of dictionaries (e.g. the dictionary would have the
> same size as the # of indices) and perhaps not the most
> intuitive way to solve the problem. My gut reaction is simply
> "an improved if/else kernel is not, alone, enough justification
> for a new layout" and yet... I think we are seeing the start (I
> hope) of a trend where Arrow is not just used "between systems"
> (e.g. to shuttle data from one place to another, or between a
> query engine and a visualization tool) but also "within systems"
> (e.g. UDFs, bespoke file formats and temporary tables, between
> workers in a distributed query engine). When arrow is used
> "within systems" I think both the number of bespoke formats and
> the significance of conversion cost increases. For example, it's
> easy to say that Velox should convert at the boundary as data
> leaves Velox. But what if Velox (or datafusion or 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Andrew Lamb
gt; deprecating"
> > > > > >>>>> the
> > > > > >>>>>>> old
> > > > > >>>>>>>> list type: suggesting new implementations prefer the list
> > > > > >> view, but
> > > > > >>>>>>>> reassuring that implementations should support the old
> > format,
> > > > > >> even
> > > > > >>>>> if
> > > > > >>>>>>> they
> > > > > >>>>>>>> just convert to the new format. To be clear, this wouldn't
> > > > > >> mean we
> > > > > >>>>> plan
> > > > > >>>>>>> to
> > > > > >>>>>>>> create breaking changes in the format; but if we ever did
> > for
> > > > > >> other
> > > > > >>>>>>>> reasons, the old list type might go.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Arrow compute libraries could choose either format for
> > compute
> > > > > >>>>> support,
> > > > > >>>>>>> and
> > > > > >>>>>>>> plan to do conversion at the boundaries. Libraries that
> use
> > > > > >> the new
> > > > > >>>>>> type
> > > > > >>>>>>>> will have cheap conversion when reading the old type.
> > > Meanwhile
> > > > > >>>> those
> > > > > >>>>>>> that
> > > > > >>>>>>>> are still on the old type will have some incentive to move
> > > > > >> towards
> > > > > >>>>> the
> > > > > >>>>>>> new
> > > > > >>>>>>>> one, since that conversion will not be as efficient.
> > > > > >>>>>>>>
> > > > > >>>>>>>> [1]
> > > > > >>>>
> > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > > > >>>>>>>> [2]
> > > > > >>>>
> > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Sun, May 21, 2023 at 3:07 PM Will Jones <
> > > > > >>>> will.jones...@gmail.com>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Hello,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I think Sasha brings up a good point, that the advantages
> > of
> > > > > >> this
> > > > > >>>>>> format
> > > > > >>>>>>>>> seem to be primarily about query processing. Other
> > encodings
> > > > > >> like
> > > > > >>>>> REE
> > > > > >>>>>>> and
> > > > > >>>>>>>>> dictionary have space-saving advantages that justify them
> > > > > >> simply
> > > > > >>>> in
> > > > > >>>>>>> terms
> > > > > >>>>>>>>> of space efficiency (although they have query processing
> > > > > >>>> advantages
> > > > > >>>>> as
> > > > > >>>>>>>>> well). As discussed, most use cases are already well
> served
> > > by
> > > > > >>>>>> existing
> > > > > >>>>>>>>> list types and dictionary encoding.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I agree that there are cases where transferring this type
> > > > > >> without
> > > > > >>>>>>>>> conversion would be ideal. One use case I can think of is
> > if
> > > > > >> Velox
> > > > > >>>>>>> wants to
> > > > > >>>>>>>>> be able to take Arrow-based UDFs (possibly written with
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Felipe Oliveira Carvalho
t; >>>>>>>>> well). As discussed, most use cases are already well served
> > by
> > > > >>>>>> existing
> > > > >>>>>>>>> list types and dictionary encoding.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I agree that there are cases where transferring this type
> > > > >> without
> > > > >>>>>>>>> conversion would be ideal. One use case I can think of is
> if
> > > > >> Velox
> > > > >>>>>>> wants to
> > > > >>>>>>>>> be able to take Arrow-based UDFs (possibly written with
> > > > >> PyArrow,
> > > > >>>> for
> > > > >>>>>>>>> example) that operate on this column type and therefore
> wants
> > > > >>>>>> zero-copy
> > > > >>>>>>>>> exchange over the C Data Interface.
> > > > >>>>>>>>>
> > > > >>>>>>>>> One big question I have: we already have three list types:
> > > > >> list,
> > > > >>>>> large
> > > > >>>>>>>>> list (64-bit offsets), and fixed size list. Do we think we
> > > > >> will
> > > > >>>> only
> > > > >>>>>>> want a
> > > > >>>>>>>>> view version of the 32-bit offset variable length list? Or
> > > > >> are we
> > > > >>>>>>>>> potentially talking about view variants for all three?
> > > > >>>>>>>>>
> > > > >>>>>>>>> Best,
> > > > >>>>>>>>>
> > > > >>>>>>>>> Will Jones
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho <
> > > > >>>>>>>>> felipe...@gmail.com> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> The benefit of having a memory format that’s friendly to
> > > > >>>>>>> non-deterministic
> > > > >>>>>>>>>> order writes is unlocked by the transport and processing
> of
> > > > >> the
> > > > >>>>> data
> > > > >>>>>>> being
> > > > >>>>>>>>>> agnostic to the physical order as much as possible.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Requiring a conversion could cancel out that benefit. But
> it
> > > > >> can
> > > > >>>>> be a
> > > > >>>>>>>>>> provisory step for compatibility between systems that
> don’t
> > > > >>>>>> understand
> > > > >>>>>>> the
> > > > >>>>>>>>>> format yet. This is similar to the situation with
> > compression
> > > > >>>>> schemes
> > > > >>>>>>> like
> > > > >>>>>>>>>> run-end encoding — the goal is processing the compressed
> > data
> > > > >>>>>> directly
> > > > >>>>>>>>>> without an expansion step whenever possible.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> This is why having it as part of the open Arrow format is
> so
> > > > >>>>>> important:
> > > > >>>>>>>>>> everyone can agree on a format that’s friendly to parallel
> > > > >> and/or
> > > > >>>>>>>>>> vectorized compute kernels without introducing multiple
> > > > >>>>> incompatible
> > > > >>>>>>>>>> formats to the ecosystem and without imposing a conversion
> > > > >> step
> > > > >>>>>> between
> > > > >>>>>>>>>> the
> > > > >>>>>>&g

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
ttps://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> >>
> >>
> >> [2]
> >>
> >> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> >>
> >>
> >> On Sun, May 21, 2023 at
> >> 3:07 PM Will Jones <
> >>
> >> will.jones...@gmail.com>
> >>
> >> wrote:
> >>
> >> Hello, I think Sasha
> >> brings up a good point,
> >> that the advantages of
> >>
> >> this
> >>
> >> format
> >>
> >> seem to be primarily
> >> about query processing.
> >> Other encodings
> >>
> >> like
> >>
> >> REE
> >>
> >> and
> >>
> >> dictionary have
> >> space-saving advantages
> >> that justify them
> >>
> >> simply
> >>
> >> in
> >>
> >> terms
> >>
> >> of space efficiency
> >> (although they have
> >> query processing
> >>
> >> advantages
> >>
> >> as
> >>
> >> well). As discussed,
> >> most use cases are
> >> already well served by
> >>
> >> existing
> >>
> >> list types and
> >> dictionary encoding. I
> >> agree that there are
> >> cases where transferring
> >> this type
> >>
> >> without
> >>
> >> conversion would be
> >> ideal. One use case I
> >> can think of is if
> >>
> >> Velox
> >>
> >> wants to
> >>
> >> be able to take
> >> Arrow-based UDFs
> >> (possibly written with
> >>
> >> PyArrow,
> >>
> >> for
> >>
> >> example) that operate on
> >> this column type and
> >> therefore wants
> >>
> >> zero-copy
> >>
> >> exchange over the C Data
> >> Interface. One big
> >> question I have: we
> >> already have three list
> >> types:
> >>
> >> list,
> >>
> >> large
> >>
> >> list (64-bit offsets),
> >> and fixed size list. Do
> >> we think we
> >>
> >> will
> >>
> >> only
> >>
> >> want a
> >>
> >> view version of the
> >> 32-bit offset variable
> >> length list? Or
> >>
> >> are we
> >>
> >> potentially talking
> >> about view variants for
> >> all three? Best, Will
> >> Jones On Sun, May 21,
> >> 2023 at 2:19 PM Felipe
> >> Oliveira Carvalho <
> >> felipe...@gmail.com> wrote:
> >>
> >> The benefit of
> >> having a memory
> >> format that’s
> >> friendly to
> >>
> >> non-deterministic
> >>
> >> order writes is
> >> unlocked by the
> >> transport and
> >> processing of
> >>
> >> the
> >>
> >> data
> >>
> >> being
> >>
> >> agnostic to the
> >> physical order as
> >> much as possible.
> >> Requiring a
> >> conversion could
> >> cancel out that
> >> benefit. But it
> >>
> >> can
> >>
> >> be a
> >>
> >> provisory step for
> >> compatibility
> >> between systems that
> >> don’t
> >>
> >> understand
> >>
> >> the
> >>
> >> format yet. This is
> >> similar to the
> >> situation with
> >> compression
> >>
> >> schemes
> >>
> >> like
> >>
> >> run-end encoding —
> >> the goal is
> >> processing the
> >> compressed data
> >>
> >> directly
> >>
> >> without an expansion
> >> step whenever
> >> possible. This is
> >> why having it as
> >> part of the open
> >> Arrow format is so
> >>
> >> important:
> >>
> >> everyone can agree
> >> on a format that’s
> >> friendly to parallel
> >>
> >> and/or
> >>
> >> vectorized compute
> >> kernels without
> >> introducing multiple
> >>
> >> incompatible
> >>
> >> formats to the
> >> ecosystem and
> >> without imposing a
> >> conversion
> >>
> >> step
> >>
> >> between

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Antoine Pitrou
executing

conditionals
like
IF/SWITCH
statements,
which are

pervasive

among

our

workloads.
To fully
vectorize
it, one
first
evaluates
the

expression,

then

generate
a bitmap
containing
which
rows
take the
THEN and

which

take

the

ELSE
branch.
Then you
populate
all rows
that
match the

first

branch

by

evaluating
the THEN
expression
in a
vectorized

(branch-less

and

cache

friendly)
way, and
subsequently
the ELSE
branch.
If you

can’t

write

them

out-of-order,
you
would
either
have a
big
branch
per row

dispatching

to

the

right
expression
(slow),
or
populate
two
distinct
vectors

then

merging

them

at the
end
(potentially
even
slower).
How much
faster our

approach

is

highly
depends
on the
buffer
sizes
and
expressions,
but we

found

it

to

be

faster
enough
on
average
to
justify
us
extending
the

underlying

layout.

With
that
said,
this is
all
within a
single
thread of

execution.

Parallelization
is done
by
feeding
each
thread
its own

vector/data.

As

pointed
out in a
previous
message,
this
also
gives
you the

flexibility

to

implement
cardinality
increasing/reducing
operations,
but

we

don’t

use

it

for that
purpose.
Operations
like
filtering,
joining,

unnesting

and

similar

are done
by
wrapping
the
internal
vector
in a
dictionary,

as

these

need

to

work not
only on
“ListViews”
but on
any data
types with

any

encoding.

There

are more
details
on
Section
4.2.1 in
[1]
Beyond
this, it
also
gives
function/kernel
developers
more

flexibility

to

implement
operations
that
manipulate
Arrays/Maps.
For

example,

operations

that
slice
these
containers
can be
implemented
in a

zero-copy

manner

by

just
rearranging
the
lengths/offsets
indices,
without
ever

touching

the

larger
internal
buffers.
This is
a
similar
motivation
as

for

StringView

(think
of
substr(),
trim(),
and
similar).
One nice
last

property

is

that

this
layout
allows
for
overlapping
ranges.
This is

something

discussed

with

our ML
people
to allow
deduping
feature
values
in a tensor

(which

is

fairly

common),
but not
something
we have
leveraged
just
yet. [1]
-
https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
Best, --
Pedro
Pedreira
------------
From:
Felipe
Oliveira
Carvalho

Sent:
Friday,
May 19,
2023
10:01 AM
To:
dev@arrow.apache.org

Cc:
Pedro
Eugenio
Rocha
Pedreira

Subject:
Re:
[DISCUSS][Format]
Starting
the draft

implementation

of

the

ArrayView
array
format
+pedroerp
On Thu,
11 May
2023 at
17: 51
Raphael

Taylor-Davies


wrote:
Hi All, >

if

we

added

this, do
we think
many
Arrow
and
query >
engine

implementations

(for

example,
DataFusion)
will be
ZjQcmQRYFpfptBannerStart
This
Message
Is From
an
External
Sender
ZjQcmQRYFpfptBannerEnd
+pedroerp
On Thu,
11 May
2023 at
17:51
Raphael
Taylor-Davies

wrote:
Hi All,

if
we
added
this,
do
we
think
many
Arrow
and
query
engine
implementations
(for
example,
DataFusion)
will be

eager

to

add

full

support
for
the
type,
including
compute
kernels?
Or are

they

likely

to

just

convert
this
type
to
ListArray
at
import
boundaries?


I can't
speak
for
query
engines
in
general,
but at
least

for

arrow-rs

and by
extension
DataFusion,
and
based on
my current

understanding

of

the
use-cases
I would
be
rather
hesitant
to add
support

to the

kernels

for this
array
type,
definitely
instead
favouring

conversion

at

the

edges.
We
already
have
issues
with the
amount
of code

generation

resulting
in
binary
bloat
and long
compile
times,
and I

worry

this

would

worsen
this
situation
whilst
not
really
providing

compelling

advantages

for the
vast
majority
of
workloads
that
don't
interact

with

Velox.

Whilst I
can
definitely
see that
the
ListView

representation

is

probably

a better
way to
represent
variable
length
lists
than what

arrow

settled

upon,
I'm not
yet
convinced
it is
sufficiently
better to

incentivise

broad
ecosystem
adoption.
Kind
Regards,
Raphael
Taylor-Davies


On
11/05/2023
21:20,
Will
Jones
wrote:
Hi
Felipe,
Thanks
for
the
additional
details.


Velox
kernels
benefit
from
being
able
to
append
data
to

the

array

from

different
threads
without
care
for
strict
ordering.


Only the

offsets

array

has
to
be
written
according
to
logical
order
but
that
is

potentially a

much

smaller
buffer
than
the
values
buffer.


It
still
seems
to
me
like
applications
are
still
pretty

niche,

as I

suspect

in
most
cases
the
benefits
are
outweighed
by
the
costs.

The

benefit

here

seems
pretty
limited:
if
you
are
trying
to
split
work

between

threads,

usually
you
will
have
other
levels
such
as
array
chunks

to

parallelize.

And

if
you
have
an
incoming
stream
of
row
data,
you'll
want

to

append

in

predictable
order
to
match
the
order
of
the
other

arrays. Am

I

missing

something?
And,
IIUC,
the
cost
of
using
ListView
with
out-of-order


values

over

ListArray
is
you
lose
memory
locality;
the
values
of

element

2

are

no

longer
adjacent
to
the
values
of
el

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Antoine Pitrou
ng, but would it be fair to say that

this

is

an

implementation detail of the query engine? Query engines

don’t

necessarily

need to conform to the Arrow format internally, only at

ingest/egress

points, and performing a conversion from the non-view to

view

format

seems

like it would be very cheap (though I understand not

necessarily

the

other

way around, but you’d need to do that anyway if you’re

serializing).


Sasha Krassovsky


20 мая 2023 г., в 13:00, Will Jones <

will.jones...@gmail.com>

написал(а):


Thanks for sharing these details, Pedro. The conditional

branches

argument

makes a lot of sense to me.

The tensors point brings up some interesting issues. For

now,

we've

defined

our only tensor extension type to be built on a fixed size

list.

If a

use

case of this might be manipulating tensors with zero copy,

perhaps

that

suggests that we want a fixed size list variant? In

addition,

would

we

have

to define another extension type to be a ListView variant?

Or

would

we

want

to think about making extension types somehow valid across

various

encodings of the same "logical type"?


On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha

Pedreira

 wrote:

Hi all,

This is Pedro from the Velox team at Meta. This is my

first

time

here,

so

nice to e-meet you!

Adding to what Felipe said, the main reason we created

“ListView”

(though

we just call them ArrayVector/MapVector in Velox) is

that,

along

with

StringViews for strings, they allow us to write any

columnar

buffer

out-or-order, regardless of their types or encodings.

This is

naturally

doable for all primitive types (fixed-size), but not for

types

that

don’t

have fixed size and are required to be contiguous. The

StringView

and

ListView formats allow us to keep this invariant in

Velox.


Being able to write vectors out-of-order is useful when

executing

conditionals like IF/SWITCH statements, which are

pervasive

among

our

workloads. To fully vectorize it, one first evaluates the

expression,

then

generate a bitmap containing which rows take the THEN and

which

take

the

ELSE branch. Then you populate all rows that match the

first

branch

by

evaluating the THEN expression in a vectorized

(branch-less

and

cache

friendly) way, and subsequently the ELSE branch. If you

can’t

write

them

out-of-order, you would either have a big branch per row

dispatching

to

the

right expression (slow), or populate two distinct vectors

then

merging

them

at the end (potentially even slower). How much faster our

approach

is

highly depends on the buffer sizes and expressions, but

we

found

it

to

be

faster enough on average to justify us extending the

underlying

layout.


With that said, this is all within a single thread of

execution.

Parallelization is done by feeding each thread its own

vector/data.

As

pointed out in a previous message, this also gives you

the

flexibility

to

implement cardinality increasing/reducing operations, but

we

don’t

use

it

for that purpose. Operations like filtering, joining,

unnesting

and

similar

are done by wrapping the internal vector in a dictionary,

as

these

need

to

work not only on “ListViews” but on any data types with

any

encoding.

There

are more details on Section 4.2.1 in [1]

Beyond this, it also gives function/kernel developers

more

flexibility

to

implement operations that manipulate Arrays/Maps. For

example,

operations

that slice these containers can be implemented in a

zero-copy

manner

by

just rearranging the lengths/offsets indices, without

ever

touching

the

larger internal buffers. This is a similar motivation as

for

StringView

(think of substr(), trim(), and similar). One nice last

property

is

that

this layout allows for overlapping ranges. This is

something

discussed

with

our ML people to allow deduping feature values in a

tensor

(which

is

fairly

common), but not something we have leveraged just yet.

[1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf

Best,
--
Pedro Pedreira
____________
From: Felipe Oliveira Carvalho 
Sent: Friday, May 19, 2023 10:01 AM
To: dev@arrow.apache.org 
Cc: Pedro Eugenio Rocha Pedreira 
Subject: Re: [DISCUSS][Format] Starting the draft

implementation

of

the

ArrayView array format

+pedroerp On Thu, 11 May 2023 at 17: 51 Raphael

Taylor-Davies


taylordavies@ googlemail. com. invalid> wrote: Hi All, >

if

we

added

this, do we think many Arrow and query > engine

implementations

(for

example, DataFusion) will be
ZjQcmQRYFpfptBannerStart
This Message Is From an External Sender

ZjQcmQRYFpfptBannerEnd
+pedroerp

On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
 wrote:
Hi All,


if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion) will be

eager

to

add

full

support for the type, including compute kernels? Or are

they

likely

to

just

convert thi

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Andrew Lamb
; > > >>>>>>>>
> > > >>>>>>>> On Sun, May 21, 2023 at 3:07 PM Will Jones <
> > > >>>> will.jones...@gmail.com>
> > > >>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hello,
> > > >>>>>>>>>
> > > >>>>>>>>> I think Sasha brings up a good point, that the advantages of
> > > >> this
> > > >>>>>> format
> > > >>>>>>>>> seem to be primarily about query processing. Other encodings
> > > >> like
> > > >>>>> REE
> > > >>>>>>> and
> > > >>>>>>>>> dictionary have space-saving advantages that justify them
> > > >> simply
> > > >>>> in
> > > >>>>>>> terms
> > > >>>>>>>>> of space efficiency (although they have query processing
> > > >>>> advantages
> > > >>>>> as
> > > >>>>>>>>> well). As discussed, most use cases are already well served
> by
> > > >>>>>> existing
> > > >>>>>>>>> list types and dictionary encoding.
> > > >>>>>>>>>
> > > >>>>>>>>> I agree that there are cases where transferring this type
> > > >> without
> > > >>>>>>>>> conversion would be ideal. One use case I can think of is if
> > > >> Velox
> > > >>>>>>> wants to
> > > >>>>>>>>> be able to take Arrow-based UDFs (possibly written with
> > > >> PyArrow,
> > > >>>> for
> > > >>>>>>>>> example) that operate on this column type and therefore wants
> > > >>>>>> zero-copy
> > > >>>>>>>>> exchange over the C Data Interface.
> > > >>>>>>>>>
> > > >>>>>>>>> One big question I have: we already have three list types:
> > > >> list,
> > > >>>>> large
> > > >>>>>>>>> list (64-bit offsets), and fixed size list. Do we think we
> > > >> will
> > > >>>> only
> > > >>>>>>> want a
> > > >>>>>>>>> view version of the 32-bit offset variable length list? Or
> > > >> are we
> > > >>>>>>>>> potentially talking about view variants for all three?
> > > >>>>>>>>>
> > > >>>>>>>>> Best,
> > > >>>>>>>>>
> > > >>>>>>>>> Will Jones
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho <
> > > >>>>>>>>> felipe...@gmail.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> The benefit of having a memory format that’s friendly to
> > > >>>>>>> non-deterministic
> > > >>>>>>>>>> order writes is unlocked by the transport and processing of
> > > >> the
> > > >>>>> data
> > > >>>>>>> being
> > > >>>>>>>>>> agnostic to the physical order as much as possible.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Requiring a conversion could cancel out that benefit. But it
> > > >> can
> > > >>>>> be a
> > > >>>>>>>>>> provisory step for compatibility between systems that don’t
> > > >>>>>> understand
> > > >>>>>>> the
> > > >>>>>>>>>> format yet. This is similar to the situation with
> compression
> > > >>>>> schemes
> > > >>>>>>> like
> > > >>>>>>>>>> run-end encoding — the goal is processing the compressed
> data
> > > >>>>>> directly
> > > >>>&

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Felipe Oliveira Carvalho
gt; The benefit of having a memory format that’s friendly to
> > >>>>>>> non-deterministic
> > >>>>>>>>>> order writes is unlocked by the transport and processing of
> > >> the
> > >>>>> data
> > >>>>>>> being
> > >>>>>>>>>> agnostic to the physical order as much as possible.
> > >>>>>>>>>>
> > >>>>>>>>>> Requiring a conversion could cancel out that benefit. But it
> > >> can
> > >>>>> be a
> > >>>>>>>>>> provisory step for compatibility between systems that don’t
> > >>>>>> understand
> > >>>>>>> the
> > >>>>>>>>>> format yet. This is similar to the situation with compression
> > >>>>> schemes
> > >>>>>>> like
> > >>>>>>>>>> run-end encoding — the goal is processing the compressed data
> > >>>>>> directly
> > >>>>>>>>>> without an expansion step whenever possible.
> > >>>>>>>>>>
> > >>>>>>>>>> This is why having it as part of the open Arrow format is so
> > >>>>>> important:
> > >>>>>>>>>> everyone can agree on a format that’s friendly to parallel
> > >> and/or
> > >>>>>>>>>> vectorized compute kernels without introducing multiple
> > >>>>> incompatible
> > >>>>>>>>>> formats to the ecosystem and without imposing a conversion
> > >> step
> > >>>>>> between
> > >>>>>>>>>> the
> > >>>>>>>>>> different systems.
> > >>>>>>>>>>
> > >>>>>>>>>> —
> > >>>>>>>>>> Felipe
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, 20 May 2023 at 20:04 Aldrin
> > >> 
> > >>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I don't feel like this representation is necessarily a
> > >> detail of
> > >>>>> the
> > >>>>>>>>>> query
> > >>>>>>>>>>> engine, but I am also not sure why this representation would
> > >>>> have
> > >>>>> to
> > >>>>>>> be
> > >>>>>>>>>>> converted to a non-view format when serializing. Could you
> > >>>> clarify
> > >>>>>>>>>> that? My
> > >>>>>>>>>>> impression is that this representation could be used for
> > >>>>> persistence
> > >>>>>>> or
> > >>>>>>>>>>> data transfer, though it can be more complex to guarantee
> > >> the
> > >>>>>> portion
> > >>>>>>> of
> > >>>>>>>>>>> the buffer that an index points to is also present in
> > >> memory.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Sent from Proton Mail for iOS
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > >>>>>>>>>> krassovskysa...@gmail.com
> > >>>>>>>>>>>
> > >> >
> > >>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi everyone,
> > >>>>>>>>>>> I understand that there are numerous benefits to this
> > >>>>> representation
> > >>>>>>>>>>> during query processing, but would it be fair to say that
> > >> this
> > >>>> is
> > >>>>> an
> > >>>>>>>>>>> implementation detail of the query engine? Query engines
> > >> don’t
> > >>>>>>>>>> necessarily
> > >>>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
;>>>>>>>> exchange over the C Data Interface.
> >>>>>>>>>
> >>>>>>>>> One big question I have: we already have three list types:
> >> list,
> >>>>> large
> >>>>>>>>> list (64-bit offsets), and fixed size list. Do we think we
> >> will
> >>>> only
> >>>>>>> want a
> >>>>>>>>> view version of the 32-bit offset variable length list? Or
> >> are we
> >>>>>>>>> potentially talking about view variants for all three?
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> Will Jones
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho <
> >>>>>>>>> felipe...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> The benefit of having a memory format that’s friendly to
> >>>>>>> non-deterministic
> >>>>>>>>>> order writes is unlocked by the transport and processing of
> >> the
> >>>>> data
> >>>>>>> being
> >>>>>>>>>> agnostic to the physical order as much as possible.
> >>>>>>>>>>
> >>>>>>>>>> Requiring a conversion could cancel out that benefit. But it
> >> can
> >>>>> be a
> >>>>>>>>>> provisory step for compatibility between systems that don’t
> >>>>>> understand
> >>>>>>> the
> >>>>>>>>>> format yet. This is similar to the situation with compression
> >>>>> schemes
> >>>>>>> like
> >>>>>>>>>> run-end encoding — the goal is processing the compressed data
> >>>>>> directly
> >>>>>>>>>> without an expansion step whenever possible.
> >>>>>>>>>>
> >>>>>>>>>> This is why having it as part of the open Arrow format is so
> >>>>>> important:
> >>>>>>>>>> everyone can agree on a format that’s friendly to parallel
> >> and/or
> >>>>>>>>>> vectorized compute kernels without introducing multiple
> >>>>> incompatible
> >>>>>>>>>> formats to the ecosystem and without imposing a conversion
> >> step
> >>>>>> between
> >>>>>>>>>> the
> >>>>>>>>>> different systems.
> >>>>>>>>>>
> >>>>>>>>>> —
> >>>>>>>>>> Felipe
> >>>>>>>>>>
> >>>>>>>>>> On Sat, 20 May 2023 at 20:04 Aldrin
> >> 
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I don't feel like this representation is necessarily a
> >> detail of
> >>>>> the
> >>>>>>>>>> query
> >>>>>>>>>>> engine, but I am also not sure why this representation would
> >>>> have
> >>>>> to
> >>>>>>> be
> >>>>>>>>>>> converted to a non-view format when serializing. Could you
> >>>> clarify
> >>>>>>>>>> that? My
> >>>>>>>>>>> impression is that this representation could be used for
> >>>>> persistence
> >>>>>>> or
> >>>>>>>>>>> data transfer, though it can be more complex to guarantee
> >> the
> >>>>>> portion
> >>>>>>> of
> >>>>>>>>>>> the buffer that an index points to is also present in
> >> memory.
> >>>>>>>>>>>
> >>>>>>>>>>> Sent from Proton Mail for iOS
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> >>>>>>>>>> krass

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Antoine Pitrou
tion

during query processing, but would it be fair to say that

this

is

an

implementation detail of the query engine? Query engines

don’t

necessarily

need to conform to the Arrow format internally, only at

ingest/egress

points, and performing a conversion from the non-view to

view

format

seems

like it would be very cheap (though I understand not

necessarily

the

other

way around, but you’d need to do that anyway if you’re

serializing).


Sasha Krassovsky


20 мая 2023 г., в 13:00, Will Jones <

will.jones...@gmail.com>

написал(а):


Thanks for sharing these details, Pedro. The conditional

branches

argument

makes a lot of sense to me.

The tensors point brings up some interesting issues. For

now,

we've

defined

our only tensor extension type to be built on a fixed size

list.

If a

use

case of this might be manipulating tensors with zero copy,

perhaps

that

suggests that we want a fixed size list variant? In

addition,

would

we

have

to define another extension type to be a ListView variant?

Or

would

we

want

to think about making extension types somehow valid across

various

encodings of the same "logical type"?


On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha

Pedreira

 wrote:

Hi all,

This is Pedro from the Velox team at Meta. This is my

first

time

here,

so

nice to e-meet you!

Adding to what Felipe said, the main reason we created

“ListView”

(though

we just call them ArrayVector/MapVector in Velox) is that,

along

with

StringViews for strings, they allow us to write any

columnar

buffer

out-or-order, regardless of their types or encodings.

This is

naturally

doable for all primitive types (fixed-size), but not for

types

that

don’t

have fixed size and are required to be contiguous. The

StringView

and

ListView formats allow us to keep this invariant in Velox.

Being able to write vectors out-of-order is useful when

executing

conditionals like IF/SWITCH statements, which are

pervasive

among

our

workloads. To fully vectorize it, one first evaluates the

expression,

then

generate a bitmap containing which rows take the THEN and

which

take

the

ELSE branch. Then you populate all rows that match the

first

branch

by

evaluating the THEN expression in a vectorized

(branch-less

and

cache

friendly) way, and subsequently the ELSE branch. If you

can’t

write

them

out-of-order, you would either have a big branch per row

dispatching

to

the

right expression (slow), or populate two distinct vectors

then

merging

them

at the end (potentially even slower). How much faster our

approach

is

highly depends on the buffer sizes and expressions, but we

found

it

to

be

faster enough on average to justify us extending the

underlying

layout.


With that said, this is all within a single thread of

execution.

Parallelization is done by feeding each thread its own

vector/data.

As

pointed out in a previous message, this also gives you the

flexibility

to

implement cardinality increasing/reducing operations, but

we

don’t

use

it

for that purpose. Operations like filtering, joining,

unnesting

and

similar

are done by wrapping the internal vector in a dictionary,

as

these

need

to

work not only on “ListViews” but on any data types with

any

encoding.

There

are more details on Section 4.2.1 in [1]

Beyond this, it also gives function/kernel developers more

flexibility

to

implement operations that manipulate Arrays/Maps. For

example,

operations

that slice these containers can be implemented in a

zero-copy

manner

by

just rearranging the lengths/offsets indices, without ever

touching

the

larger internal buffers. This is a similar motivation as

for

StringView

(think of substr(), trim(), and similar). One nice last

property

is

that

this layout allows for overlapping ranges. This is

something

discussed

with

our ML people to allow deduping feature values in a tensor

(which

is

fairly

common), but not something we have leveraged just yet.

[1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf

Best,
--
Pedro Pedreira

From: Felipe Oliveira Carvalho 
Sent: Friday, May 19, 2023 10:01 AM
To: dev@arrow.apache.org 
Cc: Pedro Eugenio Rocha Pedreira 
Subject: Re: [DISCUSS][Format] Starting the draft

implementation

of

the

ArrayView array format

+pedroerp On Thu, 11 May 2023 at 17: 51 Raphael

Taylor-Davies


taylordavies@ googlemail. com. invalid> wrote: Hi All, >

if

we

added

this, do we think many Arrow and query > engine

implementations

(for

example, DataFusion) will be
ZjQcmQRYFpfptBannerStart
This Message Is From an External Sender

ZjQcmQRYFpfptBannerEnd
+pedroerp

On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
 wrote:
Hi All,


if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion) will be

eager

to

add

full

support for the type, including compute kernels? Or are

they

likely

to


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Raphael Taylor-Davies

Hi All,

I might be missing something, but rather than opening the can of worms 
of alternative layouts, etc... perhaps we could support this use-case as 
a canonical extension type over dictionary encoded, variable-sized 
arrays. I'll try to explain my reasoning below, but the major advantage 
would be that this would not require any modifications to the existing 
specification or codebases, and I think would support all the use-cases 
I've seen articulated.


Taking a step back:

- Variable-sized arrays encode value positions in a list of 
monotonically increasing integer offsets, with each slice of values 
identified by the two consecutive offsets

- Dictionaries encode value positions in a list of integers in any order
- Views encode value positions as a pair of start and end integer offsets

If you squint, dictionary encoding a variable-sized array removes the 
ordering restriction on the value offsets, resulting in a very similar 
construction to the proposed view arrays. The major differences that 
come to mind are:


- Views can contain partially overlapping value ranges, whereas 
dictionaries can only encode disjoint or identical ranges
- Views require one less addition operation to access a value, and may 
therefore have slightly different performance characteristics
- A null in a view takes up two offsets, a null in a dictionary takes 
one key
- Selection/take/filter on a dictionary is the same as for a primitive 
array, and should be ~2x faster than a view
- The dictionary could use a smaller key size than the offset size of 
its child list, potentially saving space compared to a view
- Unreferenced values in a view are free, whereas they take up an offset 
in the child list

- Kernels may be optimised assuming small and/or unique dictionaries
- IPC files currently only support a single dictionary per column, 
something I suspect we may want to revisit regardless


I do think using dictionaries in this way is perhaps a little confusing 
to users, and a canonical extension type will only go so far to address 
this, but I personally think this is a reasonable price to pay for 
avoiding the ecosystem fragmentation that would potentially result from 
introducing alternative layouts for common types such as strings, whilst 
also avoiding the need to reimplement a load of kernels.


What do people think?

Kind Regards,

Raphael Taylor-Davies

On 14/06/2023 06:34, Will Jones wrote:

Hello Arrow devs,

Just a quick note. To answer one of my earlier questions:

1. Is this array type currently only used in Velox? (not DuckDB like some

of the other new types?) What evidence do we have that it will become used
outside of Velox?


This type is also used by DuckDB. Found discussion today in a talk from
Mark Raasveldt [1]. That does improve the case for adding this type in my
eyes.

Best,

Will Jones

[1] https://youtu.be/bZOvAKGkzpQ?t=1570



On Tue, Jun 6, 2023 at 7:40 PM Weston Pace  wrote:


This implies that each canonical alternative layout would codify a
primary layout as its "fallback."

Yes, that was part of my proposal:


  * A new layout, if it is semantically equivalent to another, is

considered an alternative layout

Or, to phrase it another way.  If there is not a "fallback" then it is not
an alternative layout.  It's a brand new primary layout.  I'd expect this
to be quite rare.  I can't really even hypothesize any examples.  I think
the only truly atomic layouts are fixed-width, list, and struct.


This seems reasonable but it opens
up some cans of worms, such as how two components communicating
through an Arrow interface would negotiate which layout is supported

Most APIs that I'm aware of already do this.  For example,
pyarrow.parquet.read_table has a "read_dictionary" property that can be
used to control whether or not a column is returned with the dictionary
encoding.  There is no way (that I'm aware of) to get a column in REE
encoding today without explicitly requesting it.  In fact, this could be as
simple as a boolean "use_advanced_features" flag although I would
discourage something so simplistic.  The point is that arrow-compatible
software should, by default, emit types that are supported by all arrow
implementations.

Of course, there is no way to enforce this, it's just a guideline / strong
recommendation on how software should behave if it wants to state "arrow
compatible" as a feature.

On Tue, Jun 6, 2023 at 3:33 PM Ian Cook  wrote:


Thanks Weston. That all sounds reasonable to me.


  with the caveat that the primary layout must be emitted if the user

does not specifically request the alternative layout

This implies that each canonical alternative layout would codify a
primary layout as its "fallback." This seems reasonable but it opens
up some cans of worms, such as how two components communicating
through an Arrow interface would negotiate which layout is supported.
I suppose such details should be discussed in a separate thread, but I
raise this here just to point 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-13 Thread Will Jones
 > > > > > > > > >>> format yet. This is similar to the situation with
> > compression
> > > > > > > schemes
> > > > > > > > > like
> > > > > > > > > >>> run-end encoding — the goal is processing the
> compressed
> > data
> > > > > > > > directly
> > > > > > > > > >>> without an expansion step whenever possible.
> > > > > > > > > >>>
> > > > > > > > > >>> This is why having it as part of the open Arrow format
> > is so
> > > > > > > > important:
> > > > > > > > > >>> everyone can agree on a format that’s friendly to
> > parallel
> > > > and/or
> > > > > > > > > >>> vectorized compute kernels without introducing multiple
> > > > > > > incompatible
> > > > > > > > > >>> formats to the ecosystem and without imposing a
> > conversion
> > > > step
> > > > > > > > between
> > > > > > > > > >>> the
> > > > > > > > > >>> different systems.
> > > > > > > > > >>>
> > > > > > > > > >>> —
> > > > > > > > > >>> Felipe
> > > > > > > > > >>>
> > > > > > > > > >>> On Sat, 20 May 2023 at 20:04 Aldrin
> > > > 
> > > > > > > > > wrote:
> > > > > > > > > >>>
> > > > > > > > > >>>> I don't feel like this representation is necessarily a
> > > > detail of
> > > > > > > the
> > > > > > > > > >>> query
> > > > > > > > > >>>> engine, but I am also not sure why this representation
> > would
> > > > > > have
> > > > > > > to
> > > > > > > > > be
> > > > > > > > > >>>> converted to a non-view format when serializing. Could
> > you
> > > > > > clarify
> > > > > > > > > >>> that? My
> > > > > > > > > >>>> impression is that this representation could be used
> for
> > > > > > > persistence
> > > > > > > > > or
> > > > > > > > > >>>> data transfer, though it can be more complex to
> > guarantee
> > > > the
> > > > > > > > portion
> > > > > > > > > of
> > > > > > > > > >>>> the buffer that an index points to is also present in
> > > > memory.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Sent from Proton Mail for iOS
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > > > > > > > >>> krassovskysa...@gmail.com
> > > > > > > > > >>>>
> > > > >
> > > > > > > > > wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> Hi everyone,
> > > > > > > > > >>>> I understand that there are numerous benefits to this
> > > > > > > representation
> > > > > > > > > >>>> during query processing, but would it be fair to say
> > that
> > > > this
> > > > > > is
> > > > > > > an
> > > > > > > > > >>>> implementation detail of the query engine? Query
> engines
> > > > don’t
> > > > > > > > > >>> necessarily
> > > > > > > > > >>>> need to conform to the Arrow format internally, only
> at
> > > > > > > > ingest/egress
> > > > > > > > > >>>> points, and performing a conversion from the non-view
> to
> > > > view
> > > > > > > format
> > > > > >

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
> > > > > > > > >> felipe...@gmail.com> wrote:
> > > > > > > > >>
> > > > > > > > >>> The benefit of having a memory format that’s friendly to
> > > > > > > > non-deterministic
> > > > > > > > >>> order writes is unlocked by the transport and processing
> of
> > > the
> > > > > > data
> > > > > > > > being
> > > > > > > > >>> agnostic to the physical order as much as possible.
> > > > > > > > >>>
> > > > > > > > >>> Requiring a conversion could cancel out that benefit.
> But it
> > > can
> > > > > > be a
> > > > > > > > >>> provisory step for compatibility between systems that
> don’t
> > > > > > > understand
> > > > > > > > the
> > > > > > > > >>> format yet. This is similar to the situation with
> compression
> > > > > > schemes
> > > > > > > > like
> > > > > > > > >>> run-end encoding — the goal is processing the compressed
> data
> > > > > > > directly
> > > > > > > > >>> without an expansion step whenever possible.
> > > > > > > > >>>
> > > > > > > > >>> This is why having it as part of the open Arrow format
> is so
> > > > > > > important:
> > > > > > > > >>> everyone can agree on a format that’s friendly to
> parallel
> > > and/or
> > > > > > > > >>> vectorized compute kernels without introducing multiple
> > > > > > incompatible
> > > > > > > > >>> formats to the ecosystem and without imposing a
> conversion
> > > step
> > > > > > > between
> > > > > > > > >>> the
> > > > > > > > >>> different systems.
> > > > > > > > >>>
> > > > > > > > >>> —
> > > > > > > > >>> Felipe
> > > > > > > > >>>
> > > > > > > > >>> On Sat, 20 May 2023 at 20:04 Aldrin
> > > 
> > > > > > > > wrote:
> > > > > > > > >>>
> > > > > > > > >>>> I don't feel like this representation is necessarily a
> > > detail of
> > > > > > the
> > > > > > > > >>> query
> > > > > > > > >>>> engine, but I am also not sure why this representation
> would
> > > > > have
> > > > > > to
> > > > > > > > be
> > > > > > > > >>>> converted to a non-view format when serializing. Could
> you
> > > > > clarify
> > > > > > > > >>> that? My
> > > > > > > > >>>> impression is that this representation could be used for
> > > > > > persistence
> > > > > > > > or
> > > > > > > > >>>> data transfer, though it can be more complex to
> guarantee
> > > the
> > > > > > > portion
> > > > > > > > of
> > > > > > > > >>>> the buffer that an index points to is also present in
> > > memory.
> > > > > > > > >>>>
> > > > > > > > >>>> Sent from Proton Mail for iOS
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > > > > > > >>> krassovskysa...@gmail.com
> > > > > > > > >>>>
> > > >
> > > > > > > > wrote:
> > > > > > > > >>>>
> > > > > > > > >>>> Hi everyone,
> > > > > > > > >>>> I understand that there are numerous benefits to this
> > > > > > representation
> > > > > > > > >>>> during query processing, but would it be fair to say
> that
> > > this
> > > > &

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Ian Cook
t; > > > > > >> potentially talking about view variants for all three?
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >>
> > > > > > > >> Will Jones
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho <
> > > > > > > >> felipe...@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >>> The benefit of having a memory format that’s friendly to
> > > > > > > non-deterministic
> > > > > > > >>> order writes is unlocked by the transport and processing of
> > the
> > > > > data
> > > > > > > being
> > > > > > > >>> agnostic to the physical order as much as possible.
> > > > > > > >>>
> > > > > > > >>> Requiring a conversion could cancel out that benefit. But it
> > can
> > > > > be a
> > > > > > > >>> provisory step for compatibility between systems that don’t
> > > > > > understand
> > > > > > > the
> > > > > > > >>> format yet. This is similar to the situation with compression
> > > > > schemes
> > > > > > > like
> > > > > > > >>> run-end encoding — the goal is processing the compressed data
> > > > > > directly
> > > > > > > >>> without an expansion step whenever possible.
> > > > > > > >>>
> > > > > > > >>> This is why having it as part of the open Arrow format is so
> > > > > > important:
> > > > > > > >>> everyone can agree on a format that’s friendly to parallel
> > and/or
> > > > > > > >>> vectorized compute kernels without introducing multiple
> > > > > incompatible
> > > > > > > >>> formats to the ecosystem and without imposing a conversion
> > step
> > > > > > between
> > > > > > > >>> the
> > > > > > > >>> different systems.
> > > > > > > >>>
> > > > > > > >>> —
> > > > > > > >>> Felipe
> > > > > > > >>>
> > > > > > > >>> On Sat, 20 May 2023 at 20:04 Aldrin
> > 
> > > > > > > wrote:
> > > > > > > >>>
> > > > > > > >>>> I don't feel like this representation is necessarily a
> > detail of
> > > > > the
> > > > > > > >>> query
> > > > > > > >>>> engine, but I am also not sure why this representation would
> > > > have
> > > > > to
> > > > > > > be
> > > > > > > >>>> converted to a non-view format when serializing. Could you
> > > > clarify
> > > > > > > >>> that? My
> > > > > > > >>>> impression is that this representation could be used for
> > > > > persistence
> > > > > > > or
> > > > > > > >>>> data transfer, though it can be more complex to guarantee
> > the
> > > > > > portion
> > > > > > > of
> > > > > > > >>>> the buffer that an index points to is also present in
> > memory.
> > > > > > > >>>>
> > > > > > > >>>> Sent from Proton Mail for iOS
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > > > > > >>> krassovskysa...@gmail.com
> > > > > > > >>>>
> > >
> > > > > > > wrote:
> > > > > > > >>>>
> > > > > > > >>>> Hi everyone,
> > > > > > > >>>> I understand that there are numerous benefits to this
> > > > > representation
> > > > > > > >>>> during query processing, but would it be fair t

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Felipe Oliveira Carvalho
t; > > > >>> The benefit of having a memory format that’s friendly to
> > > > > > non-deterministic
> > > > > > >>> order writes is unlocked by the transport and processing of
> the
> > > > data
> > > > > > being
> > > > > > >>> agnostic to the physical order as much as possible.
> > > > > > >>>
> > > > > > >>> Requiring a conversion could cancel out that benefit. But it
> can
> > > > be a
> > > > > > >>> provisory step for compatibility between systems that don’t
> > > > > understand
> > > > > > the
> > > > > > >>> format yet. This is similar to the situation with compression
> > > > schemes
> > > > > > like
> > > > > > >>> run-end encoding — the goal is processing the compressed data
> > > > > directly
> > > > > > >>> without an expansion step whenever possible.
> > > > > > >>>
> > > > > > >>> This is why having it as part of the open Arrow format is so
> > > > > important:
> > > > > > >>> everyone can agree on a format that’s friendly to parallel
> and/or
> > > > > > >>> vectorized compute kernels without introducing multiple
> > > > incompatible
> > > > > > >>> formats to the ecosystem and without imposing a conversion
> step
> > > > > between
> > > > > > >>> the
> > > > > > >>> different systems.
> > > > > > >>>
> > > > > > >>> —
> > > > > > >>> Felipe
> > > > > > >>>
> > > > > > >>> On Sat, 20 May 2023 at 20:04 Aldrin
> 
> > > > > > wrote:
> > > > > > >>>
> > > > > > >>>> I don't feel like this representation is necessarily a
> detail of
> > > > the
> > > > > > >>> query
> > > > > > >>>> engine, but I am also not sure why this representation would
> > > have
> > > > to
> > > > > > be
> > > > > > >>>> converted to a non-view format when serializing. Could you
> > > clarify
> > > > > > >>> that? My
> > > > > > >>>> impression is that this representation could be used for
> > > > persistence
> > > > > > or
> > > > > > >>>> data transfer, though it can be more complex to guarantee
> the
> > > > > portion
> > > > > > of
> > > > > > >>>> the buffer that an index points to is also present in
> memory.
> > > > > > >>>>
> > > > > > >>>> Sent from Proton Mail for iOS
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > > > > >>> krassovskysa...@gmail.com
> > > > > > >>>>
> >
> > > > > > wrote:
> > > > > > >>>>
> > > > > > >>>> Hi everyone,
> > > > > > >>>> I understand that there are numerous benefits to this
> > > > representation
> > > > > > >>>> during query processing, but would it be fair to say that
> this
> > > is
> > > > an
> > > > > > >>>> implementation detail of the query engine? Query engines
> don’t
> > > > > > >>> necessarily
> > > > > > >>>> need to conform to the Arrow format internally, only at
> > > > > ingest/egress
> > > > > > >>>> points, and performing a conversion from the non-view to
> view
> > > > format
> > > > > > >>> seems
> > > > > > >>>> like it would be very cheap (though I understand not
> necessarily
> > > > the
> > > > > > >>> other
> > > > > > >>>> way around, but you’d need to do that anyway if you’re
> > > > serializing).
> > > > > > >>>>
> > > > > > >>>&g

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
I am also not sure why this representation would
> > > have
> > > > to
> > > > > > be
> > > > > > >>>> converted to a non-view format when serializing. Could you
> > > clarify
> > > > > > >>> that? My
> > > > > > >>>> impression is that this representation could be used for
> > > > persistence
> > > > > > or
> > > > > > >>>> data transfer, though it can be more complex to guarantee
> the
> > > > > portion
> > > > > > of
> > > > > > >>>> the buffer that an index points to is also present in
> memory.
> > > > > > >>>>
> > > > > > >>>> Sent from Proton Mail for iOS
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > > > > >>> krassovskysa...@gmail.com
> > > > > > >>>>
> >
> > > > > > wrote:
> > > > > > >>>>
> > > > > > >>>> Hi everyone,
> > > > > > >>>> I understand that there are numerous benefits to this
> > > > representation
> > > > > > >>>> during query processing, but would it be fair to say that
> this
> > > is
> > > > an
> > > > > > >>>> implementation detail of the query engine? Query engines
> don’t
> > > > > > >>> necessarily
> > > > > > >>>> need to conform to the Arrow format internally, only at
> > > > > ingest/egress
> > > > > > >>>> points, and performing a conversion from the non-view to
> view
> > > > format
> > > > > > >>> seems
> > > > > > >>>> like it would be very cheap (though I understand not
> necessarily
> > > > the
> > > > > > >>> other
> > > > > > >>>> way around, but you’d need to do that anyway if you’re
> > > > serializing).
> > > > > > >>>>
> > > > > > >>>> Sasha Krassovsky
> > > > > > >>>>
> > > > > > >>>>> 20 мая 2023 г., в 13:00, Will Jones <
> will.jones...@gmail.com>
> > > > > > >>>> написал(а):
> > > > > > >>>>>
> > > > > > >>>>> Thanks for sharing these details, Pedro. The conditional
> > > > branches
> > > > > > >>>> argument
> > > > > > >>>>> makes a lot of sense to me.
> > > > > > >>>>>
> > > > > > >>>>> The tensors point brings up some interesting issues. For
> now,
> > > > we've
> > > > > > >>>> defined
> > > > > > >>>>> our only tensor extension type to be built on a fixed size
> > > list.
> > > > > If a
> > > > > > >>> use
> > > > > > >>>>> case of this might be manipulating tensors with zero copy,
> > > > perhaps
> > > > > > >>> that
> > > > > > >>>>> suggests that we want a fixed size list variant? In
> addition,
> > > > would
> > > > > > we
> > > > > > >>>> have
> > > > > > >>>>> to define another extension type to be a ListView variant?
> Or
> > > > would
> > > > > > we
> > > > > > >>>> want
> > > > > > >>>>> to think about making extension types somehow valid across
> > > > various
> > > > > > >>>>> encodings of the same "logical type"?
> > > > > > >>>>>
> > > > > > >>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha
> Pedreira
> > > > > > >>>>>>  wrote:
> > > > > > >>>>>>
> > > > > > >>>>>> Hi all,
> > > > > > >>>>>>
> > > > > > >>>>>> This is Pedro from the Velox team at Meta. Thi

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Ian Cook
> > > > >>> that? My
> > > > > >>>> impression is that this representation could be used for
> > > persistence
> > > > > or
> > > > > >>>> data transfer, though it can be more complex to guarantee the
> > > > portion
> > > > > of
> > > > > >>>> the buffer that an index points to is also present in memory.
> > > > > >>>>
> > > > > >>>> Sent from Proton Mail for iOS
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > > > >>> krassovskysa...@gmail.com
> > > > > >>>> >
> > > > > wrote:
> > > > > >>>>
> > > > > >>>> Hi everyone,
> > > > > >>>> I understand that there are numerous benefits to this
> > > representation
> > > > > >>>> during query processing, but would it be fair to say that this
> > is
> > > an
> > > > > >>>> implementation detail of the query engine? Query engines don’t
> > > > > >>> necessarily
> > > > > >>>> need to conform to the Arrow format internally, only at
> > > > ingest/egress
> > > > > >>>> points, and performing a conversion from the non-view to view
> > > format
> > > > > >>> seems
> > > > > >>>> like it would be very cheap (though I understand not necessarily
> > > the
> > > > > >>> other
> > > > > >>>> way around, but you’d need to do that anyway if you’re
> > > serializing).
> > > > > >>>>
> > > > > >>>> Sasha Krassovsky
> > > > > >>>>
> > > > > >>>>> 20 мая 2023 г., в 13:00, Will Jones 
> > > > > >>>> написал(а):
> > > > > >>>>>
> > > > > >>>>> Thanks for sharing these details, Pedro. The conditional
> > > branches
> > > > > >>>> argument
> > > > > >>>>> makes a lot of sense to me.
> > > > > >>>>>
> > > > > >>>>> The tensors point brings up some interesting issues. For now,
> > > we've
> > > > > >>>> defined
> > > > > >>>>> our only tensor extension type to be built on a fixed size
> > list.
> > > > If a
> > > > > >>> use
> > > > > >>>>> case of this might be manipulating tensors with zero copy,
> > > perhaps
> > > > > >>> that
> > > > > >>>>> suggests that we want a fixed size list variant? In addition,
> > > would
> > > > > we
> > > > > >>>> have
> > > > > >>>>> to define another extension type to be a ListView variant? Or
> > > would
> > > > > we
> > > > > >>>> want
> > > > > >>>>> to think about making extension types somehow valid across
> > > various
> > > > > >>>>> encodings of the same "logical type"?
> > > > > >>>>>
> > > > > >>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
> > > > > >>>>>>  wrote:
> > > > > >>>>>>
> > > > > >>>>>> Hi all,
> > > > > >>>>>>
> > > > > >>>>>> This is Pedro from the Velox team at Meta. This is my first
> > time
> > > > > >>> here,
> > > > > >>>> so
> > > > > >>>>>> nice to e-meet you!
> > > > > >>>>>>
> > > > > >>>>>> Adding to what Felipe said, the main reason we created
> > > “ListView”
> > > > > >>>> (though
> > > > > >>>>>> we just call them ArrayVector/MapVector in Velox) is that,
> > along
> > > > > with
> > > > > >>>>>> StringViews for strings, they allow us to write any columnar
> > > > buffer
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-27 Thread Micah Kornfield
> > various
> > > > >>>>> encodings of the same "logical type"?
> > > > >>>>>
> > > > >>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
> > > > >>>>>>  wrote:
> > > > >>>>>>
> > > > >>>>>> Hi all,
> > > > >>>>>>
> > > > >>>>>> This is Pedro from the Velox team at Meta. This is my first
> time
> > > > >>> here,
> > > > >>>> so
> > > > >>>>>> nice to e-meet you!
> > > > >>>>>>
> > > > >>>>>> Adding to what Felipe said, the main reason we created
> > “ListView”
> > > > >>>> (though
> > > > >>>>>> we just call them ArrayVector/MapVector in Velox) is that,
> along
> > > > with
> > > > >>>>>> StringViews for strings, they allow us to write any columnar
> > > buffer
> > > > >>>>>> out-or-order, regardless of their types or encodings. This is
> > > > >>> naturally
> > > > >>>>>> doable for all primitive types (fixed-size), but not for types
> > > that
> > > > >>>> don’t
> > > > >>>>>> have fixed size and are required to be contiguous. The
> > StringView
> > > > and
> > > > >>>>>> ListView formats allow us to keep this invariant in Velox.
> > > > >>>>>>
> > > > >>>>>> Being able to write vectors out-of-order is useful when
> > executing
> > > > >>>>>> conditionals like IF/SWITCH statements, which are pervasive
> > among
> > > > our
> > > > >>>>>> workloads. To fully vectorize it, one first evaluates the
> > > > expression,
> > > > >>>> then
> > > > >>>>>> generate a bitmap containing which rows take the THEN and
> which
> > > take
> > > > >>> the
> > > > >>>>>> ELSE branch. Then you populate all rows that match the first
> > > branch
> > > > >>> by
> > > > >>>>>> evaluating the THEN expression in a vectorized (branch-less
> and
> > > > cache
> > > > >>>>>> friendly) way, and subsequently the ELSE branch. If you can’t
> > > write
> > > > >>> them
> > > > >>>>>> out-of-order, you would either have a big branch per row
> > > dispatching
> > > > >>> to
> > > > >>>> the
> > > > >>>>>> right expression (slow), or populate two distinct vectors then
> > > > >>> merging
> > > > >>>> them
> > > > >>>>>> at the end (potentially even slower). How much faster our
> > approach
> > > > is
> > > > >>>>>> highly depends on the buffer sizes and expressions, but we
> found
> > > it
> > > > >>> to
> > > > >>>> be
> > > > >>>>>> faster enough on average to justify us extending the
> underlying
> > > > >>> layout.
> > > > >>>>>>
> > > > >>>>>> With that said, this is all within a single thread of
> execution.
> > > > >>>>>> Parallelization is done by feeding each thread its own
> > > vector/data.
> > > > >>> As
> > > > >>>>>> pointed out in a previous message, this also gives you the
> > > > >>> flexibility
> > > > >>>> to
> > > > >>>>>> implement cardinality increasing/reducing operations, but we
> > don’t
> > > > >>> use
> > > > >>>> it
> > > > >>>>>> for that purpose. Operations like filtering, joining,
> unnesting
> > > and
> > > > >>>> similar
> > > > >>>>>> are done by wrapping the internal vector in a dictionary, as
> > these
> > > > >>> need
> > > > >>>> to
> > > > >>>>>> work not only on “ListViews” but on any data types with any
> > > > encoding.
> > > > >>>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Weston Pace
> > > >>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > >>> krassovskysa...@gmail.com
> > > >>>> >
> > > wrote:
> > > >>>>
> > > >>>> Hi everyone,
> > > >>>> I understand that there are numerous benefits to this
> representation
> > > >>>> during query processing, but would it be fair to say that this is
> an
> > > >>>> implementation detail of the query engine? Query engines don’t
> > > >>> necessarily
> > > >>>> need to conform to the Arrow format internally, only at
> > ingest/egress
> > > >>>> points, and performing a conversion from the non-view to view
> format
> > > >>> seems
> > > >>>> like it would be very cheap (though I understand not necessarily
> the
> > > >>> other
> > > >>>> way around, but you’d need to do that anyway if you’re
> serializing).
> > > >>>>
> > > >>>> Sasha Krassovsky
> > > >>>>
> > > >>>>> 20 мая 2023 г., в 13:00, Will Jones 
> > > >>>> написал(а):
> > > >>>>>
> > > >>>>> Thanks for sharing these details, Pedro. The conditional
> branches
> > > >>>> argument
> > > >>>>> makes a lot of sense to me.
> > > >>>>>
> > > >>>>> The tensors point brings up some interesting issues. For now,
> we've
> > > >>>> defined
> > > >>>>> our only tensor extension type to be built on a fixed size list.
> > If a
> > > >>> use
> > > >>>>> case of this might be manipulating tensors with zero copy,
> perhaps
> > > >>> that
> > > >>>>> suggests that we want a fixed size list variant? In addition,
> would
> > > we
> > > >>>> have
> > > >>>>> to define another extension type to be a ListView variant? Or
> would
> > > we
> > > >>>> want
> > > >>>>> to think about making extension types somehow valid across
> various
> > > >>>>> encodings of the same "logical type"?
> > > >>>>>
> > > >>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
> > > >>>>>>  wrote:
> > > >>>>>>
> > > >>>>>> Hi all,
> > > >>>>>>
> > > >>>>>> This is Pedro from the Velox team at Meta. This is my first time
> > > >>> here,
> > > >>>> so
> > > >>>>>> nice to e-meet you!
> > > >>>>>>
> > > >>>>>> Adding to what Felipe said, the main reason we created
> “ListView”
> > > >>>> (though
> > > >>>>>> we just call them ArrayVector/MapVector in Velox) is that, along
> > > with
> > > >>>>>> StringViews for strings, they allow us to write any columnar
> > buffer
> > > >>>>>> out-or-order, regardless of their types or encodings. This is
> > > >>> naturally
> > > >>>>>> doable for all primitive types (fixed-size), but not for types
> > that
> > > >>>> don’t
> > > >>>>>> have fixed size and are required to be contiguous. The
> StringView
> > > and
> > > >>>>>> ListView formats allow us to keep this invariant in Velox.
> > > >>>>>>
> > > >>>>>> Being able to write vectors out-of-order is useful when
> executing
> > > >>>>>> conditionals like IF/SWITCH statements, which are pervasive
> among
> > > our
> > > >>>>>> workloads. To fully vectorize it, one first evaluates the
> > > expression,
> > > >>>> then
> > > >>>>>> generate a bitmap containing which rows take the THEN and which
> > take
> > > >>> the
> > > >>>>>> ELSE branch. Then you populate all rows that match the first
> > branch
> > > >>> by
> > > >>>>>> evaluating the THEN expression in a vectorized (branch-less and
> > > cache
> > > >>>>>>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Will Jones
be
> > >>>> converted to a non-view format when serializing. Could you clarify
> > >>> that? My
> > >>>> impression is that this representation could be used for persistence
> > or
> > >>>> data transfer, though it can be more complex to guarantee the
> portion
> > of
> > >>>> the buffer that an index points to is also present in memory.
> > >>>>
> > >>>> Sent from Proton Mail for iOS
> > >>>>
> > >>>>
> > >>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > >>> krassovskysa...@gmail.com
> > >>>> >
> > wrote:
> > >>>>
> > >>>> Hi everyone,
> > >>>> I understand that there are numerous benefits to this representation
> > >>>> during query processing, but would it be fair to say that this is an
> > >>>> implementation detail of the query engine? Query engines don’t
> > >>> necessarily
> > >>>> need to conform to the Arrow format internally, only at
> ingest/egress
> > >>>> points, and performing a conversion from the non-view to view format
> > >>> seems
> > >>>> like it would be very cheap (though I understand not necessarily the
> > >>> other
> > >>>> way around, but you’d need to do that anyway if you’re serializing).
> > >>>>
> > >>>> Sasha Krassovsky
> > >>>>
> > >>>>> 20 мая 2023 г., в 13:00, Will Jones 
> > >>>> написал(а):
> > >>>>>
> > >>>>> Thanks for sharing these details, Pedro. The conditional branches
> > >>>> argument
> > >>>>> makes a lot of sense to me.
> > >>>>>
> > >>>>> The tensors point brings up some interesting issues. For now, we've
> > >>>> defined
> > >>>>> our only tensor extension type to be built on a fixed size list.
> If a
> > >>> use
> > >>>>> case of this might be manipulating tensors with zero copy, perhaps
> > >>> that
> > >>>>> suggests that we want a fixed size list variant? In addition, would
> > we
> > >>>> have
> > >>>>> to define another extension type to be a ListView variant? Or would
> > we
> > >>>> want
> > >>>>> to think about making extension types somehow valid across various
> > >>>>> encodings of the same "logical type"?
> > >>>>>
> > >>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
> > >>>>>>  wrote:
> > >>>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> This is Pedro from the Velox team at Meta. This is my first time
> > >>> here,
> > >>>> so
> > >>>>>> nice to e-meet you!
> > >>>>>>
> > >>>>>> Adding to what Felipe said, the main reason we created “ListView”
> > >>>> (though
> > >>>>>> we just call them ArrayVector/MapVector in Velox) is that, along
> > with
> > >>>>>> StringViews for strings, they allow us to write any columnar
> buffer
> > >>>>>> out-or-order, regardless of their types or encodings. This is
> > >>> naturally
> > >>>>>> doable for all primitive types (fixed-size), but not for types
> that
> > >>>> don’t
> > >>>>>> have fixed size and are required to be contiguous. The StringView
> > and
> > >>>>>> ListView formats allow us to keep this invariant in Velox.
> > >>>>>>
> > >>>>>> Being able to write vectors out-of-order is useful when executing
> > >>>>>> conditionals like IF/SWITCH statements, which are pervasive among
> > our
> > >>>>>> workloads. To fully vectorize it, one first evaluates the
> > expression,
> > >>>> then
> > >>>>>> generate a bitmap containing which rows take the THEN and which
> take
> > >>> the
> > >>>>>> ELSE branch. Then you populate all rows that match the first
> branch
> > >>> by
> > >>>>>> evaluating the THEN expr

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Andrew Lamb
;>> Being able to write vectors out-of-order is useful when executing
> >>>>>> conditionals like IF/SWITCH statements, which are pervasive among
> our
> >>>>>> workloads. To fully vectorize it, one first evaluates the
> expression,
> >>>> then
> >>>>>> generate a bitmap containing which rows take the THEN and which take
> >>> the
> >>>>>> ELSE branch. Then you populate all rows that match the first branch
> >>> by
> >>>>>> evaluating the THEN expression in a vectorized (branch-less and
> cache
> >>>>>> friendly) way, and subsequently the ELSE branch. If you can’t write
> >>> them
> >>>>>> out-of-order, you would either have a big branch per row dispatching
> >>> to
> >>>> the
> >>>>>> right expression (slow), or populate two distinct vectors then
> >>> merging
> >>>> them
> >>>>>> at the end (potentially even slower). How much faster our approach
> is
> >>>>>> highly depends on the buffer sizes and expressions, but we found it
> >>> to
> >>>> be
> >>>>>> faster enough on average to justify us extending the underlying
> >>> layout.
> >>>>>>
> >>>>>> With that said, this is all within a single thread of execution.
> >>>>>> Parallelization is done by feeding each thread its own vector/data.
> >>> As
> >>>>>> pointed out in a previous message, this also gives you the
> >>> flexibility
> >>>> to
> >>>>>> implement cardinality increasing/reducing operations, but we don’t
> >>> use
> >>>> it
> >>>>>> for that purpose. Operations like filtering, joining, unnesting and
> >>>> similar
> >>>>>> are done by wrapping the internal vector in a dictionary, as these
> >>> need
> >>>> to
> >>>>>> work not only on “ListViews” but on any data types with any
> encoding.
> >>>> There
> >>>>>> are more details on Section 4.2.1 in [1]
> >>>>>>
> >>>>>> Beyond this, it also gives function/kernel developers more
> >>> flexibility
> >>>> to
> >>>>>> implement operations that manipulate Arrays/Maps. For example,
> >>>> operations
> >>>>>> that slice these containers can be implemented in a zero-copy manner
> >>> by
> >>>>>> just rearranging the lengths/offsets indices, without ever touching
> >>> the
> >>>>>> larger internal buffers. This is a similar motivation as for
> >>> StringView
> >>>>>> (think of substr(), trim(), and similar). One nice last property is
> >>> that
> >>>>>> this layout allows for overlapping ranges. This is something
> >>> discussed
> >>>> with
> >>>>>> our ML people to allow deduping feature values in a tensor (which is
> >>>> fairly
> >>>>>> common), but not something we have leveraged just yet.
> >>>>>>
> >>>>>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> >>>>>>
> >>>>>> Best,
> >>>>>> --
> >>>>>> Pedro Pedreira
> >>>>>> 
> >>>>>> From: Felipe Oliveira Carvalho 
> >>>>>> Sent: Friday, May 19, 2023 10:01 AM
> >>>>>> To: dev@arrow.apache.org 
> >>>>>> Cc: Pedro Eugenio Rocha Pedreira 
> >>>>>> Subject: Re: [DISCUSS][Format] Starting the draft implementation of
> >>> the
> >>>>>> ArrayView array format
> >>>>>>
> >>>>>> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael Taylor-Davies  >>>>>> taylordavies@ googlemail. com. invalid> wrote: Hi All, > if we
> added
> >>>>>> this, do we think many Arrow and query > engine implementations (for
> >>>>>> example, DataFusion) will be
> >>>>>> ZjQcmQRYFpfptBannerStart
> >>>>>> This Message Is From an External Sender
> >>>>>>
> >>>>>> ZjQcmQRYFpfptBannerEnd
> >>>>>> +pedroerp
> >>&g

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Antoine Pitrou
s a lot of sense to me.

The tensors point brings up some interesting issues. For now, we've

defined

our only tensor extension type to be built on a fixed size list. If a

use

case of this might be manipulating tensors with zero copy, perhaps

that

suggests that we want a fixed size list variant? In addition, would we

have

to define another extension type to be a ListView variant? Or would we

want

to think about making extension types somehow valid across various
encodings of the same "logical type"?


On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
 wrote:

Hi all,

This is Pedro from the Velox team at Meta. This is my first time

here,

so

nice to e-meet you!

Adding to what Felipe said, the main reason we created “ListView”

(though

we just call them ArrayVector/MapVector in Velox) is that, along with
StringViews for strings, they allow us to write any columnar buffer
out-or-order, regardless of their types or encodings. This is

naturally

doable for all primitive types (fixed-size), but not for types that

don’t

have fixed size and are required to be contiguous. The StringView and
ListView formats allow us to keep this invariant in Velox.

Being able to write vectors out-of-order is useful when executing
conditionals like IF/SWITCH statements, which are pervasive among our
workloads. To fully vectorize it, one first evaluates the expression,

then

generate a bitmap containing which rows take the THEN and which take

the

ELSE branch. Then you populate all rows that match the first branch

by

evaluating the THEN expression in a vectorized (branch-less and cache
friendly) way, and subsequently the ELSE branch. If you can’t write

them

out-of-order, you would either have a big branch per row dispatching

to

the

right expression (slow), or populate two distinct vectors then

merging

them

at the end (potentially even slower). How much faster our approach is
highly depends on the buffer sizes and expressions, but we found it

to

be

faster enough on average to justify us extending the underlying

layout.


With that said, this is all within a single thread of execution.
Parallelization is done by feeding each thread its own vector/data.

As

pointed out in a previous message, this also gives you the

flexibility

to

implement cardinality increasing/reducing operations, but we don’t

use

it

for that purpose. Operations like filtering, joining, unnesting and

similar

are done by wrapping the internal vector in a dictionary, as these

need

to

work not only on “ListViews” but on any data types with any encoding.

There

are more details on Section 4.2.1 in [1]

Beyond this, it also gives function/kernel developers more

flexibility

to

implement operations that manipulate Arrays/Maps. For example,

operations

that slice these containers can be implemented in a zero-copy manner

by

just rearranging the lengths/offsets indices, without ever touching

the

larger internal buffers. This is a similar motivation as for

StringView

(think of substr(), trim(), and similar). One nice last property is

that

this layout allows for overlapping ranges. This is something

discussed

with

our ML people to allow deduping feature values in a tensor (which is

fairly

common), but not something we have leveraged just yet.

[1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf

Best,
--
Pedro Pedreira

From: Felipe Oliveira Carvalho 
Sent: Friday, May 19, 2023 10:01 AM
To: dev@arrow.apache.org 
Cc: Pedro Eugenio Rocha Pedreira 
Subject: Re: [DISCUSS][Format] Starting the draft implementation of

the

ArrayView array format

+pedroerp On Thu, 11 May 2023 at 17: 51 Raphael Taylor-Davies  wrote: Hi All, > if we added
this, do we think many Arrow and query > engine implementations (for
example, DataFusion) will be
ZjQcmQRYFpfptBannerStart
This Message Is From an External Sender

ZjQcmQRYFpfptBannerEnd
+pedroerp

On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
 wrote:
Hi All,


if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion) will be eager to

add

full

support for the type, including compute kernels? Or are they likely

to

just

convert this type to ListArray at import boundaries?

I can't speak for query engines in general, but at least for arrow-rs
and by extension DataFusion, and based on my current understanding of
the use-cases I would be rather hesitant to add support to the

kernels

for this array type, definitely instead favouring conversion at the
edges. We already have issues with the amount of code generation
resulting in binary bloat and long compile times, and I worry this

would

worsen this situation whilst not really providing compelling

advantages

for the vast majority of workloads that don't interact with Velox.
Whilst I can definitely see that the ListView representation is

probably

a better way to represent variable length lists than what arrow

settled

upon, I'm not y

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-21 Thread Will Jones
;> >
>> > Sasha Krassovsky
>> >
>> > > 20 мая 2023 г., в 13:00, Will Jones 
>> > написал(а):
>> > >
>> > > Thanks for sharing these details, Pedro. The conditional branches
>> > argument
>> > > makes a lot of sense to me.
>> > >
>> > > The tensors point brings up some interesting issues. For now, we've
>> > defined
>> > > our only tensor extension type to be built on a fixed size list. If a
>> use
>> > > case of this might be manipulating tensors with zero copy, perhaps
>> that
>> > > suggests that we want a fixed size list variant? In addition, would we
>> > have
>> > > to define another extension type to be a ListView variant? Or would we
>> > want
>> > > to think about making extension types somehow valid across various
>> > > encodings of the same "logical type"?
>> > >
>> > >> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
>> > >>  wrote:
>> > >>
>> > >> Hi all,
>> > >>
>> > >> This is Pedro from the Velox team at Meta. This is my first time
>> here,
>> > so
>> > >> nice to e-meet you!
>> > >>
>> > >> Adding to what Felipe said, the main reason we created “ListView”
>> > (though
>> > >> we just call them ArrayVector/MapVector in Velox) is that, along with
>> > >> StringViews for strings, they allow us to write any columnar buffer
>> > >> out-or-order, regardless of their types or encodings. This is
>> naturally
>> > >> doable for all primitive types (fixed-size), but not for types that
>> > don’t
>> > >> have fixed size and are required to be contiguous. The StringView and
>> > >> ListView formats allow us to keep this invariant in Velox.
>> > >>
>> > >> Being able to write vectors out-of-order is useful when executing
>> > >> conditionals like IF/SWITCH statements, which are pervasive among our
>> > >> workloads. To fully vectorize it, one first evaluates the expression,
>> > then
>> > >> generate a bitmap containing which rows take the THEN and which take
>> the
>> > >> ELSE branch. Then you populate all rows that match the first branch
>> by
>> > >> evaluating the THEN expression in a vectorized (branch-less and cache
>> > >> friendly) way, and subsequently the ELSE branch. If you can’t write
>> them
>> > >> out-of-order, you would either have a big branch per row dispatching
>> to
>> > the
>> > >> right expression (slow), or populate two distinct vectors then
>> merging
>> > them
>> > >> at the end (potentially even slower). How much faster our approach is
>> > >> highly depends on the buffer sizes and expressions, but we found it
>> to
>> > be
>> > >> faster enough on average to justify us extending the underlying
>> layout.
>> > >>
>> > >> With that said, this is all within a single thread of execution.
>> > >> Parallelization is done by feeding each thread its own vector/data.
>> As
>> > >> pointed out in a previous message, this also gives you the
>> flexibility
>> > to
>> > >> implement cardinality increasing/reducing operations, but we don’t
>> use
>> > it
>> > >> for that purpose. Operations like filtering, joining, unnesting and
>> > similar
>> > >> are done by wrapping the internal vector in a dictionary, as these
>> need
>> > to
>> > >> work not only on “ListViews” but on any data types with any encoding.
>> > There
>> > >> are more details on Section 4.2.1 in [1]
>> > >>
>> > >> Beyond this, it also gives function/kernel developers more
>> flexibility
>> > to
>> > >> implement operations that manipulate Arrays/Maps. For example,
>> > operations
>> > >> that slice these containers can be implemented in a zero-copy manner
>> by
>> > >> just rearranging the lengths/offsets indices, without ever touching
>> the
>> > >> larger internal buffers. This is a similar motivation as for
>> StringView
>> > >> (think of substr(), trim(), and similar). One nice last property is
>> that
>> > >> this layout allows for overlapping ranges. This is something
>&

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-21 Thread Will Jones
order, regardless of their types or encodings. This is
> naturally
> > >> doable for all primitive types (fixed-size), but not for types that
> > don’t
> > >> have fixed size and are required to be contiguous. The StringView and
> > >> ListView formats allow us to keep this invariant in Velox.
> > >>
> > >> Being able to write vectors out-of-order is useful when executing
> > >> conditionals like IF/SWITCH statements, which are pervasive among our
> > >> workloads. To fully vectorize it, one first evaluates the expression,
> > then
> > >> generate a bitmap containing which rows take the THEN and which take
> the
> > >> ELSE branch. Then you populate all rows that match the first branch by
> > >> evaluating the THEN expression in a vectorized (branch-less and cache
> > >> friendly) way, and subsequently the ELSE branch. If you can’t write
> them
> > >> out-of-order, you would either have a big branch per row dispatching
> to
> > the
> > >> right expression (slow), or populate two distinct vectors then merging
> > them
> > >> at the end (potentially even slower). How much faster our approach is
> > >> highly depends on the buffer sizes and expressions, but we found it to
> > be
> > >> faster enough on average to justify us extending the underlying
> layout.
> > >>
> > >> With that said, this is all within a single thread of execution.
> > >> Parallelization is done by feeding each thread its own vector/data. As
> > >> pointed out in a previous message, this also gives you the flexibility
> > to
> > >> implement cardinality increasing/reducing operations, but we don’t use
> > it
> > >> for that purpose. Operations like filtering, joining, unnesting and
> > similar
> > >> are done by wrapping the internal vector in a dictionary, as these
> need
> > to
> > >> work not only on “ListViews” but on any data types with any encoding.
> > There
> > >> are more details on Section 4.2.1 in [1]
> > >>
> > >> Beyond this, it also gives function/kernel developers more flexibility
> > to
> > >> implement operations that manipulate Arrays/Maps. For example,
> > operations
> > >> that slice these containers can be implemented in a zero-copy manner
> by
> > >> just rearranging the lengths/offsets indices, without ever touching
> the
> > >> larger internal buffers. This is a similar motivation as for
> StringView
> > >> (think of substr(), trim(), and similar). One nice last property is
> that
> > >> this layout allows for overlapping ranges. This is something discussed
> > with
> > >> our ML people to allow deduping feature values in a tensor (which is
> > fairly
> > >> common), but not something we have leveraged just yet.
> > >>
> > >> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> > >>
> > >> Best,
> > >> --
> > >> Pedro Pedreira
> > >> 
> > >> From: Felipe Oliveira Carvalho 
> > >> Sent: Friday, May 19, 2023 10:01 AM
> > >> To: dev@arrow.apache.org 
> > >> Cc: Pedro Eugenio Rocha Pedreira 
> > >> Subject: Re: [DISCUSS][Format] Starting the draft implementation of
> the
> > >> ArrayView array format
> > >>
> > >> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael Taylor-Davies  > >> taylordavies@ googlemail. com. invalid> wrote: Hi All, > if we added
> > >> this, do we think many Arrow and query > engine implementations (for
> > >> example, DataFusion) will be
> > >> ZjQcmQRYFpfptBannerStart
> > >> This Message Is From an External Sender
> > >>
> > >> ZjQcmQRYFpfptBannerEnd
> > >> +pedroerp
> > >>
> > >> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
> > >>  wrote:
> > >> Hi All,
> > >>
> > >>> if we added this, do we think many Arrow and query
> > >>> engine implementations (for example, DataFusion) will be eager to add
> > >> full
> > >>> support for the type, including compute kernels? Or are they likely
> to
> > >> just
> > >>> convert this type to ListArray at import boundaries?
> > >> I can't speak for query engines in general, but at least for arrow-rs
> > >> and by extension DataFusion, and based on my

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-21 Thread Felipe Oliveira Carvalho
As
> >> pointed out in a previous message, this also gives you the flexibility
> to
> >> implement cardinality increasing/reducing operations, but we don’t use
> it
> >> for that purpose. Operations like filtering, joining, unnesting and
> similar
> >> are done by wrapping the internal vector in a dictionary, as these need
> to
> >> work not only on “ListViews” but on any data types with any encoding.
> There
> >> are more details on Section 4.2.1 in [1]
> >>
> >> Beyond this, it also gives function/kernel developers more flexibility
> to
> >> implement operations that manipulate Arrays/Maps. For example,
> operations
> >> that slice these containers can be implemented in a zero-copy manner by
> >> just rearranging the lengths/offsets indices, without ever touching the
> >> larger internal buffers. This is a similar motivation as for StringView
> >> (think of substr(), trim(), and similar). One nice last property is that
> >> this layout allows for overlapping ranges. This is something discussed
> with
> >> our ML people to allow deduping feature values in a tensor (which is
> fairly
> >> common), but not something we have leveraged just yet.
> >>
> >> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> >>
> >> Best,
> >> --
> >> Pedro Pedreira
> >> 
> >> From: Felipe Oliveira Carvalho 
> >> Sent: Friday, May 19, 2023 10:01 AM
> >> To: dev@arrow.apache.org 
> >> Cc: Pedro Eugenio Rocha Pedreira 
> >> Subject: Re: [DISCUSS][Format] Starting the draft implementation of the
> >> ArrayView array format
> >>
> >> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael Taylor-Davies  >> taylordavies@ googlemail. com. invalid> wrote: Hi All, > if we added
> >> this, do we think many Arrow and query > engine implementations (for
> >> example, DataFusion) will be
> >> ZjQcmQRYFpfptBannerStart
> >> This Message Is From an External Sender
> >>
> >> ZjQcmQRYFpfptBannerEnd
> >> +pedroerp
> >>
> >> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
> >>  wrote:
> >> Hi All,
> >>
> >>> if we added this, do we think many Arrow and query
> >>> engine implementations (for example, DataFusion) will be eager to add
> >> full
> >>> support for the type, including compute kernels? Or are they likely to
> >> just
> >>> convert this type to ListArray at import boundaries?
> >> I can't speak for query engines in general, but at least for arrow-rs
> >> and by extension DataFusion, and based on my current understanding of
> >> the use-cases I would be rather hesitant to add support to the kernels
> >> for this array type, definitely instead favouring conversion at the
> >> edges. We already have issues with the amount of code generation
> >> resulting in binary bloat and long compile times, and I worry this would
> >> worsen this situation whilst not really providing compelling advantages
> >> for the vast majority of workloads that don't interact with Velox.
> >> Whilst I can definitely see that the ListView representation is probably
> >> a better way to represent variable length lists than what arrow settled
> >> upon, I'm not yet convinced it is sufficiently better to incentivise
> >> broad ecosystem adoption.
> >>
> >> Kind Regards,
> >>
> >> Raphael Taylor-Davies
> >>
> >>> On 11/05/2023 21:20, Will Jones wrote:
> >>> Hi Felipe,
> >>>
> >>> Thanks for the additional details.
> >>>
> >>>
> >>>> Velox kernels benefit from being able to append data to the array from
> >>>> different threads without care for strict ordering. Only the offsets
> >> array
> >>>> has to be written according to logical order but that is potentially a
> >> much
> >>>> smaller buffer than the values buffer.
> >>>>
> >>> It still seems to me like applications are still pretty niche, as I
> >> suspect
> >>> in most cases the benefits are outweighed by the costs. The benefit
> here
> >>> seems pretty limited: if you are trying to split work between threads,
> >>> usually you will have other levels such as array chunks to parallelize.
> >> And
> >>> if you have an incoming stream of row data, you'll want to append in
> >>> predictable orde

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-20 Thread Aldrin
   I don't feel like this representation is necessarily a detail of the query engine, but I am also not sure why this representation would have to be converted to a non-view format when serializing. Could you clarify that? My impression is that this representation could be used for persistence or data transfer, though it can be more complex to guarantee the portion of the buffer that an index points to is also present in memory. Sent from Proton Mail for iOS On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <krassovskysa...@gmail.com> wrote:  Hi everyone,I understand that there are numerous benefits to this representation during query processing, but would it be fair to say that this is an implementation detail of the query engine? Query engines don’t necessarily need to conform to the Arrow format internally, only at ingest/egress points, and performing a conversion from the non-view to view format seems like it would be very cheap (though I understand not necessarily the other way around, but you’d need to do that anyway if you’re serializing).Sasha Krassovsky> 20 мая 2023 г., в 13:00, Will Jones  написал(а):>> Thanks for sharing these details, Pedro. The conditional branches argument> makes a lot of sense to me.>> The tensors point brings up some interesting issues. For now, we've defined> our only tensor extension type to be built on a fixed size list. If a use> case of this might be manipulating tensors with zero copy, perhaps that> suggests that we want a fixed size list variant? In addition, would we have> to define another extension type to be a ListView variant? Or would we want> to think about making extension types somehow valid across various> encodings of the same "logical type"?>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira>>  wrote:>>>> Hi all,>>>> This is Pedro from the Velox team at Meta. This is my first time here, so>> nice to e-meet you!>>>> Adding to what Felipe said, the main reason we created “ListView” (though>> we just call them ArrayVector/MapVector in Velox) is that, along with>> StringViews for strings, they allow us to write any columnar buffer>> out-or-order, regardless of their types or encodings. This is naturally>> doable for all primitive types (fixed-size), but not for types that don’t>> have fixed size and are required to be contiguous. The StringView and>> ListView formats allow us to keep this invariant in Velox.>>>> Being able to write vectors out-of-order is useful when executing>> conditionals like IF/SWITCH statements, which are pervasive among our>> workloads. To fully vectorize it, one first evaluates the _expression_, then>> generate a bitmap containing which rows take the THEN and which take the>> ELSE branch. Then you populate all rows that match the first branch by>> evaluating the THEN _expression_ in a vectorized (branch-less and cache>> friendly) way, and subsequently the ELSE branch. If you can’t write them>> out-of-order, you would either have a big branch per row dispatching to the>> right _expression_ (slow), or populate two distinct vectors then merging them>> at the end (potentially even slower). How much faster our approach is>> highly depends on the buffer sizes and expressions, but we found it to be>> faster enough on average to justify us extending the underlying layout.>>>> With that said, this is all within a single thread of execution.>> Parallelization is done by feeding each thread its own vector/data. As>> pointed out in a previous message, this also gives you the flexibility to>> implement cardinality increasing/reducing operations, but we don’t use it>> for that purpose. Operations like filtering, joining, unnesting and similar>> are done by wrapping the internal vector in a dictionary, as these need to>> work not only on “ListViews” but on any data types with any encoding. There>> are more details on Section 4.2.1 in [1]>>>> Beyond this, it also gives function/kernel developers more flexibility to>> implement operations that manipulate Arrays/Maps. For example, operations>> that slice these containers can be implemented in a zero-copy manner by>> just rearranging the lengths/offsets indices, without ever touching the>> larger internal buffers. This is a similar motivation as for StringView>> (think of substr(), trim(), and similar). One nice last property is that>> this layout allows for overlapping ranges. This is something discussed with>> our ML people to allow deduping feature values in a tensor (which is fairly>> common), but not something we have leveraged just yet.>>>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf>>>> Best,>> -->> Pedro

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-20 Thread Sasha Krassovsky
Hi everyone,
I understand that there are numerous benefits to this representation during 
query processing, but would it be fair to say that this is an implementation 
detail of the query engine? Query engines don’t necessarily need to conform to 
the Arrow format internally, only at ingest/egress points, and performing a 
conversion from the non-view to view format seems like it would be very cheap 
(though I understand not necessarily the other way around, but you’d need to do 
that anyway if you’re serializing). 

Sasha Krassovsky

> 20 мая 2023 г., в 13:00, Will Jones  написал(а):
> 
> Thanks for sharing these details, Pedro. The conditional branches argument
> makes a lot of sense to me.
> 
> The tensors point brings up some interesting issues. For now, we've defined
> our only tensor extension type to be built on a fixed size list. If a use
> case of this might be manipulating tensors with zero copy, perhaps that
> suggests that we want a fixed size list variant? In addition, would we have
> to define another extension type to be a ListView variant? Or would we want
> to think about making extension types somehow valid across various
> encodings of the same "logical type"?
> 
>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
>>  wrote:
>> 
>> Hi all,
>> 
>> This is Pedro from the Velox team at Meta. This is my first time here, so
>> nice to e-meet you!
>> 
>> Adding to what Felipe said, the main reason we created “ListView” (though
>> we just call them ArrayVector/MapVector in Velox) is that, along with
>> StringViews for strings, they allow us to write any columnar buffer
>> out-or-order, regardless of their types or encodings. This is naturally
>> doable for all primitive types (fixed-size), but not for types that don’t
>> have fixed size and are required to be contiguous. The StringView and
>> ListView formats allow us to keep this invariant in Velox.
>> 
>> Being able to write vectors out-of-order is useful when executing
>> conditionals like IF/SWITCH statements, which are pervasive among our
>> workloads. To fully vectorize it, one first evaluates the expression, then
>> generate a bitmap containing which rows take the THEN and which take the
>> ELSE branch. Then you populate all rows that match the first branch by
>> evaluating the THEN expression in a vectorized (branch-less and cache
>> friendly) way, and subsequently the ELSE branch. If you can’t write them
>> out-of-order, you would either have a big branch per row dispatching to the
>> right expression (slow), or populate two distinct vectors then merging them
>> at the end (potentially even slower). How much faster our approach is
>> highly depends on the buffer sizes and expressions, but we found it to be
>> faster enough on average to justify us extending the underlying layout.
>> 
>> With that said, this is all within a single thread of execution.
>> Parallelization is done by feeding each thread its own vector/data. As
>> pointed out in a previous message, this also gives you the flexibility to
>> implement cardinality increasing/reducing operations, but we don’t use it
>> for that purpose. Operations like filtering, joining, unnesting and similar
>> are done by wrapping the internal vector in a dictionary, as these need to
>> work not only on “ListViews” but on any data types with any encoding. There
>> are more details on Section 4.2.1 in [1]
>> 
>> Beyond this, it also gives function/kernel developers more flexibility to
>> implement operations that manipulate Arrays/Maps. For example, operations
>> that slice these containers can be implemented in a zero-copy manner by
>> just rearranging the lengths/offsets indices, without ever touching the
>> larger internal buffers. This is a similar motivation as for StringView
>> (think of substr(), trim(), and similar). One nice last property is that
>> this layout allows for overlapping ranges. This is something discussed with
>> our ML people to allow deduping feature values in a tensor (which is fairly
>> common), but not something we have leveraged just yet.
>> 
>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
>> 
>> Best,
>> --
>> Pedro Pedreira
>> 
>> From: Felipe Oliveira Carvalho 
>> Sent: Friday, May 19, 2023 10:01 AM
>> To: dev@arrow.apache.org 
>> Cc: Pedro Eugenio Rocha Pedreira 
>> Subject: Re: [DISCUSS][Format] Starting the draft implementation of the
>> ArrayView array format
>> 
>> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael Taylor-Davies > taylordavies@ googlemail. com. invalid> wrote: Hi 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-20 Thread Will Jones
Thanks for sharing these details, Pedro. The conditional branches argument
makes a lot of sense to me.

The tensors point brings up some interesting issues. For now, we've defined
our only tensor extension type to be built on a fixed size list. If a use
case of this might be manipulating tensors with zero copy, perhaps that
suggests that we want a fixed size list variant? In addition, would we have
to define another extension type to be a ListView variant? Or would we want
to think about making extension types somehow valid across various
encodings of the same "logical type"?

On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
 wrote:

> Hi all,
>
> This is Pedro from the Velox team at Meta. This is my first time here, so
> nice to e-meet you!
>
> Adding to what Felipe said, the main reason we created “ListView” (though
> we just call them ArrayVector/MapVector in Velox) is that, along with
> StringViews for strings, they allow us to write any columnar buffer
> out-or-order, regardless of their types or encodings. This is naturally
> doable for all primitive types (fixed-size), but not for types that don’t
> have fixed size and are required to be contiguous. The StringView and
> ListView formats allow us to keep this invariant in Velox.
>
> Being able to write vectors out-of-order is useful when executing
> conditionals like IF/SWITCH statements, which are pervasive among our
> workloads. To fully vectorize it, one first evaluates the expression, then
> generate a bitmap containing which rows take the THEN and which take the
> ELSE branch. Then you populate all rows that match the first branch by
> evaluating the THEN expression in a vectorized (branch-less and cache
> friendly) way, and subsequently the ELSE branch. If you can’t write them
> out-of-order, you would either have a big branch per row dispatching to the
> right expression (slow), or populate two distinct vectors then merging them
> at the end (potentially even slower). How much faster our approach is
> highly depends on the buffer sizes and expressions, but we found it to be
> faster enough on average to justify us extending the underlying layout.
>
> With that said, this is all within a single thread of execution.
> Parallelization is done by feeding each thread its own vector/data. As
> pointed out in a previous message, this also gives you the flexibility to
> implement cardinality increasing/reducing operations, but we don’t use it
> for that purpose. Operations like filtering, joining, unnesting and similar
> are done by wrapping the internal vector in a dictionary, as these need to
> work not only on “ListViews” but on any data types with any encoding. There
> are more details on Section 4.2.1 in [1]
>
> Beyond this, it also gives function/kernel developers more flexibility to
> implement operations that manipulate Arrays/Maps. For example, operations
> that slice these containers can be implemented in a zero-copy manner by
> just rearranging the lengths/offsets indices, without ever touching the
> larger internal buffers. This is a similar motivation as for StringView
> (think of substr(), trim(), and similar). One nice last property is that
> this layout allows for overlapping ranges. This is something discussed with
> our ML people to allow deduping feature values in a tensor (which is fairly
> common), but not something we have leveraged just yet.
>
> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
>
> Best,
> --
> Pedro Pedreira
> ________
> From: Felipe Oliveira Carvalho 
> Sent: Friday, May 19, 2023 10:01 AM
> To: dev@arrow.apache.org 
> Cc: Pedro Eugenio Rocha Pedreira 
> Subject: Re: [DISCUSS][Format] Starting the draft implementation of the
> ArrayView array format
>
> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael Taylor-Davies  taylordavies@ googlemail. com. invalid> wrote: Hi All, > if we added
> this, do we think many Arrow and query > engine implementations (for
> example, DataFusion) will be
> ZjQcmQRYFpfptBannerStart
> This Message Is From an External Sender
>
> ZjQcmQRYFpfptBannerEnd
> +pedroerp
>
> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
>  wrote:
> Hi All,
>
> > if we added this, do we think many Arrow and query
> > engine implementations (for example, DataFusion) will be eager to add
> full
> > support for the type, including compute kernels? Or are they likely to
> just
> > convert this type to ListArray at import boundaries?
> I can't speak for query engines in general, but at least for arrow-rs
> and by extension DataFusion, and based on my current understanding of
> the use-cases I would be rather hesitant to add support to the kernels
> for this array type, definitely instead favour

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-19 Thread Pedro Eugenio Rocha Pedreira
Hi all,

This is Pedro from the Velox team at Meta. This is my first time here, so nice 
to e-meet you!

Adding to what Felipe said, the main reason we created “ListView” (though we 
just call them ArrayVector/MapVector in Velox) is that, along with StringViews 
for strings, they allow us to write any columnar buffer out-or-order, 
regardless of their types or encodings. This is naturally doable for all 
primitive types (fixed-size), but not for types that don’t have fixed size and 
are required to be contiguous. The StringView and ListView formats allow us to 
keep this invariant in Velox.

Being able to write vectors out-of-order is useful when executing conditionals 
like IF/SWITCH statements, which are pervasive among our workloads. To fully 
vectorize it, one first evaluates the expression, then generate a bitmap 
containing which rows take the THEN and which take the ELSE branch. Then you 
populate all rows that match the first branch by evaluating the THEN expression 
in a vectorized (branch-less and cache friendly) way, and subsequently the ELSE 
branch. If you can’t write them out-of-order, you would either have a big 
branch per row dispatching to the right expression (slow), or populate two 
distinct vectors then merging them at the end (potentially even slower). How 
much faster our approach is highly depends on the buffer sizes and expressions, 
but we found it to be faster enough on average to justify us extending the 
underlying layout.

With that said, this is all within a single thread of execution. 
Parallelization is done by feeding each thread its own vector/data. As pointed 
out in a previous message, this also gives you the flexibility to implement 
cardinality increasing/reducing operations, but we don’t use it for that 
purpose. Operations like filtering, joining, unnesting and similar are done by 
wrapping the internal vector in a dictionary, as these need to work not only on 
“ListViews” but on any data types with any encoding. There are more details on 
Section 4.2.1 in [1]

Beyond this, it also gives function/kernel developers more flexibility to 
implement operations that manipulate Arrays/Maps. For example, operations that 
slice these containers can be implemented in a zero-copy manner by just 
rearranging the lengths/offsets indices, without ever touching the larger 
internal buffers. This is a similar motivation as for StringView (think of 
substr(), trim(), and similar). One nice last property is that this layout 
allows for overlapping ranges. This is something discussed with our ML people 
to allow deduping feature values in a tensor (which is fairly common), but not 
something we have leveraged just yet.

[1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf

Best,
--
Pedro Pedreira

From: Felipe Oliveira Carvalho 
Sent: Friday, May 19, 2023 10:01 AM
To: dev@arrow.apache.org 
Cc: Pedro Eugenio Rocha Pedreira 
Subject: Re: [DISCUSS][Format] Starting the draft implementation of the 
ArrayView array format

+pedroerp On Thu, 11 May 2023 at 17: 51 Raphael Taylor-Davies  wrote: Hi All, > if we added this, do we think many 
Arrow and query > engine implementations (for example, DataFusion) will be
ZjQcmQRYFpfptBannerStart
This Message Is From an External Sender

ZjQcmQRYFpfptBannerEnd
+pedroerp

On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies 
 wrote:
Hi All,

> if we added this, do we think many Arrow and query
> engine implementations (for example, DataFusion) will be eager to add full
> support for the type, including compute kernels? Or are they likely to just
> convert this type to ListArray at import boundaries?
I can't speak for query engines in general, but at least for arrow-rs
and by extension DataFusion, and based on my current understanding of
the use-cases I would be rather hesitant to add support to the kernels
for this array type, definitely instead favouring conversion at the
edges. We already have issues with the amount of code generation
resulting in binary bloat and long compile times, and I worry this would
worsen this situation whilst not really providing compelling advantages
for the vast majority of workloads that don't interact with Velox.
Whilst I can definitely see that the ListView representation is probably
a better way to represent variable length lists than what arrow settled
upon, I'm not yet convinced it is sufficiently better to incentivise
broad ecosystem adoption.

Kind Regards,

Raphael Taylor-Davies

On 11/05/2023 21:20, Will Jones wrote:
> Hi Felipe,
>
> Thanks for the additional details.
>
>
>> Velox kernels benefit from being able to append data to the array from
>> different threads without care for strict ordering. Only the offsets array
>> has to be written according to logical order but that is potentially a much
>> smaller buffer than the values buffer.
>>
> It still seems to me like applications are still pretty niche, as 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-19 Thread Felipe Oliveira Carvalho
+pedroerp

On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
 wrote:

> Hi All,
>
> > if we added this, do we think many Arrow and query
> > engine implementations (for example, DataFusion) will be eager to add
> full
> > support for the type, including compute kernels? Or are they likely to
> just
> > convert this type to ListArray at import boundaries?
> I can't speak for query engines in general, but at least for arrow-rs
> and by extension DataFusion, and based on my current understanding of
> the use-cases I would be rather hesitant to add support to the kernels
> for this array type, definitely instead favouring conversion at the
> edges. We already have issues with the amount of code generation
> resulting in binary bloat and long compile times, and I worry this would
> worsen this situation whilst not really providing compelling advantages
> for the vast majority of workloads that don't interact with Velox.
> Whilst I can definitely see that the ListView representation is probably
> a better way to represent variable length lists than what arrow settled
> upon, I'm not yet convinced it is sufficiently better to incentivise
> broad ecosystem adoption.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> On 11/05/2023 21:20, Will Jones wrote:
> > Hi Felipe,
> >
> > Thanks for the additional details.
> >
> >
> >> Velox kernels benefit from being able to append data to the array from
> >> different threads without care for strict ordering. Only the offsets
> array
> >> has to be written according to logical order but that is potentially a
> much
> >> smaller buffer than the values buffer.
> >>
> > It still seems to me like applications are still pretty niche, as I
> suspect
> > in most cases the benefits are outweighed by the costs. The benefit here
> > seems pretty limited: if you are trying to split work between threads,
> > usually you will have other levels such as array chunks to parallelize.
> And
> > if you have an incoming stream of row data, you'll want to append in
> > predictable order to match the order of the other arrays. Am I missing
> > something?
> >
> > And, IIUC, the cost of using ListView with out-of-order values over
> > ListArray is you lose memory locality; the values of element 2 are no
> > longer adjacent to the values of element 1. What do you think about that
> > tradeoff?
> >
> > I don't mean to be difficult about this. I'm excited for both the REE and
> > StringView arrays, but this one I'm not so sure about yet. I suppose
> what I
> > am trying to ask is, if we added this, do we think many Arrow and query
> > engine implementations (for example, DataFusion) will be eager to add
> full
> > support for the type, including compute kernels? Or are they likely to
> just
> > convert this type to ListArray at import boundaries?
> >
> > Because if it turns out to be the latter, then we might as well ask Velox
> > to export this type as ListArray and save the rest of the ecosystem some
> > work.
> >
> > Best,
> >
> > Will Jones
> >
> > On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
> > felipe...@gmail.com> wrote:
> >
> >> Initial reason for ListView arrays in Arrow is zero-copy compatibility
> with
> >> Velox which uses this format.
> >>
> >> Velox kernels benefit from being able to append data to the array from
> >> different threads without care for strict ordering. Only the offsets
> array
> >> has to be written according to logical order but that is potentially a
> much
> >> smaller buffer than the values buffer.
> >>
> >> Acero kernels could take advantage of that in the future.
> >>
> >> In implementing ListViewArray/Type I was able to reuse some C++
> templates
> >> used for ListArray which can reduce some of the burden on kernel
> >> implementations that aim to work with all the types.
> >>
> >> I’m can fix Acero kernels for working with ListView. This is similar to
> the
> >> work I’ve doing in kernels dealing with run-end encoded arrays.
> >>
> >> —
> >> Felipe
> >>
> >>
> >> On Wed, 26 Apr 2023 at 01:03 Will Jones 
> wrote:
> >>
> >>> I suppose one common use case is materializing list columns after some
> >>> expanding operation like a join or unnest. That's a case where I could
> >>> imagine a lot of repetition of values. Haven't yet thought of common
> >> cases
> >>> where there is overlap but not full duplication, but am eager to hear
> >> any.
> >>> The dictionary encoding point Raphael makes is interesting, especially
> >>> given the existence of LargeList and FixedSizeList. For many
> operations,
> >> it
> >>> might make more sense to just compose those existing types.
> >>>
> >>> IIUC the operations that would be unique to the ArrayView are ones
> >> altering
> >>> the shape. One could truncate each array to a certain length cheaply
> >> simply
> >>> by replacing the sizes buffer. Or perhaps there are interesting
> >> operations
> >>> on tensors that would benefit.
> >>>
> >>> On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
> >>>  wrote:
> >>>
> 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-19 Thread Ian Cook
That's great, thanks Brent. If possible could you share a specific
example of the operation you are referring to so that we can better
reason about how the ListView layout would help in this case?

Any additional input from the community providing specifics of
real-world workloads that are expected to benefit from the ListView
layout would be much appreciated.
.
Ian

On Mon, May 15, 2023 at 5:12 PM Brent Gardner
 wrote:
>
> For what it's worth, my company is building a database using arrow(rs) as
> an in memory storage format, and this feature would be very helpful because
> it would allow us to bitmask out mvcc rows that have been deleted / have
> not yet been committed / have been rolled back, etc.
>
> - Brent
>
> On Mon, May 15, 2023, 06:55 Ian Cook  wrote:
>
> > I think it would be easier for us all to weigh the costs and benefits
> > of adding this proposed ListView layout to the Arrow specification and
> > implementing it in the various Arrow libraries if we could all see
> > some benchmarks demonstrating the performance/efficiency benefits
> > compared to Arrow’s existing List layouts.
> >
> > Based on the Velox paper [1] and from conversations with the Velox
> > developers, I would anticipate that these benchmarks will show that
> > ListView confers substantial performance/efficiency benefits on some
> > workloads. I suggest conferring with the Velox developers to identify
> > benchmark workloads will best demonstrate the performance/efficiency
> > benefit of the ListView layout while representing common real-world
> > workloads.
> >
> > Ian
> >
> > [1] https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> >
> > On Sat, May 13, 2023 at 3:09 PM Andrew Lamb  wrote:
> > >
> > > I agree that it is hard to see  any compelling advantage of adopting
> > > ListView that would incentivize adding it to DataFusion.
> > >
> > > It also seems like the conversion requires changing only indexes (not the
> > > underlying data) so it would likely be relatively inexpensive I would
> > think
> > >
> > > On Thu, May 11, 2023 at 4:51 PM Raphael Taylor-Davies
> > >  wrote:
> > >
> > > > Hi All,
> > > >
> > > > > if we added this, do we think many Arrow and query
> > > > > engine implementations (for example, DataFusion) will be eager to add
> > > > full
> > > > > support for the type, including compute kernels? Or are they likely
> > to
> > > > just
> > > > > convert this type to ListArray at import boundaries?
> > > > I can't speak for query engines in general, but at least for arrow-rs
> > > > and by extension DataFusion, and based on my current understanding of
> > > > the use-cases I would be rather hesitant to add support to the kernels
> > > > for this array type, definitely instead favouring conversion at the
> > > > edges. We already have issues with the amount of code generation
> > > > resulting in binary bloat and long compile times, and I worry this
> > would
> > > > worsen this situation whilst not really providing compelling advantages
> > > > for the vast majority of workloads that don't interact with Velox.
> > > > Whilst I can definitely see that the ListView representation is
> > probably
> > > > a better way to represent variable length lists than what arrow settled
> > > > upon, I'm not yet convinced it is sufficiently better to incentivise
> > > > broad ecosystem adoption.
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael Taylor-Davies
> > > >
> > > > On 11/05/2023 21:20, Will Jones wrote:
> > > > > Hi Felipe,
> > > > >
> > > > > Thanks for the additional details.
> > > > >
> > > > >
> > > > >> Velox kernels benefit from being able to append data to the array
> > from
> > > > >> different threads without care for strict ordering. Only the offsets
> > > > array
> > > > >> has to be written according to logical order but that is
> > potentially a
> > > > much
> > > > >> smaller buffer than the values buffer.
> > > > >>
> > > > > It still seems to me like applications are still pretty niche, as I
> > > > suspect
> > > > > in most cases the benefits are outweighed by the costs. The benefit
> > here
> > > > > seems pretty limited: if you are trying to split work between
> > threads,
> > > > > usually you will have other levels such as array chunks to
> > parallelize.
> > > > And
> > > > > if you have an incoming stream of row data, you'll want to append in
> > > > > predictable order to match the order of the other arrays. Am I
> > missing
> > > > > something?
> > > > >
> > > > > And, IIUC, the cost of using ListView with out-of-order values over
> > > > > ListArray is you lose memory locality; the values of element 2 are no
> > > > > longer adjacent to the values of element 1. What do you think about
> > that
> > > > > tradeoff?
> > > > >
> > > > > I don't mean to be difficult about this. I'm excited for both the
> > REE and
> > > > > StringView arrays, but this one I'm not so sure about yet. I suppose
> > > > what I
> > > > > am trying to ask is, if we added this, do we think many Arrow and
> > query

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-15 Thread Brent Gardner
For what it's worth, my company is building a database using arrow(rs) as
an in memory storage format, and this feature would be very helpful because
it would allow us to bitmask out mvcc rows that have been deleted / have
not yet been committed / have been rolled back, etc.

- Brent

On Mon, May 15, 2023, 06:55 Ian Cook  wrote:

> I think it would be easier for us all to weigh the costs and benefits
> of adding this proposed ListView layout to the Arrow specification and
> implementing it in the various Arrow libraries if we could all see
> some benchmarks demonstrating the performance/efficiency benefits
> compared to Arrow’s existing List layouts.
>
> Based on the Velox paper [1] and from conversations with the Velox
> developers, I would anticipate that these benchmarks will show that
> ListView confers substantial performance/efficiency benefits on some
> workloads. I suggest conferring with the Velox developers to identify
> benchmark workloads will best demonstrate the performance/efficiency
> benefit of the ListView layout while representing common real-world
> workloads.
>
> Ian
>
> [1] https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
>
> On Sat, May 13, 2023 at 3:09 PM Andrew Lamb  wrote:
> >
> > I agree that it is hard to see  any compelling advantage of adopting
> > ListView that would incentivize adding it to DataFusion.
> >
> > It also seems like the conversion requires changing only indexes (not the
> > underlying data) so it would likely be relatively inexpensive I would
> think
> >
> > On Thu, May 11, 2023 at 4:51 PM Raphael Taylor-Davies
> >  wrote:
> >
> > > Hi All,
> > >
> > > > if we added this, do we think many Arrow and query
> > > > engine implementations (for example, DataFusion) will be eager to add
> > > full
> > > > support for the type, including compute kernels? Or are they likely
> to
> > > just
> > > > convert this type to ListArray at import boundaries?
> > > I can't speak for query engines in general, but at least for arrow-rs
> > > and by extension DataFusion, and based on my current understanding of
> > > the use-cases I would be rather hesitant to add support to the kernels
> > > for this array type, definitely instead favouring conversion at the
> > > edges. We already have issues with the amount of code generation
> > > resulting in binary bloat and long compile times, and I worry this
> would
> > > worsen this situation whilst not really providing compelling advantages
> > > for the vast majority of workloads that don't interact with Velox.
> > > Whilst I can definitely see that the ListView representation is
> probably
> > > a better way to represent variable length lists than what arrow settled
> > > upon, I'm not yet convinced it is sufficiently better to incentivise
> > > broad ecosystem adoption.
> > >
> > > Kind Regards,
> > >
> > > Raphael Taylor-Davies
> > >
> > > On 11/05/2023 21:20, Will Jones wrote:
> > > > Hi Felipe,
> > > >
> > > > Thanks for the additional details.
> > > >
> > > >
> > > >> Velox kernels benefit from being able to append data to the array
> from
> > > >> different threads without care for strict ordering. Only the offsets
> > > array
> > > >> has to be written according to logical order but that is
> potentially a
> > > much
> > > >> smaller buffer than the values buffer.
> > > >>
> > > > It still seems to me like applications are still pretty niche, as I
> > > suspect
> > > > in most cases the benefits are outweighed by the costs. The benefit
> here
> > > > seems pretty limited: if you are trying to split work between
> threads,
> > > > usually you will have other levels such as array chunks to
> parallelize.
> > > And
> > > > if you have an incoming stream of row data, you'll want to append in
> > > > predictable order to match the order of the other arrays. Am I
> missing
> > > > something?
> > > >
> > > > And, IIUC, the cost of using ListView with out-of-order values over
> > > > ListArray is you lose memory locality; the values of element 2 are no
> > > > longer adjacent to the values of element 1. What do you think about
> that
> > > > tradeoff?
> > > >
> > > > I don't mean to be difficult about this. I'm excited for both the
> REE and
> > > > StringView arrays, but this one I'm not so sure about yet. I suppose
> > > what I
> > > > am trying to ask is, if we added this, do we think many Arrow and
> query
> > > > engine implementations (for example, DataFusion) will be eager to add
> > > full
> > > > support for the type, including compute kernels? Or are they likely
> to
> > > just
> > > > convert this type to ListArray at import boundaries?
> > > >
> > > > Because if it turns out to be the latter, then we might as well ask
> Velox
> > > > to export this type as ListArray and save the rest of the ecosystem
> some
> > > > work.
> > > >
> > > > Best,
> > > >
> > > > Will Jones
> > > >
> > > > On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
> > > > felipe...@gmail.com> wrote:
> > > >
> > > >> Initial reason for ListView 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-15 Thread Ian Cook
I think it would be easier for us all to weigh the costs and benefits
of adding this proposed ListView layout to the Arrow specification and
implementing it in the various Arrow libraries if we could all see
some benchmarks demonstrating the performance/efficiency benefits
compared to Arrow’s existing List layouts.

Based on the Velox paper [1] and from conversations with the Velox
developers, I would anticipate that these benchmarks will show that
ListView confers substantial performance/efficiency benefits on some
workloads. I suggest conferring with the Velox developers to identify
benchmark workloads will best demonstrate the performance/efficiency
benefit of the ListView layout while representing common real-world
workloads.

Ian

[1] https://vldb.org/pvldb/vol15/p3372-pedreira.pdf

On Sat, May 13, 2023 at 3:09 PM Andrew Lamb  wrote:
>
> I agree that it is hard to see  any compelling advantage of adopting
> ListView that would incentivize adding it to DataFusion.
>
> It also seems like the conversion requires changing only indexes (not the
> underlying data) so it would likely be relatively inexpensive I would think
>
> On Thu, May 11, 2023 at 4:51 PM Raphael Taylor-Davies
>  wrote:
>
> > Hi All,
> >
> > > if we added this, do we think many Arrow and query
> > > engine implementations (for example, DataFusion) will be eager to add
> > full
> > > support for the type, including compute kernels? Or are they likely to
> > just
> > > convert this type to ListArray at import boundaries?
> > I can't speak for query engines in general, but at least for arrow-rs
> > and by extension DataFusion, and based on my current understanding of
> > the use-cases I would be rather hesitant to add support to the kernels
> > for this array type, definitely instead favouring conversion at the
> > edges. We already have issues with the amount of code generation
> > resulting in binary bloat and long compile times, and I worry this would
> > worsen this situation whilst not really providing compelling advantages
> > for the vast majority of workloads that don't interact with Velox.
> > Whilst I can definitely see that the ListView representation is probably
> > a better way to represent variable length lists than what arrow settled
> > upon, I'm not yet convinced it is sufficiently better to incentivise
> > broad ecosystem adoption.
> >
> > Kind Regards,
> >
> > Raphael Taylor-Davies
> >
> > On 11/05/2023 21:20, Will Jones wrote:
> > > Hi Felipe,
> > >
> > > Thanks for the additional details.
> > >
> > >
> > >> Velox kernels benefit from being able to append data to the array from
> > >> different threads without care for strict ordering. Only the offsets
> > array
> > >> has to be written according to logical order but that is potentially a
> > much
> > >> smaller buffer than the values buffer.
> > >>
> > > It still seems to me like applications are still pretty niche, as I
> > suspect
> > > in most cases the benefits are outweighed by the costs. The benefit here
> > > seems pretty limited: if you are trying to split work between threads,
> > > usually you will have other levels such as array chunks to parallelize.
> > And
> > > if you have an incoming stream of row data, you'll want to append in
> > > predictable order to match the order of the other arrays. Am I missing
> > > something?
> > >
> > > And, IIUC, the cost of using ListView with out-of-order values over
> > > ListArray is you lose memory locality; the values of element 2 are no
> > > longer adjacent to the values of element 1. What do you think about that
> > > tradeoff?
> > >
> > > I don't mean to be difficult about this. I'm excited for both the REE and
> > > StringView arrays, but this one I'm not so sure about yet. I suppose
> > what I
> > > am trying to ask is, if we added this, do we think many Arrow and query
> > > engine implementations (for example, DataFusion) will be eager to add
> > full
> > > support for the type, including compute kernels? Or are they likely to
> > just
> > > convert this type to ListArray at import boundaries?
> > >
> > > Because if it turns out to be the latter, then we might as well ask Velox
> > > to export this type as ListArray and save the rest of the ecosystem some
> > > work.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
> > > felipe...@gmail.com> wrote:
> > >
> > >> Initial reason for ListView arrays in Arrow is zero-copy compatibility
> > with
> > >> Velox which uses this format.
> > >>
> > >> Velox kernels benefit from being able to append data to the array from
> > >> different threads without care for strict ordering. Only the offsets
> > array
> > >> has to be written according to logical order but that is potentially a
> > much
> > >> smaller buffer than the values buffer.
> > >>
> > >> Acero kernels could take advantage of that in the future.
> > >>
> > >> In implementing ListViewArray/Type I was able to reuse some C++
> > templates
> > >> used 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-13 Thread Andrew Lamb
I agree that it is hard to see  any compelling advantage of adopting
ListView that would incentivize adding it to DataFusion.

It also seems like the conversion requires changing only indexes (not the
underlying data) so it would likely be relatively inexpensive I would think

On Thu, May 11, 2023 at 4:51 PM Raphael Taylor-Davies
 wrote:

> Hi All,
>
> > if we added this, do we think many Arrow and query
> > engine implementations (for example, DataFusion) will be eager to add
> full
> > support for the type, including compute kernels? Or are they likely to
> just
> > convert this type to ListArray at import boundaries?
> I can't speak for query engines in general, but at least for arrow-rs
> and by extension DataFusion, and based on my current understanding of
> the use-cases I would be rather hesitant to add support to the kernels
> for this array type, definitely instead favouring conversion at the
> edges. We already have issues with the amount of code generation
> resulting in binary bloat and long compile times, and I worry this would
> worsen this situation whilst not really providing compelling advantages
> for the vast majority of workloads that don't interact with Velox.
> Whilst I can definitely see that the ListView representation is probably
> a better way to represent variable length lists than what arrow settled
> upon, I'm not yet convinced it is sufficiently better to incentivise
> broad ecosystem adoption.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> On 11/05/2023 21:20, Will Jones wrote:
> > Hi Felipe,
> >
> > Thanks for the additional details.
> >
> >
> >> Velox kernels benefit from being able to append data to the array from
> >> different threads without care for strict ordering. Only the offsets
> array
> >> has to be written according to logical order but that is potentially a
> much
> >> smaller buffer than the values buffer.
> >>
> > It still seems to me like applications are still pretty niche, as I
> suspect
> > in most cases the benefits are outweighed by the costs. The benefit here
> > seems pretty limited: if you are trying to split work between threads,
> > usually you will have other levels such as array chunks to parallelize.
> And
> > if you have an incoming stream of row data, you'll want to append in
> > predictable order to match the order of the other arrays. Am I missing
> > something?
> >
> > And, IIUC, the cost of using ListView with out-of-order values over
> > ListArray is you lose memory locality; the values of element 2 are no
> > longer adjacent to the values of element 1. What do you think about that
> > tradeoff?
> >
> > I don't mean to be difficult about this. I'm excited for both the REE and
> > StringView arrays, but this one I'm not so sure about yet. I suppose
> what I
> > am trying to ask is, if we added this, do we think many Arrow and query
> > engine implementations (for example, DataFusion) will be eager to add
> full
> > support for the type, including compute kernels? Or are they likely to
> just
> > convert this type to ListArray at import boundaries?
> >
> > Because if it turns out to be the latter, then we might as well ask Velox
> > to export this type as ListArray and save the rest of the ecosystem some
> > work.
> >
> > Best,
> >
> > Will Jones
> >
> > On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
> > felipe...@gmail.com> wrote:
> >
> >> Initial reason for ListView arrays in Arrow is zero-copy compatibility
> with
> >> Velox which uses this format.
> >>
> >> Velox kernels benefit from being able to append data to the array from
> >> different threads without care for strict ordering. Only the offsets
> array
> >> has to be written according to logical order but that is potentially a
> much
> >> smaller buffer than the values buffer.
> >>
> >> Acero kernels could take advantage of that in the future.
> >>
> >> In implementing ListViewArray/Type I was able to reuse some C++
> templates
> >> used for ListArray which can reduce some of the burden on kernel
> >> implementations that aim to work with all the types.
> >>
> >> I’m can fix Acero kernels for working with ListView. This is similar to
> the
> >> work I’ve doing in kernels dealing with run-end encoded arrays.
> >>
> >> —
> >> Felipe
> >>
> >>
> >> On Wed, 26 Apr 2023 at 01:03 Will Jones 
> wrote:
> >>
> >>> I suppose one common use case is materializing list columns after some
> >>> expanding operation like a join or unnest. That's a case where I could
> >>> imagine a lot of repetition of values. Haven't yet thought of common
> >> cases
> >>> where there is overlap but not full duplication, but am eager to hear
> >> any.
> >>> The dictionary encoding point Raphael makes is interesting, especially
> >>> given the existence of LargeList and FixedSizeList. For many
> operations,
> >> it
> >>> might make more sense to just compose those existing types.
> >>>
> >>> IIUC the operations that would be unique to the ArrayView are ones
> >> altering
> >>> the shape. One could 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-11 Thread Raphael Taylor-Davies

Hi All,


if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion) will be eager to add full
support for the type, including compute kernels? Or are they likely to just
convert this type to ListArray at import boundaries?
I can't speak for query engines in general, but at least for arrow-rs 
and by extension DataFusion, and based on my current understanding of 
the use-cases I would be rather hesitant to add support to the kernels 
for this array type, definitely instead favouring conversion at the 
edges. We already have issues with the amount of code generation 
resulting in binary bloat and long compile times, and I worry this would 
worsen this situation whilst not really providing compelling advantages 
for the vast majority of workloads that don't interact with Velox. 
Whilst I can definitely see that the ListView representation is probably 
a better way to represent variable length lists than what arrow settled 
upon, I'm not yet convinced it is sufficiently better to incentivise 
broad ecosystem adoption.


Kind Regards,

Raphael Taylor-Davies

On 11/05/2023 21:20, Will Jones wrote:

Hi Felipe,

Thanks for the additional details.



Velox kernels benefit from being able to append data to the array from
different threads without care for strict ordering. Only the offsets array
has to be written according to logical order but that is potentially a much
smaller buffer than the values buffer.


It still seems to me like applications are still pretty niche, as I suspect
in most cases the benefits are outweighed by the costs. The benefit here
seems pretty limited: if you are trying to split work between threads,
usually you will have other levels such as array chunks to parallelize. And
if you have an incoming stream of row data, you'll want to append in
predictable order to match the order of the other arrays. Am I missing
something?

And, IIUC, the cost of using ListView with out-of-order values over
ListArray is you lose memory locality; the values of element 2 are no
longer adjacent to the values of element 1. What do you think about that
tradeoff?

I don't mean to be difficult about this. I'm excited for both the REE and
StringView arrays, but this one I'm not so sure about yet. I suppose what I
am trying to ask is, if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion) will be eager to add full
support for the type, including compute kernels? Or are they likely to just
convert this type to ListArray at import boundaries?

Because if it turns out to be the latter, then we might as well ask Velox
to export this type as ListArray and save the rest of the ecosystem some
work.

Best,

Will Jones

On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:


Initial reason for ListView arrays in Arrow is zero-copy compatibility with
Velox which uses this format.

Velox kernels benefit from being able to append data to the array from
different threads without care for strict ordering. Only the offsets array
has to be written according to logical order but that is potentially a much
smaller buffer than the values buffer.

Acero kernels could take advantage of that in the future.

In implementing ListViewArray/Type I was able to reuse some C++ templates
used for ListArray which can reduce some of the burden on kernel
implementations that aim to work with all the types.

I’m can fix Acero kernels for working with ListView. This is similar to the
work I’ve doing in kernels dealing with run-end encoded arrays.

—
Felipe


On Wed, 26 Apr 2023 at 01:03 Will Jones  wrote:


I suppose one common use case is materializing list columns after some
expanding operation like a join or unnest. That's a case where I could
imagine a lot of repetition of values. Haven't yet thought of common

cases

where there is overlap but not full duplication, but am eager to hear

any.

The dictionary encoding point Raphael makes is interesting, especially
given the existence of LargeList and FixedSizeList. For many operations,

it

might make more sense to just compose those existing types.

IIUC the operations that would be unique to the ArrayView are ones

altering

the shape. One could truncate each array to a certain length cheaply

simply

by replacing the sizes buffer. Or perhaps there are interesting

operations

on tensors that would benefit.

On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
 wrote:


Unless I am missing something, I think the selection use-case could be
equally well served by a dictionary-encoded BinarArray/ListArray, and

would

have the benefit of not requiring any modifications to the existing

format

or kernels.

The major additional flexibility of the proposed encoding would be
permitting disjoint or overlapping ranges, are these common enough in
practice to represent a meaningful bottleneck?


On 26 April 2023 01:40:14 BST, David Li  wrote:

Is there a need for a 64-bit offsets 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-11 Thread Will Jones
Hi Felipe,

Thanks for the additional details.


> Velox kernels benefit from being able to append data to the array from
> different threads without care for strict ordering. Only the offsets array
> has to be written according to logical order but that is potentially a much
> smaller buffer than the values buffer.
>

It still seems to me like applications are still pretty niche, as I suspect
in most cases the benefits are outweighed by the costs. The benefit here
seems pretty limited: if you are trying to split work between threads,
usually you will have other levels such as array chunks to parallelize. And
if you have an incoming stream of row data, you'll want to append in
predictable order to match the order of the other arrays. Am I missing
something?

And, IIUC, the cost of using ListView with out-of-order values over
ListArray is you lose memory locality; the values of element 2 are no
longer adjacent to the values of element 1. What do you think about that
tradeoff?

I don't mean to be difficult about this. I'm excited for both the REE and
StringView arrays, but this one I'm not so sure about yet. I suppose what I
am trying to ask is, if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion) will be eager to add full
support for the type, including compute kernels? Or are they likely to just
convert this type to ListArray at import boundaries?

Because if it turns out to be the latter, then we might as well ask Velox
to export this type as ListArray and save the rest of the ecosystem some
work.

Best,

Will Jones

On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> Initial reason for ListView arrays in Arrow is zero-copy compatibility with
> Velox which uses this format.
>
> Velox kernels benefit from being able to append data to the array from
> different threads without care for strict ordering. Only the offsets array
> has to be written according to logical order but that is potentially a much
> smaller buffer than the values buffer.
>
> Acero kernels could take advantage of that in the future.
>
> In implementing ListViewArray/Type I was able to reuse some C++ templates
> used for ListArray which can reduce some of the burden on kernel
> implementations that aim to work with all the types.
>
> I’m can fix Acero kernels for working with ListView. This is similar to the
> work I’ve doing in kernels dealing with run-end encoded arrays.
>
> —
> Felipe
>
>
> On Wed, 26 Apr 2023 at 01:03 Will Jones  wrote:
>
> > I suppose one common use case is materializing list columns after some
> > expanding operation like a join or unnest. That's a case where I could
> > imagine a lot of repetition of values. Haven't yet thought of common
> cases
> > where there is overlap but not full duplication, but am eager to hear
> any.
> >
> > The dictionary encoding point Raphael makes is interesting, especially
> > given the existence of LargeList and FixedSizeList. For many operations,
> it
> > might make more sense to just compose those existing types.
> >
> > IIUC the operations that would be unique to the ArrayView are ones
> altering
> > the shape. One could truncate each array to a certain length cheaply
> simply
> > by replacing the sizes buffer. Or perhaps there are interesting
> operations
> > on tensors that would benefit.
> >
> > On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
> >  wrote:
> >
> > > Unless I am missing something, I think the selection use-case could be
> > > equally well served by a dictionary-encoded BinarArray/ListArray, and
> > would
> > > have the benefit of not requiring any modifications to the existing
> > format
> > > or kernels.
> > >
> > > The major additional flexibility of the proposed encoding would be
> > > permitting disjoint or overlapping ranges, are these common enough in
> > > practice to represent a meaningful bottleneck?
> > >
> > >
> > > On 26 April 2023 01:40:14 BST, David Li  wrote:
> > > >Is there a need for a 64-bit offsets version the same way we have List
> > > and LargeList?
> > > >
> > > >And just to be clear, the difference with List is that the lists don't
> > > have to be stored in their logical order (or in other words, offsets do
> > not
> > > have to be nondecreasing and so we also need sizes)?
> > > >
> > > >On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> > > >> For context, there was some discussion on this back in [1].  At that
> > > time
> > > >> this was called "sequence view" but I do not like that name.
> However,
> > > >> array-view array is a little confusing.  Given this is similar to
> list
> > > can
> > > >> we go with list-view array?
> > > >>
> > > >>> Thanks for the introduction. I'd be interested to hear about the
> > > >>> applications Velox has found for these vectors, and in what
> > situations
> > > >> they
> > > >>> are useful. This could be contrasted with the current ListArray
> > > >>> implementations.
> > > >>
> > > >> I believe one 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-11 Thread Felipe Oliveira Carvalho
Initial reason for ListView arrays in Arrow is zero-copy compatibility with
Velox which uses this format.

Velox kernels benefit from being able to append data to the array from
different threads without care for strict ordering. Only the offsets array
has to be written according to logical order but that is potentially a much
smaller buffer than the values buffer.

Acero kernels could take advantage of that in the future.

In implementing ListViewArray/Type I was able to reuse some C++ templates
used for ListArray which can reduce some of the burden on kernel
implementations that aim to work with all the types.

I’m can fix Acero kernels for working with ListView. This is similar to the
work I’ve doing in kernels dealing with run-end encoded arrays.

—
Felipe


On Wed, 26 Apr 2023 at 01:03 Will Jones  wrote:

> I suppose one common use case is materializing list columns after some
> expanding operation like a join or unnest. That's a case where I could
> imagine a lot of repetition of values. Haven't yet thought of common cases
> where there is overlap but not full duplication, but am eager to hear any.
>
> The dictionary encoding point Raphael makes is interesting, especially
> given the existence of LargeList and FixedSizeList. For many operations, it
> might make more sense to just compose those existing types.
>
> IIUC the operations that would be unique to the ArrayView are ones altering
> the shape. One could truncate each array to a certain length cheaply simply
> by replacing the sizes buffer. Or perhaps there are interesting operations
> on tensors that would benefit.
>
> On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
>  wrote:
>
> > Unless I am missing something, I think the selection use-case could be
> > equally well served by a dictionary-encoded BinarArray/ListArray, and
> would
> > have the benefit of not requiring any modifications to the existing
> format
> > or kernels.
> >
> > The major additional flexibility of the proposed encoding would be
> > permitting disjoint or overlapping ranges, are these common enough in
> > practice to represent a meaningful bottleneck?
> >
> >
> > On 26 April 2023 01:40:14 BST, David Li  wrote:
> > >Is there a need for a 64-bit offsets version the same way we have List
> > and LargeList?
> > >
> > >And just to be clear, the difference with List is that the lists don't
> > have to be stored in their logical order (or in other words, offsets do
> not
> > have to be nondecreasing and so we also need sizes)?
> > >
> > >On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> > >> For context, there was some discussion on this back in [1].  At that
> > time
> > >> this was called "sequence view" but I do not like that name.  However,
> > >> array-view array is a little confusing.  Given this is similar to list
> > can
> > >> we go with list-view array?
> > >>
> > >>> Thanks for the introduction. I'd be interested to hear about the
> > >>> applications Velox has found for these vectors, and in what
> situations
> > >> they
> > >>> are useful. This could be contrasted with the current ListArray
> > >>> implementations.
> > >>
> > >> I believe one significant benefit is that take (and by proxy, filter)
> > and
> > >> sort are O(# of items) with the proposed format and O(# of bytes) with
> > the
> > >> current format.  Jorge did some profiling to this effect in [1].
> > >>
> > >> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > >>
> > >> On Tue, Apr 25, 2023 at 3:13 PM Will Jones 
> > wrote:
> > >>
> > >>> Hi Felipe,
> > >>>
> > >>> Thanks for the introduction. I'd be interested to hear about the
> > >>> applications Velox has found for these vectors, and in what
> situations
> > they
> > >>> are useful. This could be contrasted with the current ListArray
> > >>> implementations.
> > >>>
> > >>> IIUC it would be fairly cheap to transform a ListArray to an
> > ArrayView, but
> > >>> expensive to go the other way.
> > >>>
> > >>> Best,
> > >>>
> > >>> Will Jones
> > >>>
> > >>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
> > >>> felipe...@gmail.com> wrote:
> > >>>
> > >>> > Hi folks,
> > >>> >
> > >>> > I would like to start a public discussion on the inclusion of a new
> > array
> > >>> > format to Arrow — array-view array. The name is also up for debate.
> > >>> >
> > >>> > This format is inspired by Velox's ArrayVector format [1].
> Logically,
> > >>> this
> > >>> > array represents an array of arrays. Each element is an array-view
> > >>> (offset
> > >>> > and size pair) that points to a range within a nested "values"
> array
> > >>> > (called "elements" in Velox docs). The nested array can be of any
> > type,
> > >>> > which makes this format very flexible and powerful.
> > >>> >
> > >>> > [image: ../_images/array-vector.png]
> > >>> > <
> https://facebookincubator.github.io/velox/_images/array-vector.png>
> > >>> >
> > >>> > I'm currently working on a C++ implementation and plan to work on a
> > Go
> > >>> > 

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-27 Thread Andrew Lamb
My apologies, I did not see the  thread [1] for some reason

[1] https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb

On Thu, Apr 27, 2023 at 10:32 AM Andrew Lamb  wrote:

> Felipe, thank you for bringing this up.
>
> Another approach that is sometimes used in database engines (like DuckDB)
> and is often called selection vectors, is to store another bitmask that
> says which elements in the array should be "selected" and which are ignored
> and functions like a view.
>
> For example, a  selection vector {0, 1, 1, 0, 1} would represent a view of
> the second and third and fifth rows
>
> I think the selection vector is as general as the ArrayVector format you
> describe, and likely simpler to implement (especially in compute kernels).
> The downside is that for very sparse selections on very large arrays, the
> size of the selection vector may be larger than the array view
>
> Have you considered such an approach?
>
> Andrew
>
> On Wed, Apr 26, 2023 at 1:27 AM wish maple  wrote:
>
>> I think the ArrayVector can have benefits above:
>> 1. Converting a Batch in Velox or other system to arrow array could be
>> much
>> more lightweight.
>> 2. Modifying, filter and copy array or string could be much more
>> lightweight
>>
>> Velox can make a Vector mutable, seems that arrow array cannot. Seems it
>> makes little difference here.
>>
>> On 2023/04/25 22:00:08 Felipe Oliveira Carvalho wrote:
>> > Hi folks,
>> >
>> > I would like to start a public discussion on the inclusion of a new
>> array
>> > format to Arrow — array-view array. The name is also up for debate.
>> >
>> > This format is inspired by Velox's ArrayVector format [1]. Logically,
>> this
>> > array represents an array of arrays. Each element is an array-view
>> (offset
>> > and size pair) that points to a range within a nested "values" array
>> > (called "elements" in Velox docs). The nested array can be of any type,
>> > which makes this format very flexible and powerful.
>> >
>> > [image: ../_images/array-vector.png]
>> > 
>> >
>> > I'm currently working on a C++ implementation and plan to work on a Go
>> > implementation to fulfill the two-implementations requirement for format
>> > changes.
>> >
>> > The draft design:
>> >
>> > - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
>> > - 1 child array: "values" as an array of the type parameter
>> >
>> > validity_bitmap is used to differentiate between empty array views
>> > (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
>> >
>> > When the validity_bitmap[i] is 0, both sizes and offsets are undefined
>> (as
>> > usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is
>> recommended
>> > if setting a value is not an issue to the system producing the arrays.
>> >
>> > offsets buffer is not required to be ordered and views don't have to be
>> > disjoint.
>> >
>> > [1]
>> >
>> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>> >
>> > Thanks,
>> > Felipe O. Carvalho
>> >
>>
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-27 Thread Andrew Lamb
Felipe, thank you for bringing this up.

Another approach that is sometimes used in database engines (like DuckDB)
and is often called selection vectors, is to store another bitmask that
says which elements in the array should be "selected" and which are ignored
and functions like a view.

For example, a  selection vector {0, 1, 1, 0, 1} would represent a view of
the second and third and fifth rows

I think the selection vector is as general as the ArrayVector format you
describe, and likely simpler to implement (especially in compute kernels).
The downside is that for very sparse selections on very large arrays, the
size of the selection vector may be larger than the array view

Have you considered such an approach?

Andrew

On Wed, Apr 26, 2023 at 1:27 AM wish maple  wrote:

> I think the ArrayVector can have benefits above:
> 1. Converting a Batch in Velox or other system to arrow array could be much
> more lightweight.
> 2. Modifying, filter and copy array or string could be much more
> lightweight
>
> Velox can make a Vector mutable, seems that arrow array cannot. Seems it
> makes little difference here.
>
> On 2023/04/25 22:00:08 Felipe Oliveira Carvalho wrote:
> > Hi folks,
> >
> > I would like to start a public discussion on the inclusion of a new array
> > format to Arrow — array-view array. The name is also up for debate.
> >
> > This format is inspired by Velox's ArrayVector format [1]. Logically,
> this
> > array represents an array of arrays. Each element is an array-view
> (offset
> > and size pair) that points to a range within a nested "values" array
> > (called "elements" in Velox docs). The nested array can be of any type,
> > which makes this format very flexible and powerful.
> >
> > [image: ../_images/array-vector.png]
> > 
> >
> > I'm currently working on a C++ implementation and plan to work on a Go
> > implementation to fulfill the two-implementations requirement for format
> > changes.
> >
> > The draft design:
> >
> > - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
> > - 1 child array: "values" as an array of the type parameter
> >
> > validity_bitmap is used to differentiate between empty array views
> > (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
> >
> > When the validity_bitmap[i] is 0, both sizes and offsets are undefined
> (as
> > usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
> > if setting a value is not an issue to the system producing the arrays.
> >
> > offsets buffer is not required to be ordered and views don't have to be
> > disjoint.
> >
> > [1]
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> >
> > Thanks,
> > Felipe O. Carvalho
> >
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Micah Kornfield
Small bikeshed: But to keep naming consistent "ViewList"?


On Wed, Apr 26, 2023 at 8:02 AM Weston Pace  wrote:

> > My understanding is that the primary benefit of this ListView layout
> > over Arrow's existing List layouts [1] is that ListView allows for
> > buffer alignment [2] without padding, which makes vectorized
> > processing much more efficient. Is this understanding correct?
>
> Yes.  Though proponents of list-view would probably point out that it
> doesn't prevent you from having contiguous buffers, it simply doesn't
> require it.
>
> > Unless I am missing something, I think the selection use-case
> > could be equally well served by a dictionary-encoded
> BinarArray/ListArray,
> > and would have the benefit of not requiring any modifications to the
> > existing format or kernels.
>
> This is a good point that did not come up in the previous discussion that I
> can see.
>
> > The major additional flexibility of the proposed encoding would be
> permitting disjoint
> > or overlapping ranges, are these common enough in practice to represent a
> meaningful bottleneck?
>
> I'm not sure.  There was one other use case that was brought up in the
> original discussion.  This was that list view arrays can be constructed in
> parallel.  That is, if you know the output size (e.g. when applying a large
> scalar function), then you can have different threads fill out different
> regions of the offsets / lengths buffers.  That being said, I don't know
> for certain if anyone is relying on this behavior.
>
>
> On Wed, Apr 26, 2023 at 7:12 AM Felipe Oliveira Carvalho <
> felipe...@gmail.com> wrote:
>
> > After Weston's suggestion above, I've renamed files and classes in my WIP
> > implementation:
> >
> > ArrayView -> ListView
> >
> > On Wed, Apr 26, 2023 at 11:08 AM Ian Cook  wrote:
> >
> > > +1 to what Weston and Joris suggested regarding the name. "ListView"
> > > seems like the best name to use for this layout in Arrow.
> > >
> > > My understanding is that the primary benefit of this ListView layout
> > > over Arrow's existing List layouts [1] is that ListView allows for
> > > buffer alignment [2] without padding, which makes vectorized
> > > processing much more efficient. Is this understanding correct?
> > >
> > > [1]
> > >
> >
> https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
> > > [2]
> > >
> >
> https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
> > >
> > > Ian
> > >
> > > On Wed, Apr 26, 2023 at 5:27 AM Joris Van den Bossche
> > >  wrote:
> > > >
> > > > On Wed, 26 Apr 2023 at 02:37, Weston Pace 
> > wrote:
> > > > >
> > > > > For context, there was some discussion on this back in [1].  At
> that
> > > time
> > > > > this was called "sequence view" but I do not like that name.
> > However,
> > > > > array-view array is a little confusing.  Given this is similar to
> > list
> > > can
> > > > > we go with list-view array?
> > > >
> > > > Yes, given that this is essentially an alternative representation of
> a
> > > > logical "list" array, I would also prefer that we use the term "list"
> > > > in the name for such a new type. The word "array" has a different
> > > > meaning in context of our columnar specification.
> > >
> >
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Weston Pace
> My understanding is that the primary benefit of this ListView layout
> over Arrow's existing List layouts [1] is that ListView allows for
> buffer alignment [2] without padding, which makes vectorized
> processing much more efficient. Is this understanding correct?

Yes.  Though proponents of list-view would probably point out that it
doesn't prevent you from having contiguous buffers, it simply doesn't
require it.

> Unless I am missing something, I think the selection use-case
> could be equally well served by a dictionary-encoded BinarArray/ListArray,
> and would have the benefit of not requiring any modifications to the
> existing format or kernels.

This is a good point that did not come up in the previous discussion that I
can see.

> The major additional flexibility of the proposed encoding would be
permitting disjoint
> or overlapping ranges, are these common enough in practice to represent a
meaningful bottleneck?

I'm not sure.  There was one other use case that was brought up in the
original discussion.  This was that list view arrays can be constructed in
parallel.  That is, if you know the output size (e.g. when applying a large
scalar function), then you can have different threads fill out different
regions of the offsets / lengths buffers.  That being said, I don't know
for certain if anyone is relying on this behavior.


On Wed, Apr 26, 2023 at 7:12 AM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> After Weston's suggestion above, I've renamed files and classes in my WIP
> implementation:
>
> ArrayView -> ListView
>
> On Wed, Apr 26, 2023 at 11:08 AM Ian Cook  wrote:
>
> > +1 to what Weston and Joris suggested regarding the name. "ListView"
> > seems like the best name to use for this layout in Arrow.
> >
> > My understanding is that the primary benefit of this ListView layout
> > over Arrow's existing List layouts [1] is that ListView allows for
> > buffer alignment [2] without padding, which makes vectorized
> > processing much more efficient. Is this understanding correct?
> >
> > [1]
> >
> https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
> > [2]
> >
> https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
> >
> > Ian
> >
> > On Wed, Apr 26, 2023 at 5:27 AM Joris Van den Bossche
> >  wrote:
> > >
> > > On Wed, 26 Apr 2023 at 02:37, Weston Pace 
> wrote:
> > > >
> > > > For context, there was some discussion on this back in [1].  At that
> > time
> > > > this was called "sequence view" but I do not like that name.
> However,
> > > > array-view array is a little confusing.  Given this is similar to
> list
> > can
> > > > we go with list-view array?
> > >
> > > Yes, given that this is essentially an alternative representation of a
> > > logical "list" array, I would also prefer that we use the term "list"
> > > in the name for such a new type. The word "array" has a different
> > > meaning in context of our columnar specification.
> >
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Felipe Oliveira Carvalho
After Weston's suggestion above, I've renamed files and classes in my WIP
implementation:

ArrayView -> ListView

On Wed, Apr 26, 2023 at 11:08 AM Ian Cook  wrote:

> +1 to what Weston and Joris suggested regarding the name. "ListView"
> seems like the best name to use for this layout in Arrow.
>
> My understanding is that the primary benefit of this ListView layout
> over Arrow's existing List layouts [1] is that ListView allows for
> buffer alignment [2] without padding, which makes vectorized
> processing much more efficient. Is this understanding correct?
>
> [1]
> https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
> [2]
> https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
>
> Ian
>
> On Wed, Apr 26, 2023 at 5:27 AM Joris Van den Bossche
>  wrote:
> >
> > On Wed, 26 Apr 2023 at 02:37, Weston Pace  wrote:
> > >
> > > For context, there was some discussion on this back in [1].  At that
> time
> > > this was called "sequence view" but I do not like that name.  However,
> > > array-view array is a little confusing.  Given this is similar to list
> can
> > > we go with list-view array?
> >
> > Yes, given that this is essentially an alternative representation of a
> > logical "list" array, I would also prefer that we use the term "list"
> > in the name for such a new type. The word "array" has a different
> > meaning in context of our columnar specification.
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Ian Cook
+1 to what Weston and Joris suggested regarding the name. "ListView"
seems like the best name to use for this layout in Arrow.

My understanding is that the primary benefit of this ListView layout
over Arrow's existing List layouts [1] is that ListView allows for
buffer alignment [2] without padding, which makes vectorized
processing much more efficient. Is this understanding correct?

[1] https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
[2] 
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

Ian

On Wed, Apr 26, 2023 at 5:27 AM Joris Van den Bossche
 wrote:
>
> On Wed, 26 Apr 2023 at 02:37, Weston Pace  wrote:
> >
> > For context, there was some discussion on this back in [1].  At that time
> > this was called "sequence view" but I do not like that name.  However,
> > array-view array is a little confusing.  Given this is similar to list can
> > we go with list-view array?
>
> Yes, given that this is essentially an alternative representation of a
> logical "list" array, I would also prefer that we use the term "list"
> in the name for such a new type. The word "array" has a different
> meaning in context of our columnar specification.


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Joris Van den Bossche
On Wed, 26 Apr 2023 at 02:37, Weston Pace  wrote:
>
> For context, there was some discussion on this back in [1].  At that time
> this was called "sequence view" but I do not like that name.  However,
> array-view array is a little confusing.  Given this is similar to list can
> we go with list-view array?

Yes, given that this is essentially an alternative representation of a
logical "list" array, I would also prefer that we use the term "list"
in the name for such a new type. The word "array" has a different
meaning in context of our columnar specification.


RE: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread wish maple
I think the ArrayVector can have benefits above:
1. Converting a Batch in Velox or other system to arrow array could be much
more lightweight.
2. Modifying, filter and copy array or string could be much more
lightweight

Velox can make a Vector mutable, seems that arrow array cannot. Seems it
makes little difference here.

On 2023/04/25 22:00:08 Felipe Oliveira Carvalho wrote:
> Hi folks,
>
> I would like to start a public discussion on the inclusion of a new array
> format to Arrow — array-view array. The name is also up for debate.
>
> This format is inspired by Velox's ArrayVector format [1]. Logically, this
> array represents an array of arrays. Each element is an array-view (offset
> and size pair) that points to a range within a nested "values" array
> (called "elements" in Velox docs). The nested array can be of any type,
> which makes this format very flexible and powerful.
>
> [image: ../_images/array-vector.png]
> 
>
> I'm currently working on a C++ implementation and plan to work on a Go
> implementation to fulfill the two-implementations requirement for format
> changes.
>
> The draft design:
>
> - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
> - 1 child array: "values" as an array of the type parameter
>
> validity_bitmap is used to differentiate between empty array views
> (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
>
> When the validity_bitmap[i] is 0, both sizes and offsets are undefined (as
> usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
> if setting a value is not an issue to the system producing the arrays.
>
> offsets buffer is not required to be ordered and views don't have to be
> disjoint.
>
> [1]
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>
> Thanks,
> Felipe O. Carvalho
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Will Jones
I suppose one common use case is materializing list columns after some
expanding operation like a join or unnest. That's a case where I could
imagine a lot of repetition of values. Haven't yet thought of common cases
where there is overlap but not full duplication, but am eager to hear any.

The dictionary encoding point Raphael makes is interesting, especially
given the existence of LargeList and FixedSizeList. For many operations, it
might make more sense to just compose those existing types.

IIUC the operations that would be unique to the ArrayView are ones altering
the shape. One could truncate each array to a certain length cheaply simply
by replacing the sizes buffer. Or perhaps there are interesting operations
on tensors that would benefit.

On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
 wrote:

> Unless I am missing something, I think the selection use-case could be
> equally well served by a dictionary-encoded BinarArray/ListArray, and would
> have the benefit of not requiring any modifications to the existing format
> or kernels.
>
> The major additional flexibility of the proposed encoding would be
> permitting disjoint or overlapping ranges, are these common enough in
> practice to represent a meaningful bottleneck?
>
>
> On 26 April 2023 01:40:14 BST, David Li  wrote:
> >Is there a need for a 64-bit offsets version the same way we have List
> and LargeList?
> >
> >And just to be clear, the difference with List is that the lists don't
> have to be stored in their logical order (or in other words, offsets do not
> have to be nondecreasing and so we also need sizes)?
> >
> >On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> >> For context, there was some discussion on this back in [1].  At that
> time
> >> this was called "sequence view" but I do not like that name.  However,
> >> array-view array is a little confusing.  Given this is similar to list
> can
> >> we go with list-view array?
> >>
> >>> Thanks for the introduction. I'd be interested to hear about the
> >>> applications Velox has found for these vectors, and in what situations
> >> they
> >>> are useful. This could be contrasted with the current ListArray
> >>> implementations.
> >>
> >> I believe one significant benefit is that take (and by proxy, filter)
> and
> >> sort are O(# of items) with the proposed format and O(# of bytes) with
> the
> >> current format.  Jorge did some profiling to this effect in [1].
> >>
> >> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> >>
> >> On Tue, Apr 25, 2023 at 3:13 PM Will Jones 
> wrote:
> >>
> >>> Hi Felipe,
> >>>
> >>> Thanks for the introduction. I'd be interested to hear about the
> >>> applications Velox has found for these vectors, and in what situations
> they
> >>> are useful. This could be contrasted with the current ListArray
> >>> implementations.
> >>>
> >>> IIUC it would be fairly cheap to transform a ListArray to an
> ArrayView, but
> >>> expensive to go the other way.
> >>>
> >>> Best,
> >>>
> >>> Will Jones
> >>>
> >>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
> >>> felipe...@gmail.com> wrote:
> >>>
> >>> > Hi folks,
> >>> >
> >>> > I would like to start a public discussion on the inclusion of a new
> array
> >>> > format to Arrow — array-view array. The name is also up for debate.
> >>> >
> >>> > This format is inspired by Velox's ArrayVector format [1]. Logically,
> >>> this
> >>> > array represents an array of arrays. Each element is an array-view
> >>> (offset
> >>> > and size pair) that points to a range within a nested "values" array
> >>> > (called "elements" in Velox docs). The nested array can be of any
> type,
> >>> > which makes this format very flexible and powerful.
> >>> >
> >>> > [image: ../_images/array-vector.png]
> >>> > 
> >>> >
> >>> > I'm currently working on a C++ implementation and plan to work on a
> Go
> >>> > implementation to fulfill the two-implementations requirement for
> format
> >>> > changes.
> >>> >
> >>> > The draft design:
> >>> >
> >>> > - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes
> buffer]
> >>> > - 1 child array: "values" as an array of the type parameter
> >>> >
> >>> > validity_bitmap is used to differentiate between empty array views
> >>> > (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
> >>> >
> >>> > When the validity_bitmap[i] is 0, both sizes and offsets are
> undefined
> >>> (as
> >>> > usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is
> recommended
> >>> > if setting a value is not an issue to the system producing the
> arrays.
> >>> >
> >>> > offsets buffer is not required to be ordered and views don't have to
> be
> >>> > disjoint.
> >>> >
> >>> > [1]
> >>> >
> >>>
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> >>> >
> >>> > Thanks,
> >>> > Felipe O. Carvalho
> >>> >
> >>>
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Raphael Taylor-Davies
Unless I am missing something, I think the selection use-case could be equally 
well served by a dictionary-encoded BinarArray/ListArray, and would have the 
benefit of not requiring any modifications to the existing format or kernels.

The major additional flexibility of the proposed encoding would be permitting 
disjoint or overlapping ranges, are these common enough in practice to 
represent a meaningful bottleneck?


On 26 April 2023 01:40:14 BST, David Li  wrote:
>Is there a need for a 64-bit offsets version the same way we have List and 
>LargeList?
>
>And just to be clear, the difference with List is that the lists don't have to 
>be stored in their logical order (or in other words, offsets do not have to be 
>nondecreasing and so we also need sizes)?
>
>On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
>> For context, there was some discussion on this back in [1].  At that time
>> this was called "sequence view" but I do not like that name.  However,
>> array-view array is a little confusing.  Given this is similar to list can
>> we go with list-view array?
>>
>>> Thanks for the introduction. I'd be interested to hear about the
>>> applications Velox has found for these vectors, and in what situations
>> they
>>> are useful. This could be contrasted with the current ListArray
>>> implementations.
>>
>> I believe one significant benefit is that take (and by proxy, filter) and
>> sort are O(# of items) with the proposed format and O(# of bytes) with the
>> current format.  Jorge did some profiling to this effect in [1].
>>
>> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
>>
>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones  wrote:
>>
>>> Hi Felipe,
>>>
>>> Thanks for the introduction. I'd be interested to hear about the
>>> applications Velox has found for these vectors, and in what situations they
>>> are useful. This could be contrasted with the current ListArray
>>> implementations.
>>>
>>> IIUC it would be fairly cheap to transform a ListArray to an ArrayView, but
>>> expensive to go the other way.
>>>
>>> Best,
>>>
>>> Will Jones
>>>
>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
>>> felipe...@gmail.com> wrote:
>>>
>>> > Hi folks,
>>> >
>>> > I would like to start a public discussion on the inclusion of a new array
>>> > format to Arrow — array-view array. The name is also up for debate.
>>> >
>>> > This format is inspired by Velox's ArrayVector format [1]. Logically,
>>> this
>>> > array represents an array of arrays. Each element is an array-view
>>> (offset
>>> > and size pair) that points to a range within a nested "values" array
>>> > (called "elements" in Velox docs). The nested array can be of any type,
>>> > which makes this format very flexible and powerful.
>>> >
>>> > [image: ../_images/array-vector.png]
>>> > 
>>> >
>>> > I'm currently working on a C++ implementation and plan to work on a Go
>>> > implementation to fulfill the two-implementations requirement for format
>>> > changes.
>>> >
>>> > The draft design:
>>> >
>>> > - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
>>> > - 1 child array: "values" as an array of the type parameter
>>> >
>>> > validity_bitmap is used to differentiate between empty array views
>>> > (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
>>> >
>>> > When the validity_bitmap[i] is 0, both sizes and offsets are undefined
>>> (as
>>> > usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
>>> > if setting a value is not an issue to the system producing the arrays.
>>> >
>>> > offsets buffer is not required to be ordered and views don't have to be
>>> > disjoint.
>>> >
>>> > [1]
>>> >
>>> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>>> >
>>> > Thanks,
>>> > Felipe O. Carvalho
>>> >
>>>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread David Li
Is there a need for a 64-bit offsets version the same way we have List and 
LargeList?

And just to be clear, the difference with List is that the lists don't have to 
be stored in their logical order (or in other words, offsets do not have to be 
nondecreasing and so we also need sizes)?

On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> For context, there was some discussion on this back in [1].  At that time
> this was called "sequence view" but I do not like that name.  However,
> array-view array is a little confusing.  Given this is similar to list can
> we go with list-view array?
>
>> Thanks for the introduction. I'd be interested to hear about the
>> applications Velox has found for these vectors, and in what situations
> they
>> are useful. This could be contrasted with the current ListArray
>> implementations.
>
> I believe one significant benefit is that take (and by proxy, filter) and
> sort are O(# of items) with the proposed format and O(# of bytes) with the
> current format.  Jorge did some profiling to this effect in [1].
>
> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
>
> On Tue, Apr 25, 2023 at 3:13 PM Will Jones  wrote:
>
>> Hi Felipe,
>>
>> Thanks for the introduction. I'd be interested to hear about the
>> applications Velox has found for these vectors, and in what situations they
>> are useful. This could be contrasted with the current ListArray
>> implementations.
>>
>> IIUC it would be fairly cheap to transform a ListArray to an ArrayView, but
>> expensive to go the other way.
>>
>> Best,
>>
>> Will Jones
>>
>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
>> felipe...@gmail.com> wrote:
>>
>> > Hi folks,
>> >
>> > I would like to start a public discussion on the inclusion of a new array
>> > format to Arrow — array-view array. The name is also up for debate.
>> >
>> > This format is inspired by Velox's ArrayVector format [1]. Logically,
>> this
>> > array represents an array of arrays. Each element is an array-view
>> (offset
>> > and size pair) that points to a range within a nested "values" array
>> > (called "elements" in Velox docs). The nested array can be of any type,
>> > which makes this format very flexible and powerful.
>> >
>> > [image: ../_images/array-vector.png]
>> > 
>> >
>> > I'm currently working on a C++ implementation and plan to work on a Go
>> > implementation to fulfill the two-implementations requirement for format
>> > changes.
>> >
>> > The draft design:
>> >
>> > - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
>> > - 1 child array: "values" as an array of the type parameter
>> >
>> > validity_bitmap is used to differentiate between empty array views
>> > (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
>> >
>> > When the validity_bitmap[i] is 0, both sizes and offsets are undefined
>> (as
>> > usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
>> > if setting a value is not an issue to the system producing the arrays.
>> >
>> > offsets buffer is not required to be ordered and views don't have to be
>> > disjoint.
>> >
>> > [1]
>> >
>> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>> >
>> > Thanks,
>> > Felipe O. Carvalho
>> >
>>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Weston Pace
For context, there was some discussion on this back in [1].  At that time
this was called "sequence view" but I do not like that name.  However,
array-view array is a little confusing.  Given this is similar to list can
we go with list-view array?

> Thanks for the introduction. I'd be interested to hear about the
> applications Velox has found for these vectors, and in what situations
they
> are useful. This could be contrasted with the current ListArray
> implementations.

I believe one significant benefit is that take (and by proxy, filter) and
sort are O(# of items) with the proposed format and O(# of bytes) with the
current format.  Jorge did some profiling to this effect in [1].

[1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq

On Tue, Apr 25, 2023 at 3:13 PM Will Jones  wrote:

> Hi Felipe,
>
> Thanks for the introduction. I'd be interested to hear about the
> applications Velox has found for these vectors, and in what situations they
> are useful. This could be contrasted with the current ListArray
> implementations.
>
> IIUC it would be fairly cheap to transform a ListArray to an ArrayView, but
> expensive to go the other way.
>
> Best,
>
> Will Jones
>
> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
> felipe...@gmail.com> wrote:
>
> > Hi folks,
> >
> > I would like to start a public discussion on the inclusion of a new array
> > format to Arrow — array-view array. The name is also up for debate.
> >
> > This format is inspired by Velox's ArrayVector format [1]. Logically,
> this
> > array represents an array of arrays. Each element is an array-view
> (offset
> > and size pair) that points to a range within a nested "values" array
> > (called "elements" in Velox docs). The nested array can be of any type,
> > which makes this format very flexible and powerful.
> >
> > [image: ../_images/array-vector.png]
> > 
> >
> > I'm currently working on a C++ implementation and plan to work on a Go
> > implementation to fulfill the two-implementations requirement for format
> > changes.
> >
> > The draft design:
> >
> > - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
> > - 1 child array: "values" as an array of the type parameter
> >
> > validity_bitmap is used to differentiate between empty array views
> > (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
> >
> > When the validity_bitmap[i] is 0, both sizes and offsets are undefined
> (as
> > usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
> > if setting a value is not an issue to the system producing the arrays.
> >
> > offsets buffer is not required to be ordered and views don't have to be
> > disjoint.
> >
> > [1]
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> >
> > Thanks,
> > Felipe O. Carvalho
> >
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Will Jones
Hi Felipe,

Thanks for the introduction. I'd be interested to hear about the
applications Velox has found for these vectors, and in what situations they
are useful. This could be contrasted with the current ListArray
implementations.

IIUC it would be fairly cheap to transform a ListArray to an ArrayView, but
expensive to go the other way.

Best,

Will Jones

On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> Hi folks,
>
> I would like to start a public discussion on the inclusion of a new array
> format to Arrow — array-view array. The name is also up for debate.
>
> This format is inspired by Velox's ArrayVector format [1]. Logically, this
> array represents an array of arrays. Each element is an array-view (offset
> and size pair) that points to a range within a nested "values" array
> (called "elements" in Velox docs). The nested array can be of any type,
> which makes this format very flexible and powerful.
>
> [image: ../_images/array-vector.png]
> 
>
> I'm currently working on a C++ implementation and plan to work on a Go
> implementation to fulfill the two-implementations requirement for format
> changes.
>
> The draft design:
>
> - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
> - 1 child array: "values" as an array of the type parameter
>
> validity_bitmap is used to differentiate between empty array views
> (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
>
> When the validity_bitmap[i] is 0, both sizes and offsets are undefined (as
> usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
> if setting a value is not an issue to the system producing the arrays.
>
> offsets buffer is not required to be ordered and views don't have to be
> disjoint.
>
> [1]
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>
> Thanks,
> Felipe O. Carvalho
>


[DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Felipe Oliveira Carvalho
Hi folks,

I would like to start a public discussion on the inclusion of a new array
format to Arrow — array-view array. The name is also up for debate.

This format is inspired by Velox's ArrayVector format [1]. Logically, this
array represents an array of arrays. Each element is an array-view (offset
and size pair) that points to a range within a nested "values" array
(called "elements" in Velox docs). The nested array can be of any type,
which makes this format very flexible and powerful.

[image: ../_images/array-vector.png]


I'm currently working on a C++ implementation and plan to work on a Go
implementation to fulfill the two-implementations requirement for format
changes.

The draft design:

- 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
- 1 child array: "values" as an array of the type parameter

validity_bitmap is used to differentiate between empty array views
(sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).

When the validity_bitmap[i] is 0, both sizes and offsets are undefined (as
usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
if setting a value is not an issue to the system producing the arrays.

offsets buffer is not required to be ordered and views don't have to be
disjoint.

[1]
https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector

Thanks,
Felipe O. Carvalho