Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Raphael Taylor-Davies Wed, 14 Jun 2023 01:45:42 -0700

Hi All,

I might be missing something, but rather than opening the can of wormsof alternative layouts, etc... perhaps we could support this use-case asa canonical extension type over dictionary encoded, variable-sizedarrays. I'll try to explain my reasoning below, but the major advantagewould be that this would not require any modifications to the existingspecification or codebases, and I think would support all the use-casesI've seen articulated.


Taking a step back:

- Variable-sized arrays encode value positions in a list ofmonotonically increasing integer offsets, with each slice of valuesidentified by the two consecutive offsets

- Dictionaries encode value positions in a list of integers in any order
- Views encode value positions as a pair of start and end integer offsets

If you squint, dictionary encoding a variable-sized array removes theordering restriction on the value offsets, resulting in a very similarconstruction to the proposed view arrays. The major differences thatcome to mind are:

- Views can contain partially overlapping value ranges, whereasdictionaries can only encode disjoint or identical ranges- Views require one less addition operation to access a value, and maytherefore have slightly different performance characteristics- A null in a view takes up two offsets, a null in a dictionary takesone key- Selection/take/filter on a dictionary is the same as for a primitivearray, and should be ~2x faster than a view- The dictionary could use a smaller key size than the offset size ofits child list, potentially saving space compared to a view- Unreferenced values in a view are free, whereas they take up an offsetin the child list

- Kernels may be optimised assuming small and/or unique dictionaries

- IPC files currently only support a single dictionary per column,something I suspect we may want to revisit regardless

I do think using dictionaries in this way is perhaps a little confusingto users, and a canonical extension type will only go so far to addressthis, but I personally think this is a reasonable price to pay foravoiding the ecosystem fragmentation that would potentially result fromintroducing alternative layouts for common types such as strings, whilstalso avoiding the need to reimplement a load of kernels.


What do people think?

Kind Regards,

Raphael Taylor-Davies

On 14/06/2023 06:34, Will Jones wrote:

Hello Arrow devs,

Just a quick note. To answer one of my earlier questions:

1. Is this array type currently only used in Velox? (not DuckDB like some

of the other new types?) What evidence do we have that it will become used
outside of Velox?

This type is also used by DuckDB. Found discussion today in a talk from
Mark Raasveldt [1]. That does improve the case for adding this type in my
eyes.

Best,

Will Jones

[1] https://youtu.be/bZOvAKGkzpQ?t=1570



On Tue, Jun 6, 2023 at 7:40 PM Weston Pace <weston.p...@gmail.com> wrote:

This implies that each canonical alternative layout would codify a
primary layout as its "fallback."

Yes, that was part of my proposal:

  * A new layout, if it is semantically equivalent to another, is

considered an alternative layout

Or, to phrase it another way.  If there is not a "fallback" then it is not
an alternative layout.  It's a brand new primary layout.  I'd expect this
to be quite rare.  I can't really even hypothesize any examples.  I think
the only truly atomic layouts are fixed-width, list, and struct.

This seems reasonable but it opens
up some cans of worms, such as how two components communicating
through an Arrow interface would negotiate which layout is supported

Most APIs that I'm aware of already do this.  For example,
pyarrow.parquet.read_table has a "read_dictionary" property that can be
used to control whether or not a column is returned with the dictionary
encoding.  There is no way (that I'm aware of) to get a column in REE
encoding today without explicitly requesting it.  In fact, this could be as
simple as a boolean "use_advanced_features" flag although I would
discourage something so simplistic.  The point is that arrow-compatible
software should, by default, emit types that are supported by all arrow
implementations.

Of course, there is no way to enforce this, it's just a guideline / strong
recommendation on how software should behave if it wants to state "arrow
compatible" as a feature.

On Tue, Jun 6, 2023 at 3:33 PM Ian Cook <ianmc...@apache.org> wrote:

Thanks Weston. That all sounds reasonable to me.

  with the caveat that the primary layout must be emitted if the user

does not specifically request the alternative layout

This implies that each canonical alternative layout would codify a
primary layout as its "fallback." This seems reasonable but it opens
up some cans of worms, such as how two components communicating
through an Arrow interface would negotiate which layout is supported.
I suppose such details should be discussed in a separate thread, but I
raise this here just to point out that it implies an expansion in the
scope of what Arrow interfaces can do.

On Tue, Jun 6, 2023 at 6:17 PM Weston Pace <weston.p...@gmail.com>

wrote:

 From Micah:

This sounds reasonable to me but my main concern is, I'm not sure

there is

a great mechanism to enforce canonical layouts don't somehow become

default

(or the only implementation).

I'm not sure I understand.  Is the concern that an alternative layout

is

eventually
used more and more by implementations until it is used more often than

the

primary
layouts?  In that case I think that is ok and we can promote the

alternative

to a primary layout.

Or is the concern that some applications will only support the

alternative

layouts and
not the primary layout?  In that case I would argue the application is

not

"arrow compatible".
I don't know that we prevent or enforce this today either.  An author

can

always falsely
claim they support Arrow even if they are using their own bespoke

format.

 From Ian:

It seems to me that most projects that are implementing Arrow today
are not aiming to provide complete coverage of Arrow; rather they are
adopting Arrow because of its role as a standard and they are
implementing only as much of the Arrow standard as they require to
achieve some goal. I believe that such projects are important Arrow
stakeholders, and I believe that this proposed notion of canonical
alternative layouts will serve them well and will create efficiencies
by standardizing implementations around a shared set of alternatives.

However I think that the documentation for canonical alternative
layouts should strongly encourage implementers to default to using

the

primary layouts defined in the core spec and only use alternative
layouts in cases where the primary layouts do not meet their needs.

I'd maybe take a slightly harsher stance.  I don't think an application
needs to
support all types.  For example, an Arrow-native string processing

library

might
not worry about the integer types.  That would be fine.  I think it

would

still
be fair to call it an "arrow compatible string processing library".

However, an application must support primary layouts in addition to
alternative
layouts.  For example, a string processing library that expects all

strings

to be
delivered as a single buffer sequence of null-terminated strings would

not

be "an
arrow compatible string processing library" unless it also fully

supported

the
standard (lengths + data) variable-sized list layout for strings

defined

at

[1].

In other words:

  * Only receives and emits alternative layouts - not arrow compatible
  * Only receives and emits primary layouts - arrow compatible
  * Receives and emits both primary and alternative layouts - arrow
compatible†

† - with the caveat that the primary layout must be emitted if the user
does not
specifically request the alternative layout.

[1]

https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout

On Tue, Jun 6, 2023 at 2:45 PM Ian Cook <ianmc...@apache.org> wrote:

To clarify why we cannot simply propose adding ListView as a new
“canonical extension type”: The extension type mechanism in Arrow
depends on the underlying data being organized in an existing Arrow
layout—that way an implementation that does not support the extension
type can still handle the underlying data. But ListView is a wholly
new layout.

I strongly agree with Weston’s idea that it is a good time for Arrow
to introduce the notion of “canonical alternative layouts.”

Taken together, I think that canonical extension types and canonical
alternative layouts could serve as an “incubator” for proposed new
representations. For example, if a proposed canonical alternative
layout ends up being broadly adopted, then that will serve as a

signal

that we should consider adding it as a primary layout in the core
spec.

It seems to me that most projects that are implementing Arrow today
are not aiming to provide complete coverage of Arrow; rather they are
adopting Arrow because of its role as a standard and they are
implementing only as much of the Arrow standard as they require to
achieve some goal. I believe that such projects are important Arrow
stakeholders, and I believe that this proposed notion of canonical
alternative layouts will serve them well and will create efficiencies
by standardizing implementations around a shared set of alternatives.

However I think that the documentation for canonical alternative
layouts should strongly encourage implementers to default to using

the

primary layouts defined in the core spec and only use alternative
layouts in cases where the primary layouts do not meet their needs.


On Sat, May 27, 2023 at 7:44 PM Micah Kornfield <

emkornfi...@gmail.com

wrote:

This sounds reasonable to me but my main concern is, I'm not sure

there

is

a great mechanism to enforce canonical layouts don't somehow become

default

(or the only implementation).

Even for these new layouts, I think it might be worth rethinking

binding

layout into the schema versus having a different concept of

encoding

(and

changing some of the corresponding data structures).


On Mon, May 22, 2023 at 10:37 AM Weston Pace <

weston.p...@gmail.com>

wrote:

Trying to settle on one option is a fruitless endeavor.  Each

type

has

pros

and cons.  I would also predict that the largest existing usage

of

Arrow is

shuttling data from one system to another.  The newly proposed

format

doesn't appear to have any significant advantage for that use

case

(if

anything, the existing format is arguably better as it is more

compact).

I am very biased towards historical precedent and avoiding

breaking

changes.

We have "canonical extension types", perhaps it is time for

"canonical

alternative layouts".  We could define it as such:

  * There are one or more primary layouts
    * Existing layouts are automatically considered primary

layouts,

even if

they wouldn't
      have been primary layouts initially (e.g. large list)
  * A new layout, if it is semantically equivalent to another, is

considered

an alternative layout
  * An alternative layout still has the same requirements for

adoption

(two

implementations and a vote)
    * An implementation should not feel pressured to rush and

implement

the

new layout.
      It would be good if they contribute in the discussion and

consider the

layout and vote
      if they feel it would be an acceptable design.
  * We can define and vote and approve as many canonical

alternative

layouts

as we want:
    * A canonical alternative layout should, at a minimum, have

some

      reasonable justification, such as improved performance for

algorithm X

  * Arrow implementations MUST support the primary layouts
  * An Arrow implementation MAY support a canonical alternative,

however:

    * An Arrow implementation MUST first support the primary

layout

    * An Arrow implementation MUST support conversion to/from the

primary

and canonical layout
    * An Arrow implementation's APIs MUST only provide data in the
alternative
      layout if it is explicitly asked for (e.g. schema inference

should

prefer the primary layout).
  * We can still vote for new primary layouts (e.g. promoting a

canonical

alternative) but, in these
     votes we don't only consider the value (e.g. performance) of

the

layout

but also the interoperability.
     In other words, a layout can only become a primary layout if

there

is

significant evidence that most
     implementations plan to adopt it.

This lets us evolve support for new layouts more naturally.  We

can

generally assume that users will not, initially, be aware of

these

alternative layouts.  However, everything will just work.  They

may

start

to see a performance penalty stemming from a lack of support for

these

layouts.  If this performance penalty becomes significant then

they

will

discover it and become aware of the problem.  They can then ask

whatever

library they are using to add support for the alternative layout.

As

enough users find a need for it then libraries will add support.
Eventually, enough libraries will support it that we can adopt it

as a

primary layout.

Also, it allows libraries to adopt alternative layouts more

aggressively if

they would like while still hopefully ensuring that we eventually

all

converge on the same implementation of the alternative layout.

On Mon, May 22, 2023 at 9:35 AM Will Jones <

will.jones...@gmail.com>

wrote:

Hello Arrow devs,

I don't understand why we would start deprecating features in

the

Arrow

format. Even starting this talk might already be a bad idea

PR-wise.

I agree we don't want to make breaking changes to the Arrow

format.

But

several maintainers have already stated they have no interest

in

maintaining both list types with full compute functionality

[1][2],

so I

think it's very likely one list type or the other will be
implicitly preferred in the ecosystem if this data type was

added.

If

that's the case, I'd prefer that we agreed as a community which

one

should

be preferred. Maybe that's not the best path; it's just one way

for

us to

balance stability, maintenance burden, and relevance.

Can someone help distill down the primary rationale and usecase

for

adding ArrayView to the Arrow Spec?

Looking back at that old thread, I think one of the main

motivations

is

to

try to prevent query engine implementers from feeling there is

tradeoff

between having state-of-the-art performance and being

Arrow-native.

For

some of the new array types, we had both Velox and DuckDB use

them,

so it

was reasonable to expect they were innovations that might

proliferate.

I'm

not sure if the ArrayView is part of that. From Wes earlier

[3]:

The idea is that in a world of data and query federation (for

example,

consider [1] where Arrow is being used as a data federation

layer

with

many

query engines), we want to increase the amount of data

in-flight

and

in-memory that is in Arrow format. So if query engines are

having

to

depart

substantially from the Arrow format to get performance, then

this

creates a

potential lose-lose situation: * Depart from Arrow: get

better

performance

but pay serialization costs to read and write Arrow (the

performance

and

resource utilization benefits outweigh the serialization

costs).

This

puts

additional pressure on query engines to build specialized

components

for

solving problems rather than making use of off-the-shelf

components

that

use Arrow. This has knock-on effects on ecosystem

fragmentation. *

Or

use

Arrow, and accept suboptimal query processing performance


Will mentions one usecase is Velox consuming python UDF output,

which

seems

to be mostly about how fast Velox can consume this format,

not

how

fast

it

can be written. Are there other usecases?

To be clear, I don't know if that's the use case they want.

That's

just

me

speculating.

I still have some questions myself:

1. Is this array type currently only used in Velox? (not DuckDB

like

some

of the other new types?) What evidence do we have that it will

become

used

outside of Velox?
2. We already have three list types: list, large list (64-bit

offsets),

and

fixed size list. Do we think we will only want a view version

of

the

32-bit

offset variable length list? Or are we potentially talking

about

view

variants for all three?

Best,

Will Jones

[1]

https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx

[2]

https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3

[3]

https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t

On Mon, May 22, 2023 at 3:48 AM Andrew Lamb <

al...@influxdata.com>

wrote:

Can someone help distill down the primary rationale and

usecase for

adding ArrayView to the Arrow Spec?

 From the above discussions, the stated rationale seems to be

fast

(zero-copy) interchange with Velox.

This thread has qualitatively enumerated the benefits of

(offset+len)

encoding over the existing Arrow ListArray (offets) approach,

but I

haven't

seen any performance measurements that might help us to gauge

the

tradeoff

in additional complexity vs runtime overhead.

Will mentions one usecase is Velox consuming python UDF

output,

which

seems

to be mostly about how fast Velox can consume this format,

not

how

fast

it

can be written. Are there other usecases?

Do we have numbers showing how much overhead converting to

/from

Velox's

internal representation and the existing ListArray adds? Has

anyone in

Velox land considered adding faster support for Arrow style

ListArray

encoding?


Andrew

On Mon, May 22, 2023 at 4:38 AM Antoine Pitrou <

anto...@python.org

wrote:

Hi,

I don't understand why we would start deprecating features

in the

Arrow

format. Even starting this talk might already be a bad idea

PR-wise.

As for implementing conversions at the I/O boundary, it's a

reasonably

policy, but it still requires work by implementors and it's

not

granted

that all consumers of the Arrow format will grow such

conversions

if/when we add non-trivial types such as ListView or

StringView.

Regards

Antoine.


Le 22/05/2023 à 00:39, Will Jones a écrit :

One more thing: Looking back on the previous

discussion[1]

(which

Weston

pointed out in his earlier message), Jorge suggested that

the

old

list

types might be deprecated in favor of view variants [2].

Others

were

worried that it might undermine the perception that the

Arrow

format

is

stable. I think it might be worth thinking about "soft

deprecating"

the

old

list type: suggesting new implementations prefer the list

view, but

reassuring that implementations should support the old

format,

even

if

they

just convert to the new format. To be clear, this

wouldn't

mean we

plan

to

create breaking changes in the format; but if we ever did

for

other

reasons, the old list type might go.

Arrow compute libraries could choose either format for

compute

support,

and

plan to do conversion at the boundaries. Libraries that

use

the new

type

will have cheap conversion when reading the old type.

Meanwhile

those

that

are still on the old type will have some incentive to

move

towards

the

new

one, since that conversion will not be as efficient.

[1]

https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq

[2]

https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx

On Sun, May 21, 2023 at 3:07 PM Will Jones <

will.jones...@gmail.com>

wrote:

Hello,

I think Sasha brings up a good point, that the

advantages

of

this

format

seem to be primarily about query processing. Other

encodings

like

REE

and

dictionary have space-saving advantages that justify

them

simply

in

terms

of space efficiency (although they have query processing

advantages

as

well). As discussed, most use cases are already well

served by

existing

list types and dictionary encoding.

I agree that there are cases where transferring this

type

without

conversion would be ideal. One use case I can think of

is

if

Velox

wants to

be able to take Arrow-based UDFs (possibly written with

PyArrow,

for

example) that operate on this column type and therefore

wants

zero-copy

exchange over the C Data Interface.

One big question I have: we already have three list

types:

list,

large

list (64-bit offsets), and fixed size list. Do we think

we

will

only

want a

view version of the 32-bit offset variable length list?

Or

are we

potentially talking about view variants for all three?

Best,

Will Jones


On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira

Carvalho <

felipe...@gmail.com> wrote:

The benefit of having a memory format that’s friendly

to

non-deterministic

order writes is unlocked by the transport and

processing

of

the

data

being

agnostic to the physical order as much as possible.

Requiring a conversion could cancel out that benefit.

But it

can

be a

provisory step for compatibility between systems that

don’t

understand

the

format yet. This is similar to the situation with

compression

schemes

like

run-end encoding — the goal is processing the

compressed

data

directly

without an expansion step whenever possible.

This is why having it as part of the open Arrow format

is so

important:

everyone can agree on a format that’s friendly to

parallel

and/or

vectorized compute kernels without introducing multiple

incompatible

formats to the ecosystem and without imposing a

conversion

step

between

the
different systems.

—
Felipe

On Sat, 20 May 2023 at 20:04 Aldrin

<octalene....@pm.me.invalid>

wrote:

I don't feel like this representation is necessarily a

detail of

the

query

engine, but I am also not sure why this representation

would

have

to

be

converted to a non-view format when serializing. Could

you

clarify

that? My

impression is that this representation could be used

for

persistence

or

data transfer, though it can be more complex to

guarantee

the

portion

of

the buffer that an index points to is also present in

memory.

Sent from Proton Mail for iOS


On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <

krassovskysa...@gmail.com

<On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>>

wrote:

Hi everyone,
I understand that there are numerous benefits to this

representation

during query processing, but would it be fair to say

that

this

is

an

implementation detail of the query engine? Query

engines

don’t

necessarily

need to conform to the Arrow format internally, only

at

ingest/egress

points, and performing a conversion from the non-view

to

view

format

seems

like it would be very cheap (though I understand not

necessarily

the

other

way around, but you’d need to do that anyway if you’re

serializing).

Sasha Krassovsky

20 мая 2023 г., в 13:00, Will Jones <

will.jones...@gmail.com>

написал(а):

Thanks for sharing these details, Pedro. The

conditional

branches

argument

makes a lot of sense to me.

The tensors point brings up some interesting issues.

For

now,

we've

defined

our only tensor extension type to be built on a fixed

size

list.

If a

use

case of this might be manipulating tensors with zero

copy,

perhaps

that

suggests that we want a fixed size list variant? In

addition,

would

we

have

to define another extension type to be a ListView

variant?

Or

would

we

want

to think about making extension types somehow valid

across

various

encodings of the same "logical type"?

On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha

Pedreira

<pedro...@meta.com.invalid> wrote:

Hi all,

This is Pedro from the Velox team at Meta. This is

my

first

time

here,

so

nice to e-meet you!

Adding to what Felipe said, the main reason we

created

“ListView”

(though

we just call them ArrayVector/MapVector in Velox) is

that,

along

with

StringViews for strings, they allow us to write any

columnar

buffer

out-or-order, regardless of their types or

encodings.

This is

naturally

doable for all primitive types (fixed-size), but not

for

types

that

don’t

have fixed size and are required to be contiguous.

The

StringView

and

ListView formats allow us to keep this invariant in

Velox.

Being able to write vectors out-of-order is useful

when

executing

conditionals like IF/SWITCH statements, which are

pervasive

among

our

workloads. To fully vectorize it, one first

evaluates

the

expression,

then

generate a bitmap containing which rows take the

THEN

and

which

take

the

ELSE branch. Then you populate all rows that match

the

first

branch

by

evaluating the THEN expression in a vectorized

(branch-less

and

cache

friendly) way, and subsequently the ELSE branch. If

you

can’t

write

them

out-of-order, you would either have a big branch per

row

dispatching

to

the

right expression (slow), or populate two distinct

vectors

then

merging

them

at the end (potentially even slower). How much

faster

our

approach

is

highly depends on the buffer sizes and expressions,

but we

found

it

to

be

faster enough on average to justify us extending the

underlying

layout.

With that said, this is all within a single thread

of

execution.

Parallelization is done by feeding each thread its

own

vector/data.

As

pointed out in a previous message, this also gives

you the

flexibility

to

implement cardinality increasing/reducing

operations,

but

we

don’t

use

it

for that purpose. Operations like filtering,

joining,

unnesting

and

similar

are done by wrapping the internal vector in a

dictionary,

as

these

need

to

work not only on “ListViews” but on any data types

with

any

encoding.

There

are more details on Section 4.2.1 in [1]

Beyond this, it also gives function/kernel

developers

more

flexibility

to

implement operations that manipulate Arrays/Maps.

For

example,

operations

that slice these containers can be implemented in a

zero-copy

manner

by

just rearranging the lengths/offsets indices,

without

ever

touching

the

larger internal buffers. This is a similar

motivation

as

for

StringView

(think of substr(), trim(), and similar). One nice

last

property

is

that

this layout allows for overlapping ranges. This is

something

discussed

with

our ML people to allow deduping feature values in a

tensor

(which

is

fairly

common), but not something we have leveraged just

yet.

[1] -

https://vldb.org/pvldb/vol15/p3372-pedreira.pdf

Best,
--
Pedro Pedreira
________________________________
From: Felipe Oliveira Carvalho <felipe...@gmail.com
Sent: Friday, May 19, 2023 10:01 AM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Cc: Pedro Eugenio Rocha Pedreira <pedro...@meta.com
Subject: Re: [DISCUSS][Format] Starting the draft

implementation

of

the

ArrayView array format

+pedroerp On Thu, 11 May 2023 at 17: 51 Raphael

Taylor-Davies

<r.

taylordavies@ googlemail. com. invalid> wrote: Hi

All, >

if

we

added

this, do we think many Arrow and query > engine

implementations

(for

example, DataFusion) will be
ZjQcmQRYFpfptBannerStart
This Message Is From an External Sender

ZjQcmQRYFpfptBannerEnd
+pedroerp

On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:
Hi All,

if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion)

will be

eager

to

add

full

support for the type, including compute kernels? Or

are

they

likely

to

just

convert this type to ListArray at import

boundaries?

I can't speak for query engines in general, but at

least

for

arrow-rs

and by extension DataFusion, and based on my current

understanding

of

the use-cases I would be rather hesitant to add

support

to the

kernels

for this array type, definitely instead favouring

conversion

at

the

edges. We already have issues with the amount of

code

generation

resulting in binary bloat and long compile times,

and

worry

this

would

worsen this situation whilst not really providing

compelling

advantages

for the vast majority of workloads that don't

interact

with

Velox.

Whilst I can definitely see that the ListView

representation

is

probably

a better way to represent variable length lists than

what

arrow

settled

upon, I'm not yet convinced it is sufficiently

better

to

incentivise

broad ecosystem adoption.

Kind Regards,

Raphael Taylor-Davies

On 11/05/2023 21:20, Will Jones wrote:
Hi Felipe,

Thanks for the additional details.

Velox kernels benefit from being able to append

data to

the

array

from

different threads without care for strict

ordering.

Only the

offsets

array

has to be written according to logical order but

that is

potentially a

much

smaller buffer than the values buffer.

It still seems to me like applications are still

pretty

niche,

as I

suspect

in most cases the benefits are outweighed by the

costs.

The

benefit

here

seems pretty limited: if you are trying to split

work

between

threads,

usually you will have other levels such as array

chunks

to

parallelize.

And

if you have an incoming stream of row data, you'll

want

to

append

in

predictable order to match the order of the other

arrays. Am

missing

something?

And, IIUC, the cost of using ListView with

out-of-order

values

over

ListArray is you lose memory locality; the values

of

element

are

no

longer adjacent to the values of element 1. What do

you

think

about

that

tradeoff?

I don't mean to be difficult about this. I'm

excited

for

both

the

REE

and

StringView arrays, but this one I'm not so sure

about

yet. I

suppose

what I

am trying to ask is, if we added this, do we think

many

Arrow

and

query

engine implementations (for example, DataFusion)

will be

eager

to

add

full

support for the type, including compute kernels? Or

are

they

likely

to

just

convert this type to ListArray at import

boundaries?

Because if it turns out to be the latter, then we

might

as

well

ask

Velox

to export this type as ListArray and save the rest

of the

ecosystem

some

work.

Best,

Will Jones

On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira

Carvalho <

felipe...@gmail.com<mailto:felipe...@gmail.com>>

wrote:

Initial reason for ListView arrays in Arrow is

zero-copy

compatibility

with

Velox which uses this format.

Velox kernels benefit from being able to append

data to

the

array

from

different threads without care for strict

ordering.

Only the

offsets

array

has to be written according to logical order but

that is

potentially a

much

smaller buffer than the values buffer.

Acero kernels could take advantage of that in the

future.

In implementing ListViewArray/Type I was able to

reuse

some

C++

templates

used for ListArray which can reduce some of the

burden

on

kernel

implementations that aim to work with all the

types.

I’m can fix Acero kernels for working with

ListView.

This is

similar

to

the

work I’ve doing in kernels dealing with run-end

encoded

arrays.

—
Felipe


On Wed, 26 Apr 2023 at 01:03 Will Jones <

will.jones...@gmail.com

<mailto:will.jones...@gmail.com>> wrote:

I suppose one common use case is materializing

list

columns

after

some

expanding operation like a join or unnest.

That's a

case

where

could

imagine a lot of repetition of values. Haven't

yet

thought

of

common

cases

where there is overlap but not full duplication,

but am

eager

to

hear

any.

The dictionary encoding point Raphael makes is

interesting,

especially

given the existence of LargeList and

FixedSizeList. For

many

operations,

it

might make more sense to just compose those

existing

types.

IIUC the operations that would be unique to the

ArrayView

are

ones

altering

the shape. One could truncate each array to a

certain

length

cheaply

simply

by replacing the sizes buffer. Or perhaps there

are

interesting

operations

on tensors that would benefit.

On Tue, Apr 25, 2023 at 7:47 PM Raphael

Taylor-Davies

<r.taylordav...@googlemail.com.invalid> wrote:

Unless I am missing something, I think the

selection

use-case

could

be

equally well served by a dictionary-encoded

BinarArray/ListArray,

and

would

have the benefit of not requiring any

modifications

to the

existing

format

or kernels.

The major additional flexibility of the proposed

encoding

would

be

permitting disjoint or overlapping ranges, are

these

common

enough

in

practice to represent a meaningful bottleneck?


On 26 April 2023 01:40:14 BST, David Li <

lidav...@apache.org

<mailto:

lidav...@apache.org>> wrote:

Is there a need for a 64-bit offsets version

the

same way

we

have

List

and LargeList?

And just to be clear, the difference with List

is

that

the

lists

don't

have to be stored in their logical order (or in

other

words,

offsets

do

not

have to be nondecreasing and so we also need

sizes)?

On Wed, Apr 26, 2023, at 09:37, Weston Pace

wrote:

For context, there was some discussion on this

back

in

[1].

At

that

time

this was called "sequence view" but I do not

like

that

name.

However,

array-view array is a little confusing. Given

this

is

similar

to

list

can

we go with list-view array?

Thanks for the introduction. I'd be

interested

to

hear

about

the

applications Velox has found for these

vectors,

and in

what

situations

they

are useful. This could be contrasted with the

current

ListArray

implementations.

I believe one significant benefit is that take

(and

by

proxy,

filter)

and

sort are O(# of items) with the proposed

format

and

O(#

of

bytes)

with

the

current format. Jorge did some profiling to

this

effect

in

[1].

[1]

https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq<

https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq

On Tue, Apr 25, 2023 at 3:13 PM Will Jones <

will.jones...@gmail.com

<mailto:will.jones...@gmail.com>

wrote:

Hi Felipe,

Thanks for the introduction. I'd be

interested

to

hear

about

the

applications Velox has found for these

vectors,

and in

what

situations

they

are useful. This could be contrasted with the

current

ListArray

implementations.

IIUC it would be fairly cheap to transform a

ListArray

to

an

ArrayView, but

expensive to go the other way.

Best,

Will Jones

On Tue, Apr 25, 2023 at 3:00 PM Felipe

Oliveira

Carvalho

felipe...@gmail.com<mailto:

felipe...@gmail.com

wrote:

Hi folks,

I would like to start a public discussion on

the

inclusion

of a

new

array

format to Arrow — array-view array. The name

is

also

up

for

debate.

This format is inspired by Velox's

ArrayVector

format

[1].

Logically,

this

array represents an array of arrays. Each

element

is

an

array-view

(offset

and size pair) that points to a range

within a

nested

"values"

array

(called "elements" in Velox docs). The

nested

array

can

be

of

any

type,

which makes this format very flexible and

powerful.

[image: ../_images/array-vector.png]
<

https://facebookincubator.github.io/velox/_images/array-vector.png

https://facebookincubator.github.io/velox/_images/array-vector.png

I'm currently working on a C++

implementation

and

plan

to

work

on a

Go

implementation to fulfill the

two-implementations

requirement

for

format

changes.

The draft design:

- 3 buffers: [validity_bitmap, int32 offsets

buffer,

int32

sizes

buffer]

- 1 child array: "values" as an array of the

type

parameter

validity_bitmap is used to differentiate

between

empty

array

views

(sizes[i] == 0) and NULL array views

(validity_bitmap[i]

==

0).

When the validity_bitmap[i] is 0, both sizes

and

offsets

are

undefined

(as

usual), and when sizes[i] == 0, offsets[i]

is

undefined. 0

is

recommended

if setting a value is not an issue to the

system

producing

the

arrays.

offsets buffer is not required to be ordered

and

views

don't

have

to

be

disjoint.

[1]

https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector

https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector

Thanks,
Felipe O. Carvalho

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to