[ 
https://issues.apache.org/jira/browse/ARROW-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205278#comment-16205278
 ] 

Paul Taylor commented on ARROW-1652:
------------------------------------

[~bhulette] 
{quote}
The Vector iterators don't help since they can't be used simultaneously across 
multiple Vectors (or can they? is there something like python's zip that we 
could use?)
{quote}

Yeah totally. I've been using [IxJS|https://github.com/ReactiveX/IxJS] for 
this, it's basically a lazy version of {{lodash}} for Iterables and 
AsyncIterables from the team that did the Enumerable and Observable extension 
methods in C#. The Ix operators work on any JS Iterable type, so they work with 
the JS Vectors. You can use the operators in a functional style, or cast the 
Vectors to Iterables and work with a full prototype (which I find is easier 
with intellisense):

{code:javascript}
import * as fs from 'fs';
import { Table } from 'apache-arrow';
import { zip } from 'ix/iterable/zip';
import { AsyncSink, AsyncIterable } from 'ix';

let tables = AsyncIterable
  .from(fs
    .createReadStream('table.arrow')
    .pipe(new AsyncSink()))
  .toArray()
  .map((buffers) => Table.from(...buffers));

for await (let table of tables) {
  let colA = table.getColumn('A');
  let colB = table.getColumn('B');
  for (let [a, b] of zip(colA, colB)) {
    console.log(a + b);
  }
}
{code}

It's also helpful when you're working with a streaming source:

{code:javascript}
import { zip } from 'ix/iterable/zip';
import { readBuffers } from 'apache-arrow';
import { AsyncSink, AsyncIterable } from 'ix';

let vectors = AsyncIterable
  .from(fetch('http://some-resource.arrow'))
  .flatMap(({ body }) => await body.getReader().pipeTo(new AsyncSink()))
  .toArray() // todo: `readBuffersAsync`, so we don't have to aggregate the 
stream here
  .flatMap((buffers) => readBuffers(...buffers))

for await (let [colA, colB] of vectors) {
  for (let sum of zip(([a, b]) => a + b, colA, colB)) {
    console.log(sum);
  }
}
{code}

In addition to zip, there's a ton of other useful columnar transforms like 
scan, map, filter, reduce, min/max/groupBy, inner/outer/groupJoin, 
distinctUntilChanged, and operations for converting between sync/async 
Iterables. Pretty slick.

{quote}
If the Table required that each of it's vectors have the same batches, and 
stored that in a batches array, *rows could just iterate over each batch, and 
then over each index within it.
{quote}

You're absolutely right, I threw the {{*rows()}} iterator in at the last minute 
b/c it looked good in a gist. Ideally it would zip the columns. The whole Table 
thing deserves a make over :-)

{quote}
The most significant changes are in {{ListVector}} and {{IndexVector}}, because 
I removed the {{returnWithBatchIndex}} argument and made some tweaks to the way 
offset vectors are handled (Rather than subtracting batch when accessing data, 
use {{length - 1}} to reduce {{index}} while iterating)
{quote}

God that's awesome, thanks for seeing that! I spent nearly a day debugging that 
until I noticed they were always off by {{batch}} many. It didn't even occur to 
me to subtract 1 while iterating 😂.

{quote}
The batch hint could be an optional parameter on Vector.get(..). That way no 
one is forced to use it, and random access would be more intuitive. Or it could 
be an entirely separate function Vector.getFromBatch(i, batch).
{quote}

Yeah sorry I forgot to mention in my previous comment that I [originally had a 
batch hint in 
get|https://github.com/graphistry/arrow/blame/05372ddf3be33847ebce9f7458ee59e7ada1b15a/src/vectors/typed.ts#L42],
 almost identical to your PR 😅. As I was working through the tests I realized 
the batch hint doesn't always compose across Vectors. Considering how heavily 
composition is employed in the Vector design, {{get}} should always be 
idempotent with respect to its iterator form.

For example, cloning a Vector and pairing it with a new validity Vector could 
result in the vectors having different internal batch structures. In this case 
the parent could call the ValidityVector's {{get}} with the wrong indexes, 
batch hints, or both. I could be mistaken, but I think applies to all the 
composed types (StructVector, ListVectorBase, DictionaryVector)? Unfortunately 
this tight coupling does exist between the ListVectorBase and its offset 
Vector. I didn't see an alternative given the memory layout, so I made it an 
internal implementation detail. It would be great to change that it if possible.

{quote}
My end goal here is just improving performance when iterating over multiple 
vectors, so if anyone has other ideas on how to do that I'd be happy to ditch 
this idea. Maybe there's some way to use multiple Vector iterators 
simultaneously that I'm missing?
{quote}

I'm with you 100%, speed is the name of the game.

Last weekend I benchmarked iterating 25MM Float32s across 22 batches (in 
turbofan) to see if I could make the reverse {{index -> batch}} lookup faster:

- a for-loop with hint takes ~340ms
- the way {{get}} works today takes ~750ms
- regular iterators take ~640ms (avoiding intermediate {{IteratorResult}} 
allocations)
- a straight memcpy takes ~100ms, the for-loop with hint takes ~340ms
- a special forward-only iterator function (that allows skipping ahead but not 
back) takes ~350ms
- skip list and sparse array were both total disasters in the ~5-10s range

These ratios seem in line with the Arrow performance tests. While I share your 
opinions on speed, the JIT is famously difficult to predict. I think our energy 
will be better applied writing Arrow-optimized iteration operators 
(filter/skip/take, sort, groupBy) that cleverly cache intermediate indexes and 
make liberal use of zero-copy slicing than attempting to micro-optimize the 
difference between {{for}} and {{for...of}} loops.

Does that all sound reasonable or am I way off base?


> [JS] Batch hint for Vector.get
> ------------------------------
>
>                 Key: ARROW-1652
>                 URL: https://issues.apache.org/jira/browse/ARROW-1652
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: JavaScript
>            Reporter: Brian Hulette
>            Assignee: Paul Taylor
>              Labels: Performance, pull-request-available
>
> The {{Vector.get}} function just accepts an index, and looks up the 
> appropriate record batch on every call. This can lead to a lot of additional 
> lookups when iterating by index. It would be nice if {{Vector.get}} accepted 
> an optional batch hint, similar to 
> [{{Vector.range}}|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L51]
> Additionally, if {{Table}} had some knowledge of the batches in its Vectors, 
> it could use this batch hint to improve performance when iterating over rows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to