[
https://issues.apache.org/jira/browse/ARROW-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205278#comment-16205278
]
Paul Taylor commented on ARROW-1652:
------------------------------------
[~bhulette]
{quote}
The Vector iterators don't help since they can't be used simultaneously across
multiple Vectors (or can they? is there something like python's zip that we
could use?)
{quote}
Yeah totally. I've been using [IxJS|https://github.com/ReactiveX/IxJS] for
this, it's basically a lazy version of {{lodash}} for Iterables and
AsyncIterables from the team that did the Enumerable and Observable extension
methods in C#. The Ix operators work on any JS Iterable type, so they work with
the JS Vectors. You can use the operators in a functional style, or cast the
Vectors to Iterables and work with a full prototype (which I find is easier
with intellisense):
{code:javascript}
import * as fs from 'fs';
import { Table } from 'apache-arrow';
import { zip } from 'ix/iterable/zip';
import { AsyncSink, AsyncIterable } from 'ix';
let tables = AsyncIterable
.from(fs
.createReadStream('table.arrow')
.pipe(new AsyncSink()))
.toArray()
.map((buffers) => Table.from(...buffers));
for await (let table of tables) {
let colA = table.getColumn('A');
let colB = table.getColumn('B');
for (let [a, b] of zip(colA, colB)) {
console.log(a + b);
}
}
{code}
It's also helpful when you're working with a streaming source:
{code:javascript}
import { zip } from 'ix/iterable/zip';
import { readBuffers } from 'apache-arrow';
import { AsyncSink, AsyncIterable } from 'ix';
let vectors = AsyncIterable
.from(fetch('http://some-resource.arrow'))
.flatMap(({ body }) => await body.getReader().pipeTo(new AsyncSink()))
.toArray() // todo: `readBuffersAsync`, so we don't have to aggregate the
stream here
.flatMap((buffers) => readBuffers(...buffers))
for await (let [colA, colB] of vectors) {
for (let sum of zip(([a, b]) => a + b, colA, colB)) {
console.log(sum);
}
}
{code}
In addition to zip, there's a ton of other useful columnar transforms like
scan, map, filter, reduce, min/max/groupBy, inner/outer/groupJoin,
distinctUntilChanged, and operations for converting between sync/async
Iterables. Pretty slick.
{quote}
If the Table required that each of it's vectors have the same batches, and
stored that in a batches array, *rows could just iterate over each batch, and
then over each index within it.
{quote}
You're absolutely right, I threw the {{*rows()}} iterator in at the last minute
b/c it looked good in a gist. Ideally it would zip the columns. The whole Table
thing deserves a make over :-)
{quote}
The most significant changes are in {{ListVector}} and {{IndexVector}}, because
I removed the {{returnWithBatchIndex}} argument and made some tweaks to the way
offset vectors are handled (Rather than subtracting batch when accessing data,
use {{length - 1}} to reduce {{index}} while iterating)
{quote}
God that's awesome, thanks for seeing that! I spent nearly a day debugging that
until I noticed they were always off by {{batch}} many. It didn't even occur to
me to subtract 1 while iterating 😂.
{quote}
The batch hint could be an optional parameter on Vector.get(..). That way no
one is forced to use it, and random access would be more intuitive. Or it could
be an entirely separate function Vector.getFromBatch(i, batch).
{quote}
Yeah sorry I forgot to mention in my previous comment that I [originally had a
batch hint in
get|https://github.com/graphistry/arrow/blame/05372ddf3be33847ebce9f7458ee59e7ada1b15a/src/vectors/typed.ts#L42],
almost identical to your PR 😅. As I was working through the tests I realized
the batch hint doesn't always compose across Vectors. Considering how heavily
composition is employed in the Vector design, {{get}} should always be
idempotent with respect to its iterator form.
For example, cloning a Vector and pairing it with a new validity Vector could
result in the vectors having different internal batch structures. In this case
the parent could call the ValidityVector's {{get}} with the wrong indexes,
batch hints, or both. I could be mistaken, but I think applies to all the
composed types (StructVector, ListVectorBase, DictionaryVector)? Unfortunately
this tight coupling does exist between the ListVectorBase and its offset
Vector. I didn't see an alternative given the memory layout, so I made it an
internal implementation detail. It would be great to change that it if possible.
{quote}
My end goal here is just improving performance when iterating over multiple
vectors, so if anyone has other ideas on how to do that I'd be happy to ditch
this idea. Maybe there's some way to use multiple Vector iterators
simultaneously that I'm missing?
{quote}
I'm with you 100%, speed is the name of the game.
Last weekend I benchmarked iterating 25MM Float32s across 22 batches (in
turbofan) to see if I could make the reverse {{index -> batch}} lookup faster:
- a for-loop with hint takes ~340ms
- the way {{get}} works today takes ~750ms
- regular iterators take ~640ms (avoiding intermediate {{IteratorResult}}
allocations)
- a straight memcpy takes ~100ms, the for-loop with hint takes ~340ms
- a special forward-only iterator function (that allows skipping ahead but not
back) takes ~350ms
- skip list and sparse array were both total disasters in the ~5-10s range
These ratios seem in line with the Arrow performance tests. While I share your
opinions on speed, the JIT is famously difficult to predict. I think our energy
will be better applied writing Arrow-optimized iteration operators
(filter/skip/take, sort, groupBy) that cleverly cache intermediate indexes and
make liberal use of zero-copy slicing than attempting to micro-optimize the
difference between {{for}} and {{for...of}} loops.
Does that all sound reasonable or am I way off base?
> [JS] Batch hint for Vector.get
> ------------------------------
>
> Key: ARROW-1652
> URL: https://issues.apache.org/jira/browse/ARROW-1652
> Project: Apache Arrow
> Issue Type: Improvement
> Components: JavaScript
> Reporter: Brian Hulette
> Assignee: Paul Taylor
> Labels: Performance, pull-request-available
>
> The {{Vector.get}} function just accepts an index, and looks up the
> appropriate record batch on every call. This can lead to a lot of additional
> lookups when iterating by index. It would be nice if {{Vector.get}} accepted
> an optional batch hint, similar to
> [{{Vector.range}}|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L51]
> Additionally, if {{Table}} had some knowledge of the batches in its Vectors,
> it could use this batch hint to improve performance when iterating over rows.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)