[
https://issues.apache.org/jira/browse/ARROW-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197777#comment-16197777
]
Paul Taylor commented on ARROW-1652:
------------------------------------
[~bhulette] yes, it's worth revisiting how the batch lookup happens.
The reason `get` doesn't take a batch hint currently is that without inspecting
the loaded batches before iteration, the iterator would need to do the batch
lookup on each item anyway:
```
for (let index = 0; index < vector.length; ++index) {
vector.get(index, vector.findBatch(index)); // <-- hypothetical `get` that
takes a batch hint
}
```
Also, I measured the cost of a variant of `get` that returns a tuple of
`[value, batchIndex]`, and it was much slower than incurring the batch-lookup
cost each time. The best we could do is expose the batches, so an external
iterator could increment the batch hint based on the index being iterated:
```
let batches = vector.batches();
let iterated = 0, index = -1, batch = -1;
while (++index < vector.length) {
if (iterated >= index) {
iterated += batches[++batch].length;
}
vector.get(index, batch);
}
```
But this approach also has a number of drawbacks:
1. Forcing the iterator to do this every time isn't very ergonomic
2. It's only helpful for contiguous iterations (from `start` < `index` <
`end`). Random lookups will still have to reverse-lookup the batch index from
the item index each get call
The [Vector
iterators](https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L78)
don't have to do the batch index lookup, since they always iterate in a
defined order:
```
// Always iterates from 0 -> length, faster than calling `get()` in a loop
for (let x of vector) { }
```
Assuming contiguous iteration, we could add `skip(n)`/`skipLast(n)` and
`take(n)`/`takeLast(n)` methods, which return a clone of the Vector with
trimmed internal batches list. This would only require `subarray` calls on the
first/last TypedArrays, so still zero-copy:
```
for (let x of vector.skip(10).take(10)) { // <-- iterate indexes 10 through 20
}
```
It's worth point out that `Vector.range()` taking a batch hint is a bit of
slight-of-hand to accommodate the
[ListVectorBase](https://github.com/apache/arrow/blob/master/js/src/vector/list.ts#L43)
offset lookups. The ListVector `from` and `to` offsets are relative to the
RecordBatch they were introduced, so passing the batch index into the slice
method causes slice to lookup the `from` and `to` offsets relative to the batch
they're from, instead of adjusting them relative to all the batches:
```
[{ /* RecordBatch 0 */
offsets: [0, 5, 10],
values: "hellobrian"
}, { /* RecordBatch 1 */
offsets: [0, 4, 8],
values: "frompaul"
}]
// vector.get(2) decomposes to:
// vector.values.slice(0, 4, 1) <-- offsets 0, 4 in RecordBatch 1
```
> [JS] Batch hint for Vector.get
> ------------------------------
>
> Key: ARROW-1652
> URL: https://issues.apache.org/jira/browse/ARROW-1652
> Project: Apache Arrow
> Issue Type: Improvement
> Components: JavaScript
> Reporter: Brian Hulette
> Assignee: Paul Taylor
> Labels: Performance
>
> The {{Vector.get}} function just accepts an index, and looks up the
> appropriate record batch on every call. This can lead to a lot of additional
> lookups when iterating by index. It would be nice if {{Vector.get}} accepted
> an optional batch hint, similar to
> [{{Vector.range}}|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L51]
> Additionally, if {{Table}} had some knowledge of the batches in its Vectors,
> it could use this batch hint to improve performance when iterating over rows.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)