[ 
https://issues.apache.org/jira/browse/ARROW-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197777#comment-16197777
 ] 

Paul Taylor commented on ARROW-1652:
------------------------------------

[~bhulette] yes, it's worth revisiting how the batch lookup happens.

The reason `get` doesn't take a batch hint currently is that without inspecting 
the loaded batches before iteration, the iterator would need to do the batch 
lookup on each item anyway:

```
for (let index = 0; index < vector.length; ++index) {
  vector.get(index, vector.findBatch(index)); // <-- hypothetical `get` that 
takes a batch hint
}
```

Also, I measured the cost of a variant of `get` that returns a tuple of 
`[value, batchIndex]`, and it was much slower than incurring the batch-lookup 
cost each time. The best we could do is expose the batches, so an external 
iterator could increment the batch hint based on the index being iterated:

```
let batches = vector.batches();
let iterated = 0, index = -1, batch = -1;
while (++index < vector.length) {
  if (iterated >= index) {
    iterated += batches[++batch].length;
  }
  vector.get(index, batch);
}
```

But this approach also has a number of drawbacks:

1. Forcing the iterator to do this every time isn't very ergonomic
2. It's only helpful for contiguous iterations (from `start` < `index` < 
`end`). Random lookups will still have to reverse-lookup the batch index from 
the item index each get call

The [Vector 
iterators](https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L78)
 don't have to do the batch index lookup, since they always iterate in a 
defined order:

```
// Always iterates from 0 -> length, faster than calling `get()` in a loop
for (let x of vector) { }
```

Assuming contiguous iteration, we could add `skip(n)`/`skipLast(n)` and 
`take(n)`/`takeLast(n)` methods, which return a clone of the Vector with 
trimmed internal batches list. This would only require `subarray` calls on the 
first/last TypedArrays, so still zero-copy:

```
for (let x of vector.skip(10).take(10)) { // <-- iterate indexes 10 through 20
}
```

It's worth point out that `Vector.range()` taking a batch hint is a bit of 
slight-of-hand to accommodate the 
[ListVectorBase](https://github.com/apache/arrow/blob/master/js/src/vector/list.ts#L43)
 offset lookups. The ListVector `from` and `to` offsets are relative to the 
RecordBatch they were introduced, so passing the batch index into the slice 
method causes slice to lookup the `from` and `to` offsets relative to the batch 
they're from, instead of adjusting them relative to all the batches:

```
[{   /* RecordBatch 0 */
  offsets: [0, 5, 10],
  values: "hellobrian"
}, { /* RecordBatch 1 */
  offsets: [0, 4, 8],
  values: "frompaul"
}]
// vector.get(2) decomposes to:
// vector.values.slice(0, 4, 1) <-- offsets 0, 4 in RecordBatch 1
```

> [JS] Batch hint for Vector.get
> ------------------------------
>
>                 Key: ARROW-1652
>                 URL: https://issues.apache.org/jira/browse/ARROW-1652
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: JavaScript
>            Reporter: Brian Hulette
>            Assignee: Paul Taylor
>              Labels: Performance
>
> The {{Vector.get}} function just accepts an index, and looks up the 
> appropriate record batch on every call. This can lead to a lot of additional 
> lookups when iterating by index. It would be nice if {{Vector.get}} accepted 
> an optional batch hint, similar to 
> [{{Vector.range}}|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L51]
> Additionally, if {{Table}} had some knowledge of the batches in its Vectors, 
> it could use this batch hint to improve performance when iterating over rows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to