[
https://issues.apache.org/jira/browse/ARROW-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197777#comment-16197777
]
Paul Taylor edited comment on ARROW-1652 at 10/9/17 10:25 PM:
--------------------------------------------------------------
[~bhulette] yes, it's worth revisiting how the batch lookup happens.
The reason {{get}} doesn't take a batch hint currently is that without
inspecting the loaded batches before iteration, the iterator would need to do
the batch lookup on each item anyway:
{code}
for (let index = 0; index < vector.length; ++index) {
vector.get(index, vector.findBatch(index)); // <-- hypothetical `get` that
takes a batch hint
}
{code}
Also, I measured the cost of a variant of {{get}} that returns a tuple of
{{[value, batchIndex]}}, and it was much slower than incurring the batch-lookup
cost each time. The best we could do is expose the batches, so an external
iterator could increment the batch hint based on the index being iterated:
{code}
let batches = vector.batches();
let iterated = 0, index = -1, batch = -1;
while (++index < vector.length) {
if (iterated >= index) {
iterated += batches[++batch].length;
}
vector.get(iterated - index, batch);
}
{code}
But this approach also has a number of drawbacks:
1. Forcing the iterator to do this every time isn't very ergonomic
2. It's only helpful for contiguous iterations (from {{start}} < {{index}} <
{{end}}). Random lookups will still have to reverse-lookup the batch index from
the item index each get call
The [Vector
iterators|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L78]
don't have to do the batch index lookup, since they always iterate in a
defined order:
{code}
// Always iterates from 0 -> length, faster than calling `get()` in a loop
for (let x of vector) { }
{code}
Assuming contiguous iteration, we could add {{skip()}}/{{skipLast()}} and
{{take()}}/{{takeLast()}} methods, which return a clone of the Vector with
trimmed internal batches list. This would only require {{subarray}} calls on
the first/last TypedArrays, so still zero-copy:
{code}
for (let x of vector.skip(10).take(10)) { // <-- iterate indexes 10 through 20
}
{code}
I also want to call out that {{Vector.range()}} taking a batch hint is a bit of
slight-of-hand to accommodate the
[ListVectorBase|(https://github.com/apache/arrow/blob/master/js/src/vector/list.ts#L43]
offset lookups. The ListVector {{from}} and {{to}} offsets are relative to the
RecordBatch they were introduced, so passing the batch index into the slice
method causes slice to lookup the {{from}} and {{to}} offsets relative to the
batch they're from, instead of adjusting them relative to all the batches:
{code}
[{ /* RecordBatch 0 */
offsets: [0, 5, 10],
values: "hellobrian"
}, { /* RecordBatch 1 */
offsets: [0, 4, 8],
values: "frompaul"
}]
// vector.get(2) decomposes to:
// vector.values.slice(0, 4, 1) <-- offsets 0, 4 in RecordBatch 1
{code}
was (Author: paul.e.taylor):
[~bhulette] yes, it's worth revisiting how the batch lookup happens.
The reason {{get}} doesn't take a batch hint currently is that without
inspecting the loaded batches before iteration, the iterator would need to do
the batch lookup on each item anyway:
{code}
for (let index = 0; index < vector.length; ++index) {
vector.get(index, vector.findBatch(index)); // <-- hypothetical `get` that
takes a batch hint
}
{code}
Also, I measured the cost of a variant of {{get}} that returns a tuple of
{{[value, batchIndex]}}, and it was much slower than incurring the batch-lookup
cost each time. The best we could do is expose the batches, so an external
iterator could increment the batch hint based on the index being iterated:
{code}
let batches = vector.batches();
let iterated = 0, index = -1, batch = -1;
while (++index < vector.length) {
if (iterated >= index) {
iterated += batches[++batch].length;
}
vector.get(index - iterated, batch);
}
{code}
But this approach also has a number of drawbacks:
1. Forcing the iterator to do this every time isn't very ergonomic
2. It's only helpful for contiguous iterations (from {{start}} < {{index}} <
{{end}}). Random lookups will still have to reverse-lookup the batch index from
the item index each get call
The [Vector
iterators|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L78]
don't have to do the batch index lookup, since they always iterate in a
defined order:
{code}
// Always iterates from 0 -> length, faster than calling `get()` in a loop
for (let x of vector) { }
{code}
Assuming contiguous iteration, we could add {{skip()}}/{{skipLast()}} and
{{take()}}/{{takeLast()}} methods, which return a clone of the Vector with
trimmed internal batches list. This would only require {{subarray}} calls on
the first/last TypedArrays, so still zero-copy:
{code}
for (let x of vector.skip(10).take(10)) { // <-- iterate indexes 10 through 20
}
{code}
I also want to call out that {{Vector.range()}} taking a batch hint is a bit of
slight-of-hand to accommodate the
[ListVectorBase|(https://github.com/apache/arrow/blob/master/js/src/vector/list.ts#L43]
offset lookups. The ListVector {{from}} and {{to}} offsets are relative to the
RecordBatch they were introduced, so passing the batch index into the slice
method causes slice to lookup the {{from}} and {{to}} offsets relative to the
batch they're from, instead of adjusting them relative to all the batches:
{code}
[{ /* RecordBatch 0 */
offsets: [0, 5, 10],
values: "hellobrian"
}, { /* RecordBatch 1 */
offsets: [0, 4, 8],
values: "frompaul"
}]
// vector.get(2) decomposes to:
// vector.values.slice(0, 4, 1) <-- offsets 0, 4 in RecordBatch 1
{code}
> [JS] Batch hint for Vector.get
> ------------------------------
>
> Key: ARROW-1652
> URL: https://issues.apache.org/jira/browse/ARROW-1652
> Project: Apache Arrow
> Issue Type: Improvement
> Components: JavaScript
> Reporter: Brian Hulette
> Assignee: Paul Taylor
> Labels: Performance
>
> The {{Vector.get}} function just accepts an index, and looks up the
> appropriate record batch on every call. This can lead to a lot of additional
> lookups when iterating by index. It would be nice if {{Vector.get}} accepted
> an optional batch hint, similar to
> [{{Vector.range}}|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L51]
> Additionally, if {{Table}} had some knowledge of the batches in its Vectors,
> it could use this batch hint to improve performance when iterating over rows.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)