[ 
https://issues.apache.org/jira/browse/ARROW-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197777#comment-16197777
 ] 

Paul Taylor edited comment on ARROW-1652 at 10/9/17 10:26 PM:
--------------------------------------------------------------

[~bhulette] yes, it's worth revisiting how the batch lookup happens.

The reason {{get}} doesn't take a batch hint currently is that without 
inspecting the loaded batches before iteration, the iterator would need to do 
the batch lookup on each item anyway:

{code}
for (let index = 0; index < vector.length; ++index) {
  vector.get(index, vector.findBatch(index)); // <-- hypothetical `get` that 
takes a batch hint
}
{code}

Also, I measured the cost of a variant of {{get}} that returns a tuple of 
{{[value, batchIndex]}}, and it was much slower than incurring the batch-lookup 
cost each time. The best we could do is expose the batches, so an external 
iterator could increment the batch hint based on the index being iterated:

{code}
let batches = vector.batches();
let iterated = 0, index = -1, batch = -1;
while (++index < vector.length) {
  if (iterated >= index) {
    iterated += batches[++batch].length;
  }
  vector.get(index - iterated, batch);
}
{code}

But this approach also has a number of drawbacks:

1. Forcing the iterator to do this every time isn't very ergonomic
2. It's only helpful for contiguous iterations (from {{start}} < {{index}} < 
{{end}}). Random lookups will still have to reverse-lookup the batch index from 
the item index each get call

The [Vector 
iterators|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L78]
 don't have to do the batch index lookup, since they always iterate in a 
defined order:

{code}
// Always iterates from 0 -> length, faster than calling `get()` in a loop
for (let x of vector) { }
{code}

Assuming contiguous iteration, we could add {{skip()}}/{{skipLast()}} and 
{{take()}}/{{takeLast()}} methods, which return a clone of the Vector with 
trimmed internal batches list. This would only require {{subarray}} calls on 
the first/last TypedArrays, so still zero-copy:

{code}
for (let x of vector.skip(10).take(10)) { // <-- iterate indexes 10 through 20
}
{code}

I also want to call out that {{Vector.range()}} taking a batch hint is a bit of 
slight-of-hand to accommodate the 
[ListVectorBase|(https://github.com/apache/arrow/blob/master/js/src/vector/list.ts#L43]
 offset lookups. The ListVector {{from}} and {{to}} offsets are relative to the 
RecordBatch they were introduced, so passing the batch index into the slice 
method causes slice to lookup the {{from}} and {{to}} offsets relative to the 
batch they're from, instead of adjusting them relative to all the batches:

{code}
[{   /* RecordBatch 0 */
  offsets: [0, 5, 10],
  values: "hellobrian"
}, { /* RecordBatch 1 */
  offsets: [0, 4, 8],
  values: "frompaul"
}]
// vector.get(2) decomposes to:
// vector.values.slice(0, 4, 1) <-- offsets 0, 4 in RecordBatch 1
{code}


was (Author: paul.e.taylor):
[~bhulette] yes, it's worth revisiting how the batch lookup happens.

The reason {{get}} doesn't take a batch hint currently is that without 
inspecting the loaded batches before iteration, the iterator would need to do 
the batch lookup on each item anyway:

{code}
for (let index = 0; index < vector.length; ++index) {
  vector.get(index, vector.findBatch(index)); // <-- hypothetical `get` that 
takes a batch hint
}
{code}

Also, I measured the cost of a variant of {{get}} that returns a tuple of 
{{[value, batchIndex]}}, and it was much slower than incurring the batch-lookup 
cost each time. The best we could do is expose the batches, so an external 
iterator could increment the batch hint based on the index being iterated:

{code}
let batches = vector.batches();
let iterated = 0, index = -1, batch = -1;
while (++index < vector.length) {
  if (iterated >= index) {
    iterated += batches[++batch].length;
  }
  vector.get(iterated - index, batch);
}
{code}

But this approach also has a number of drawbacks:

1. Forcing the iterator to do this every time isn't very ergonomic
2. It's only helpful for contiguous iterations (from {{start}} < {{index}} < 
{{end}}). Random lookups will still have to reverse-lookup the batch index from 
the item index each get call

The [Vector 
iterators|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L78]
 don't have to do the batch index lookup, since they always iterate in a 
defined order:

{code}
// Always iterates from 0 -> length, faster than calling `get()` in a loop
for (let x of vector) { }
{code}

Assuming contiguous iteration, we could add {{skip()}}/{{skipLast()}} and 
{{take()}}/{{takeLast()}} methods, which return a clone of the Vector with 
trimmed internal batches list. This would only require {{subarray}} calls on 
the first/last TypedArrays, so still zero-copy:

{code}
for (let x of vector.skip(10).take(10)) { // <-- iterate indexes 10 through 20
}
{code}

I also want to call out that {{Vector.range()}} taking a batch hint is a bit of 
slight-of-hand to accommodate the 
[ListVectorBase|(https://github.com/apache/arrow/blob/master/js/src/vector/list.ts#L43]
 offset lookups. The ListVector {{from}} and {{to}} offsets are relative to the 
RecordBatch they were introduced, so passing the batch index into the slice 
method causes slice to lookup the {{from}} and {{to}} offsets relative to the 
batch they're from, instead of adjusting them relative to all the batches:

{code}
[{   /* RecordBatch 0 */
  offsets: [0, 5, 10],
  values: "hellobrian"
}, { /* RecordBatch 1 */
  offsets: [0, 4, 8],
  values: "frompaul"
}]
// vector.get(2) decomposes to:
// vector.values.slice(0, 4, 1) <-- offsets 0, 4 in RecordBatch 1
{code}

> [JS] Batch hint for Vector.get
> ------------------------------
>
>                 Key: ARROW-1652
>                 URL: https://issues.apache.org/jira/browse/ARROW-1652
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: JavaScript
>            Reporter: Brian Hulette
>            Assignee: Paul Taylor
>              Labels: Performance
>
> The {{Vector.get}} function just accepts an index, and looks up the 
> appropriate record batch on every call. This can lead to a lot of additional 
> lookups when iterating by index. It would be nice if {{Vector.get}} accepted 
> an optional batch hint, similar to 
> [{{Vector.range}}|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L51]
> Additionally, if {{Table}} had some knowledge of the batches in its Vectors, 
> it could use this batch hint to improve performance when iterating over rows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to