[ 
https://issues.apache.org/jira/browse/ARROW-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197919#comment-16197919
 ] 

Brian Hulette commented on ARROW-1652:
--------------------------------------

[~paul.e.taylor] all good points. My primary goal with this ticket (and 
ARROW-1651) is just to improve performance when iterating over multiple vectors 
simultaneously using a {{Table}}. Currently, the {{*rows}} iterator just defers 
to each Vector's {{get(i)}} function for every index, which makes for a lot of 
batch lookups when scanning all of the data. The Vector iterators don't help 
since they can't be used simultaneously across multiple Vectors (or can they? 
is there something like Python's {{zip}} that we could use?)
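
For reference, a {{zip}}-style helper is easy to write with generators. The sketch below is only an illustration: {{zip}} is a hypothetical helper, not something in the Arrow JS library, and plain arrays stand in for Vector iterators.

```typescript
// Hypothetical helper: advance several iterables in lockstep, like Python's zip.
function* zip<T>(...iterables: Iterable<T>[]): IterableIterator<T[]> {
    const iterators = iterables.map((it) => it[Symbol.iterator]());
    if (iterators.length === 0) { return; }
    while (true) {
        const row: T[] = [];
        for (const iterator of iterators) {
            const result = iterator.next();
            // Stop as soon as the shortest input is exhausted.
            if (result.done) { return; }
            row.push(result.value);
        }
        yield row;
    }
}

// Plain arrays stand in for Vector iterators here.
const ids = [1, 2, 3];
const scores = [10, 20, 30];
const zipped = Array.from(zip(ids, scores));
```

This still calls {{next()}} on every iterator for every row, so whether it is actually faster than calling {{get(i)}} on each Vector would need to be measured.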

If the {{Table}} required that each of its vectors have the same batches, and 
stored that in a {{batches}} array, 
[{{*rows}}|https://github.com/apache/arrow/blob/master/js/src/table.ts#L50] 
could just iterate over each batch, and then over each index within it.
{code}
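*rows() {
    // Assumes each entry of this.batches exposes its row count as `length`.
    for (const batch of this.batches) {
        for (let idx = 0; idx < batch.length; ++idx) {
            // Pass the batch as a hint so get() can skip the batch lookup.
            yield this._columns.map((c) => c.get(idx, batch));
        }
    }
}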
{code}
Obviously there would need to be bounds-checking for {{startRow}} and 
{{endRow}} as well, but that wouldn't be hard to implement by tracking an 
overall index.
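
A minimal sketch of that bounds check, written as a standalone generator rather than a {{Table}} method. The {{Column}} interface, the {{batchLengths}} array, and the exclusive {{endRow}} are all assumptions made for illustration, not the current API.

```typescript
// Hypothetical shape: a column only needs the proposed two-argument
// get(indexWithinBatch, batchIndex).
interface Column<T> {
    get(idx: number, batch: number): T;
}

function* rows<T>(
    columns: Column<T>[],
    batchLengths: number[],
    startRow = 0,
    endRow = Infinity, // exclusive upper bound (an assumption)
): IterableIterator<T[]> {
    let overall = 0; // overall index of the first row in the current batch
    for (let batch = 0; batch < batchLengths.length; ++batch) {
        const length = batchLengths[batch];
        // Clamp the requested row range to this batch.
        const first = Math.max(startRow - overall, 0);
        const last = Math.min(endRow - overall, length);
        for (let idx = first; idx < last; ++idx) {
            yield columns.map((c) => c.get(idx, batch));
        }
        overall += length;
        if (overall >= endRow) { return; }
    }
}

// Nested arrays stand in for batched vectors.
const makeColumn = (chunks: number[][]): Column<number> => ({
    get: (idx, batch) => chunks[batch][idx],
});
const colA = makeColumn([[0, 1, 2], [3, 4]]);
const colB = makeColumn([[10, 11, 12], [13, 14]]);
const selected = Array.from(rows([colA, colB], [3, 2], 2, 4));
```

Here {{selected}} covers overall rows 2 and 3, which straddle the boundary between the two batches.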

The batch hint could be an optional parameter on {{Vector.get(..)}}. That way 
no one is forced to use it, and random access would be more intuitive. Or it 
could be an entirely separate function {{Vector.getFromBatch(i, batch)}}.
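
Roughly what I have in mind, as a sketch only. {{ChunkedVector}} is a hypothetical stand-in for a batched Vector, and it assumes that when a hint is given the index is relative to that batch.

```typescript
// Hypothetical stand-in for a Vector backed by several record batches.
class ChunkedVector<T> {
    private readonly chunks: T[][];
    // Global index of the first row of each batch.
    private readonly offsets: number[] = [];

    constructor(chunks: T[][]) {
        this.chunks = chunks;
        let offset = 0;
        for (const chunk of chunks) {
            this.offsets.push(offset);
            offset += chunk.length;
        }
    }

    // Without a hint, i is a global index and the batch has to be searched for.
    // With a hint, i is an index within that batch and the search is skipped.
    get(i: number, batch?: number): T | undefined {
        if (batch !== undefined) {
            return this.getFromBatch(i, batch);
        }
        // Walk back from the last batch until one starts at or before i.
        let found = this.offsets.length - 1;
        while (found > 0 && this.offsets[found] > i) { --found; }
        return this.getFromBatch(i - this.offsets[found], found);
    }

    getFromBatch(i: number, batch: number): T | undefined {
        const chunk = this.chunks[batch];
        return chunk === undefined ? undefined : chunk[i];
    }
}

const vector = new ChunkedVector([["a", "b"], ["c"]]);
const viaSearch = vector.get(2);  // global index 2, batch is looked up
const viaHint = vector.get(0, 1); // row 0 of batch 1, no lookup
```

Both calls return the same element; only the hinted one skips the batch search.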

My end goal here is just improving performance when iterating over multiple 
vectors, so if anyone has other ideas on how to do that I'd be happy to ditch 
this idea. Maybe there's some way to use multiple Vector iterators 
simultaneously that I'm missing?

> [JS] Batch hint for Vector.get
> ------------------------------
>
>                 Key: ARROW-1652
>                 URL: https://issues.apache.org/jira/browse/ARROW-1652
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: JavaScript
>            Reporter: Brian Hulette
>            Assignee: Paul Taylor
>              Labels: Performance
>
> The {{Vector.get}} function just accepts an index, and looks up the 
> appropriate record batch on every call. This can lead to a lot of additional 
> lookups when iterating by index. It would be nice if {{Vector.get}} accepted 
> an optional batch hint, similar to 
> [{{Vector.range}}|https://github.com/apache/arrow/blob/master/js/src/vector/typed.ts#L51].
> Additionally, if {{Table}} had some knowledge of the batches in its Vectors, 
> it could use this batch hint to improve performance when iterating over rows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
