Ben Schmidt created ARROW-10220:
-----------------------------------

             Summary: Cache javascript utf-8 dictionary keys?
                 Key: ARROW-10220
                 URL: https://issues.apache.org/jira/browse/ARROW-10220
             Project: Apache Arrow
          Issue Type: Improvement
          Components: JavaScript
    Affects Versions: 1.0.1
            Reporter: Ben Schmidt


String decoding from Arrow tables is a major bottleneck when using Arrow in 
JavaScript: it can take a second to decode a million rows. For plain utf-8 
types, I'm not sure what could be done, but some memoization would help utf-8 
dictionary types.

Currently, the JavaScript implementation decodes a utf-8 string every time you 
request an item from a dictionary with utf-8 data. If Arrow cached the decoded 
strings in a native JS Map, routine operations like looping over all the 
entries in a text column might be on the order of 10x faster. Here's an 
Observable notebook [benchmarking that and a couple of other 
strategies|https://observablehq.com/@bmschmidt/faster-arrow-dictionary-unpacking].
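
To make the proposal concrete, here is a minimal sketch of the memoization idea. It is not the Arrow JS implementation and does not use the Arrow API: `Utf8Dictionary`, `CachedDictionaryColumn`, `buildDictionary`, and the `decodeCount` counter are all hypothetical names invented for illustration, and the buffer layout (concatenated bytes plus an offsets array) is only assumed to resemble how a utf-8 dictionary is stored.

```typescript
// Hypothetical stand-in for a utf-8 dictionary; NOT the Arrow JS API.
interface Utf8Dictionary {
  data: Uint8Array;    // concatenated utf-8 bytes of the unique strings
  offsets: Int32Array; // entry k occupies data[offsets[k] .. offsets[k + 1])
}

// Helper (assumption, for the example only): pack strings into the layout above.
function buildDictionary(strings: string[]): Utf8Dictionary {
  const encoder = new TextEncoder();
  const encoded = strings.map((s) => encoder.encode(s));
  const offsets = new Int32Array(strings.length + 1);
  encoded.forEach((bytes, k) => {
    offsets[k + 1] = offsets[k] + bytes.length;
  });
  const data = new Uint8Array(offsets[strings.length]);
  encoded.forEach((bytes, k) => data.set(bytes, offsets[k]));
  return { data, offsets };
}

// A dictionary-encoded column that decodes each dictionary entry at most once.
class CachedDictionaryColumn {
  private readonly dictionary: Utf8Dictionary;
  private readonly indices: Int32Array;
  private readonly decoder = new TextDecoder("utf-8");
  private readonly cache = new Map<number, string>();
  decodeCount = 0; // number of real utf-8 decodes performed so far

  constructor(dictionary: Utf8Dictionary, indices: Int32Array) {
    this.dictionary = dictionary;
    this.indices = indices;
  }

  get length(): number {
    return this.indices.length;
  }

  get(row: number): string {
    const key = this.indices[row];
    let value = this.cache.get(key);
    if (value === undefined) {
      // Cache miss: decode the bytes once and remember the resulting string.
      const { data, offsets } = this.dictionary;
      value = this.decoder.decode(data.subarray(offsets[key], offsets[key + 1]));
      this.decodeCount++;
      this.cache.set(key, value);
    }
    return value;
  }
}
```

In this sketch, looping over a column of a million rows that point into a dictionary of a few thousand entries would perform only a few thousand decodes; the rest are Map lookups. The cost is that the cache holds one JS string per dictionary entry that has been read, in addition to the typed arrays.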

I would file a pull request, but 1) I would have to learn some TypeScript to do 
so, and 2) this idea may be undesirable because it creates new objects that 
increase the memory footprint of a table, rather than relying only on the 
typed arrays.

Some discussion of how these issues affect the Arquero project in practice is 
[here|https://github.com/uwdata/arquero/issues/1].

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
