[
https://issues.apache.org/jira/browse/ARROW-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dominik Moritz resolved ARROW-10220.
------------------------------------
Fix Version/s: 7.0.0
Resolution: Implemented
Done in https://github.com/apache/arrow/pull/10371
> [JS] Cache javascript utf-8 dictionary keys?
> --------------------------------------------
>
> Key: ARROW-10220
> URL: https://issues.apache.org/jira/browse/ARROW-10220
> Project: Apache Arrow
> Issue Type: Improvement
> Components: JavaScript
> Affects Versions: 1.0.1
> Reporter: Ben Schmidt
> Assignee: Dominik Moritz
> Priority: Minor
> Labels: pull-request-available
> Fix For: 7.0.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> String decoding from Arrow tables is a major bottleneck when using Arrow in
> JavaScript: it can take a second to decode a million rows. For plain UTF-8
> types, I'm not sure what could be done, but some memoization would help UTF-8
> dictionary types.
> Currently, the JavaScript implementation decodes a UTF-8 string every time
> you request an item from a dictionary with UTF-8 data. If Arrow cached the
> decoded strings in a native JS Map, routine operations like looping over all
> the entries in a text column might be on the order of 10x faster. Here's an
> Observable notebook [benchmarking that and a couple of other
> strategies|https://observablehq.com/@bmschmidt/faster-arrow-dictionary-unpacking].
> I would file a pull request, but 1) I would have to learn some TypeScript to
> do so, and 2) this idea may be undesirable because it creates new objects
> that will increase the memory footprint of a table, rather than just using
> the typed arrays.
> Some discussion of how these issues affect the Arquero project in practice
> is [here|https://github.com/uwdata/arquero/issues/1].
>
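
The memoization proposed in the issue can be sketched roughly as follows.
This is a minimal TypeScript illustration under stated assumptions, not the
code merged in the pull request and not the Arrow JS API: the classes
Utf8Dictionary and MemoizedUtf8Dictionary and the helper buildDictionary are
hypothetical names that model a dictionary as one UTF-8 byte buffer plus
offsets.

```typescript
// Hypothetical stand-in for a dictionary of UTF-8 values: one byte buffer
// plus offsets. It only mimics the general layout; it is not Arrow JS code.
class Utf8Dictionary {
  private readonly decoder = new TextDecoder("utf-8");
  private readonly data: Uint8Array;
  private readonly offsets: Int32Array;
  decodeCount = 0; // counts how many times bytes were actually decoded

  constructor(data: Uint8Array, offsets: Int32Array) {
    this.data = data;
    this.offsets = offsets;
  }

  // Uncached path: decodes the bytes for one dictionary key on every call.
  get(key: number): string {
    this.decodeCount++;
    const start = this.offsets[key];
    const end = this.offsets[key + 1];
    return this.decoder.decode(this.data.subarray(start, end));
  }
}

// Wraps a dictionary and remembers each decoded string in a native Map,
// so every distinct key is decoded at most once.
class MemoizedUtf8Dictionary {
  private readonly cache = new Map<number, string>();
  private readonly inner: Utf8Dictionary;

  constructor(inner: Utf8Dictionary) {
    this.inner = inner;
  }

  get(key: number): string {
    let value = this.cache.get(key);
    if (value === undefined) {
      // Cache miss: decode once and store the resulting JS string.
      value = this.inner.get(key);
      this.cache.set(key, value);
    }
    return value;
  }
}

// Hypothetical helper that packs strings into the buffer-plus-offsets form.
function buildDictionary(values: string[]): Utf8Dictionary {
  const encoder = new TextEncoder();
  const chunks = values.map((v) => encoder.encode(v));
  const offsets = new Int32Array(values.length + 1);
  let total = 0;
  chunks.forEach((chunk, i) => {
    total += chunk.length;
    offsets[i + 1] = total;
  });
  const data = new Uint8Array(total);
  chunks.forEach((chunk, i) => data.set(chunk, offsets[i]));
  return new Utf8Dictionary(data, offsets);
}

// Usage: a column is a list of keys into the dictionary.
const dictionary = buildDictionary(["apple", "banana", "cherry"]);
const memoized = new MemoizedUtf8Dictionary(dictionary);
const column = [0, 1, 2, 1, 0, 2, 2].map((key) => memoized.get(key));
```

The trade-off raised in the issue applies to this sketch as well: the Map
holds one JS string per distinct key that has been read, so the extra memory
grows with the number of distinct values rather than the number of rows.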