Wes McKinney created ARROW-3408:
-----------------------------------
Summary: [C++] Add option to CSV reader to dictionary encode
individual columns or all string / binary columns
Key: ARROW-3408
URL: https://issues.apache.org/jira/browse/ARROW-3408
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Wes McKinney
Fix For: 0.12.0
For many datasets, dictionary encoding everything can yield drastically
lower memory usage and, in turn, better performance for analytics.
One difficulty with dictionary encoding in multithreaded conversions is that
you ideally end up with a single dictionary at the end. That leaves two options:
* Implement a concurrent hashing scheme -- for low-cardinality dictionaries,
the overhead associated with mutex contention will not be meaningful; for
high-cardinality ones it can be more of a problem
* Hash each chunk separately, then normalize at the end
My guess is that a crude concurrent hash table, with a mutex protecting
mutations and resizes, will outperform the latter approach.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)