PJ Fanning created DRILL-8096:
---------------------------------
Summary: format-excel reader: support different Shared String
implementations
Key: DRILL-8096
URL: https://issues.apache.org/jira/browse/DRILL-8096
Project: Apache Drill
Issue Type: Improvement
Components: Execution - Data Types
Reporter: PJ Fanning
One of the biggest consumers of memory and processing time when reading Excel files
is the handling of the Shared Strings Table.
excel-streaming-reader v3.3.0 supports three implementations.
I suggest that Drill use the ReadOnlySharedStringsTable as the
default.
Drill currently uses the full-featured Apache POI SharedStringsTable by default,
which requires more memory and parsing effort.
There is also a TempFileSharedStringsTable, which uses a temp file to keep the
data out of heap memory. It is still fairly fast because it is implemented
using an H2 database MVMap.
If letting users configure which implementation they want sounds
useful, I can open a PR.
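
To make the proposal concrete, here is a minimal sketch of how such an option might be parsed, assuming a string-valued setting that falls back to the read-only table when unset. The enum `SharedStringsMode`, its constants, and `fromOption` are hypothetical names chosen for illustration; they are not part of Drill or excel-streaming-reader.

```java
import java.util.Locale;

// Hypothetical option values, one per implementation mentioned above.
// These names are illustrative only, not an existing Drill or library API.
enum SharedStringsMode {
    POI_FULL,   // full-featured Apache POI table (Drill's current default)
    READ_ONLY,  // read-only table (the suggested new default)
    TEMP_FILE;  // temp-file-backed table that keeps strings off the heap

    // Parses a user-supplied option value; unset or blank means READ_ONLY.
    static SharedStringsMode fromOption(String value) {
        if (value == null || value.trim().isEmpty()) {
            return READ_ONLY;
        }
        // Accept "temp-file", "TEMP_FILE", " temp_file " and similar spellings.
        String normalized = value.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        try {
            return valueOf(normalized);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Unknown shared strings implementation: " + value, e);
        }
    }
}
```

The reader would then pick the matching table implementation based on the parsed value. Rejecting unknown values rather than silently falling back should make a mistyped setting visible to the user.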