Thomas Tauber-Marshall created IMPALA-9436:
----------------------------------------------
Summary: impala-shell is very slow for large query text sizes
Key: IMPALA-9436
URL: https://issues.apache.org/jira/browse/IMPALA-9436
Project: IMPALA
Issue Type: Improvement
Components: Clients
Affects Versions: Impala 3.4.0
Reporter: Thomas Tauber-Marshall
While working on better support for large SQL queries in IMPALA-9414, I found
that impala-shell is very slow at processing large query text.
To test this, I generated a 1MB SQL file that refers to a non-existent table
(so that the time to run the query would be negligible). Running this query
file with impala-shell on my local machine takes about 20s, of which about 13s
are spent in parse_query_text(), which uses some sqlparse functions to try to
split the query text into multiple queries.
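A rough sketch of how such a test input could be produced and timed. The
helper names make_large_query() and time_call() are hypothetical (they are not
part of impala-shell), and the code assumes Python 3 for time.perf_counter().
Any splitting function, such as one built on sqlparse, can be passed in as fn.

```python
import time


def make_large_query(target_bytes=1024 * 1024, table="no_such_table"):
    """Build a single SELECT statement of at least target_bytes characters.

    The table name is made up on purpose, so the server rejects the query
    quickly and the client-side parsing cost dominates.
    """
    head = "select * from {0} where id in (".format(table)
    parts = []
    size = len(head)
    i = 0
    while size < target_bytes:
        item = str(i)
        parts.append(item)
        size += len(item) + 1  # the item plus one separating comma
        i += 1
    return head + ",".join(parts) + ");"


def time_call(fn, text):
    """Return (result, elapsed_seconds) for a single call fn(text)."""
    start = time.perf_counter()
    result = fn(text)
    return result, time.perf_counter() - start
```

For example, time_call(some_splitter, make_large_query()) gives the wall-clock
cost of one splitting pass over roughly 1MB of query text.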
This seems like an unreasonable overhead and could definitely be improved. Some
ideas for how to do that:
1. Be more clever with our use of sqlparse to get better perf. This probably
has limited value (e.g. strip_comments() already tries to be very clever but is
still pretty slow).
2. Find a different python library for sql parsing that is faster (this may not
exist).
3. Add some C++ into the shell instead of always doing everything in pure
python (not sure how easy/convenient this is to integrate with the shell
packaging)
4. Try to write our own SQL parsing code, which could be optimized for the
small number of things we actually need, e.g. we don't need full tokenization,
just splitting of multiple queries (likely to be bug-prone)
5. Do some simple hacks, such as skipping the query splitting entirely if there
isn't a ';' in the query text (this would leave some unfortunate perf cliffs,
e.g. add a ';' to a string literal in your query and suddenly everything gets
a lot slower)
6. Add an interface in Impala that allows submitting multiple queries at once,
e.g. ExecuteStatements(), which returns a list of query_ids (might be a lot of
work to modify impala-server, the parser, etc. to support this)
7. Add an interface in Impala that accepts query text, parses it, and returns
it in split form without actually executing it, which would limit the amount
of changes needed vs. option 6
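
To make ideas 4 and 5 more concrete, here is a minimal sketch of a hand-written
splitter with a fast path for text that contains no ';'. The function name
split_statements() is hypothetical, and the quoting and comment rules it
assumes (single, double and backtick quotes with backslash escapes, "--" line
comments, "/* */" block comments) are a simplification rather than Impala's
actual lexer. It illustrates the bug-prone part of idea 4: every such rule has
to be handled by hand.

```python
def split_statements(text):
    """Split text on ';' characters that lie outside quotes and comments."""
    if ";" not in text:
        # Fast path (idea 5): nothing to split, so skip the scan entirely.
        stripped = text.strip()
        return [stripped] if stripped else []

    statements = []
    start = 0  # index where the current statement begins
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in ("'", '"', "`"):
            # Skip a quoted literal or identifier, honoring backslash escapes.
            i += 1
            while i < n and text[i] != ch:
                i += 2 if text[i] == "\\" else 1
            i += 1  # step past the closing quote
        elif text.startswith("--", i):
            # Line comment: skip to just past the end of the line.
            end = text.find("\n", i)
            i = n if end == -1 else end + 1
        elif text.startswith("/*", i):
            # Block comment: skip to just past the closing "*/".
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ch == ";":
            stmt = text[start:i].strip()
            if stmt:
                statements.append(stmt)
            i += 1
            start = i
        else:
            i += 1

    tail = text[start:].strip()
    if tail:
        statements.append(tail)
    return statements
```

The scan is a single pass over the text, but a ';' inside a string literal still
forces the slow path, which is the perf cliff mentioned in idea 5. Comments are
left in the returned statements rather than stripped.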