[
https://issues.apache.org/jira/browse/IMPALA-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051761#comment-17051761
]
Tim Armstrong commented on IMPALA-9436:
---------------------------------------
IMPALA-7939 is an example of the other issues that fall out of the ad-hoc SQL
parsing in the shell. There the shell tries to determine whether a query is a
DML statement using DML_REGEX and sqlparse. It works OK in practice, but IIRC
we'd talked about similar solutions at the time (like exposing more
information from the server about the type of the query).
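
To make the fragility concrete, here is a rough sketch of a first-keyword DML
check. It is not the shell's actual DML_REGEX or its sqlparse logic; the
pattern, the keyword set and the name looks_like_dml are all assumptions made
up for illustration.

```python
import re

# Hypothetical pattern: skips leading whitespace, "--" line comments and
# "/* */" block comments. This is NOT the shell's real DML_REGEX.
_LEADING_COMMENTS = re.compile(
    r"\s*(?:(?:--[^\n]*(?:\n|$)|/\*.*?\*/)\s*)*", re.DOTALL)

# Assumed keyword set for this sketch.
_DML_KEYWORDS = {"insert", "upsert", "update", "delete"}


def looks_like_dml(query_text):
    """Guess whether a statement is DML from its first keyword only."""
    start = _LEADING_COMMENTS.match(query_text).end()
    word = re.match(r"[A-Za-z_]+", query_text[start:])
    return bool(word) and word.group(0).lower() in _DML_KEYWORDS


print(looks_like_dml("-- load\nINSERT INTO t VALUES (1)"))
print(looks_like_dml("with v as (select 1) insert into t select * from v"))
```

The second call returns False even though the statement inserts rows, because
this sketch only looks at the first keyword. That is the kind of gap
client-side guessing leaves, and the kind the server could close by reporting
the query type itself.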
> impala-shell is very slow for large query text sizes
> ----------------------------------------------------
>
> Key: IMPALA-9436
> URL: https://issues.apache.org/jira/browse/IMPALA-9436
> Project: IMPALA
> Issue Type: Improvement
> Components: Clients
> Affects Versions: Impala 3.4.0
> Reporter: Thomas Tauber-Marshall
> Priority: Critical
>
> In working on better support for large sql queries in IMPALA-9414, I found
> that impala-shell is very slow at processing large query sizes.
> To test this, I generated a sql file of 1MB that refers to a non-existent
> table (so that the time to run the query would be negligible). Running this
> query file with impala-shell on my local machine takes about 20s, of which
> about 13s are spent in parse_query_text(), which uses some sqlparse functions
> to try to split the query text into multiple queries.
> This seems like an unreasonable overhead and could definitely be improved.
> Some ideas for how to do that:
> 1. Be more clever with our use of sqlparse to get better perf. This probably
> has limited value (eg. strip_comments() already tries to be very clever but
> is still pretty slow)
> 2. Find a different python library for sql parsing that is faster (this may
> not exist).
> 3. Add some C++ into the shell instead of always doing everything in pure
> python (not sure how easy/convenient this is to integrate with the shell
> packaging)
> 4. Try to write our own sql parsing code, which could be optimized for the
> small number of things we actually need, eg. we don't need full
> tokenization, just splitting of multiple queries (likely to be bug-prone)
> 5. Do some simple hacks, such as skipping the query splitting entirely if
> there isn't a ';' in the query text (this would leave some unfortunate perf
> cliffs, eg. add a ';' to a string literal in your query and suddenly
> everything gets a lot slower)
> 6. Add an interface in Impala that allows submitting of multiple queries at
> once, eg ExecuteStatements(), which returns a list of query_ids. (might be a
> lot of work to modify impala-server, the parser, etc. to support this)
> 7. Add an interface in Impala that allows submitting of query text, then
> parses it and returns it in split form without actually executing it, which
> would limit the amount of changes needed vs. option 6
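
For reference, options 4 and 5 could be combined roughly as below: a single
pass that splits on ';' only outside quotes and comments, with a fast path
when the text has no ';' at all. This is a minimal sketch, not what
parse_query_text() does today. The name split_statements and the quoting
rules (backslash escapes, ', " and ` as quote characters) are assumptions, and
it makes no attempt to cover every corner of Impala's SQL grammar.

```python
def split_statements(text):
    """Split text on ';' found outside quotes and comments (sketch only)."""
    if ";" not in text:
        # Option 5 fast path: nothing to split.
        stripped = text.strip()
        return [stripped] if stripped else []

    statements = []
    start = 0  # index where the current statement begins
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in ("'", '"', "`"):
            # Skip to the matching quote, honoring backslash escapes
            # (assumed escaping rule for this sketch).
            i += 1
            while i < n and text[i] != ch:
                if text[i] == "\\":
                    i += 1
                i += 1
            i += 1
        elif text.startswith("--", i):
            # Line comment: skip to the end of the line.
            end = text.find("\n", i)
            i = n if end == -1 else end + 1
        elif text.startswith("/*", i):
            # Block comment: skip past the closing "*/".
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ch == ";":
            stmt = text[start:i].strip()
            if stmt:
                statements.append(stmt)
            i += 1
            start = i
        else:
            i += 1

    tail = text[start:].strip()
    if tail:
        statements.append(tail)
    return statements


print(split_statements("select ';' from t; select 2;"))
```

A ';' inside a string literal or comment no longer forces a full parse, so the
perf cliff described in option 5 goes away, but the bug-prone caveat from
option 4 still applies: unterminated quotes, nested comments and the like
would each need their own handling and tests.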