[ 
https://issues.apache.org/jira/browse/IMPALA-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047114#comment-17047114
 ] 

Thomas Tauber-Marshall commented on IMPALA-9436:
------------------------------------------------

Seems like a reasonable option to investigate.

> impala-shell is very slow for large query text sizes
> ----------------------------------------------------
>
>                 Key: IMPALA-9436
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9436
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Clients
>    Affects Versions: Impala 3.4.0
>            Reporter: Thomas Tauber-Marshall
>            Priority: Critical
>
> In working on better support for large sql queries in IMPALA-9414, I found 
> that impala-shell is very slow at processing large query sizes.
> To test this, I generated a sql file of 1MB that refers to a non-existent 
> table (so that the time to run the query would be negligible). Running this 
> query file with impala-shell on my local machine takes about 20s, of which 
> about 13s are spent in parse_query_text(), which uses some sqlparse functions 
> to try to split the query text into multiple queries.
> This seems like an unreasonable overhead and could definitely be improved. 
> Some ideas for how to do that:
> 1. Be more clever with our use of sqlparse to get better perf. This probably 
> has limited value (eg. strip_comments() already tries to be very clever but 
> is still pretty slow)
> 2. Find a different python library for sql parsing that is faster (this may 
> not exist).
> 3. Add some C++ into the shell instead of always doing everything in pure 
> python (not sure how easy/convenient this is to integrate with the shell 
> packaging)
> 4. Try to write our own sql parsing code, which could be optimized for the 
> small number of things we actually need, eg. we don't need full 
> tokenization, just splitting of multiple queries (likely to be bug-prone)
> 5. Do some simple hacks, such as skipping the query splitting entirely if 
> there isn't a ';' in the query text (this would leave some unfortunate perf 
> cliffs, eg. add a ';' to a string literal in your query and suddenly 
> everything gets a lot slower)
> 6. Add an interface in Impala that allows submitting of multiple queries at 
> once, eg ExecuteStatements(), which returns a list of query_ids. (might be a 
> lot of work to modify impala-server, the parser, etc. to support this)
> 7. Add an interface in Impala that allows submitting of query text, then 
> parses it and returns it in split form without actually executing it, which 
> would limit the amount of changes needed vs. option 6
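
To make ideas 4 and 5 above concrete, here is a minimal sketch of a
hand-written splitter, not impala-shell's actual code. The name
split_statements() is hypothetical, and the sketch assumes only single-quoted,
double-quoted and backtick-quoted literals (with backslash escapes), "--" line
comments and "/* */" block comments need to be skipped. It makes one pass over
the text without tokenizing it, and takes the fast path from idea 5 when the
text contains no ';' at all.

```python
def split_statements(text):
    """Split SQL text on ';' outside quotes and comments (hypothetical helper)."""
    # Fast path (idea 5): without any ';' there is nothing to split.
    if ";" not in text:
        stripped = text.strip()
        return [stripped] if stripped else []

    statements = []
    start = 0  # index where the current statement begins
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in ("'", '"', "`"):
            # Skip a quoted literal; a backslash escapes the next character.
            quote = ch
            i += 1
            while i < n and text[i] != quote:
                i += 2 if text[i] == "\\" else 1
            i += 1  # step past the closing quote
        elif text.startswith("--", i):
            # A line comment runs to the end of the line.
            end = text.find("\n", i)
            i = n if end == -1 else end + 1
        elif text.startswith("/*", i):
            # A block comment runs to the matching "*/" (or end of text).
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ch == ";":
            # A top-level ';' ends the current statement.
            stmt = text[start:i].strip()
            if stmt:
                statements.append(stmt)
            i += 1
            start = i
        else:
            i += 1

    # Whatever follows the last ';' is the final statement, if non-empty.
    tail = text[start:].strip()
    if tail:
        statements.append(tail)
    return statements
```

Comments are left inside the statement they belong to rather than stripped,
and a ';' inside a string literal only costs the single scan instead of a full
parse, which softens the perf cliff mentioned in idea 5. As the description
warns, code like this is likely to be bug-prone for syntax the sketch does not
anticipate, so it would need testing against the shell's existing behavior.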



