[
https://issues.apache.org/jira/browse/IMPALA-4268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677323#comment-16677323
]
ASF subversion and git services commented on IMPALA-4268:
---------------------------------------------------------
Commit b84075f7f0067c54a340872e5ec1814398310318 in impala's branch
refs/heads/master from [[email protected]]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=b84075f ]
IMPALA-7477: Batch-oriented query set construction
Rework the row-by-row construction of query result sets in PlanRootSink
so that it materialises an output column at a time. Make some minor
optimisations like preallocating output vectors and initialising
strings more efficiently.
My intent is both to make this faster and to make the QueryResultSet
interface better before IMPALA-4268 does a bunch of surgery on this
part of the code.
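
For illustration only, the column-at-a-time idea can be sketched as below. This is not the code from the commit: `Row`, `ColumnarResult` and `AddBatch` are hypothetical names, and the two-column schema is an assumption made to keep the example small.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical row layout; not Impala's actual tuple representation.
struct Row {
  int64_t id;
  std::string name;
};

// Hypothetical column-oriented result set: one output vector per column.
struct ColumnarResult {
  std::vector<int64_t> ids;
  std::vector<std::string> names;
};

// Appends a whole batch one column at a time instead of row by row.
// Each output vector is grown once per batch before it is filled.
void AddBatch(const std::vector<Row>& batch, ColumnarResult* out) {
  assert(out != nullptr);

  out->ids.reserve(out->ids.size() + batch.size());
  for (const Row& r : batch) out->ids.push_back(r.id);

  out->names.reserve(out->names.size() + batch.size());
  for (const Row& r : batch) {
    // Construct each string in place from pointer and length.
    out->names.emplace_back(r.name.data(), r.name.size());
  }
}
```

Each inner loop touches a single output vector, which is the property the batch-oriented rework is after.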
Testing:
Ran core tests.
Perf:
Downloaded tpch_parquet.orders via JDBC driver.
Before: 3.01s, After: 2.57s.
Downloaded l_orderkey from tpch_parquet.lineitem.
Before: 1.21s, After: 1.08s.
Change-Id: I764fa302842438902cd5db2551ec6e3cb77b6874
Reviewed-on: http://gerrit.cloudera.org:8080/11879
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> buffer more than a batch of rows at coordinator
> -----------------------------------------------
>
> Key: IMPALA-4268
> URL: https://issues.apache.org/jira/browse/IMPALA-4268
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 2.8.0
> Reporter: Henry Robinson
> Assignee: Bikramjeet Vig
> Priority: Major
> Labels: query-lifecycle, resource-management
> Attachments: rows-produced-histogram.png
>
>
> In IMPALA-2905, we are introducing a {{PlanRootSink}} that handles the
> production of output rows at the root of a plan.
> The implementation in IMPALA-2905 has the plan execute in a separate thread
> to the consumer, which calls {{GetNext()}} to retrieve the rows. However, the
> sender thread will block until {{GetNext()}} is called, so that there are no
> complications about memory usage and ownership due to having several batches
> in flight at one time.
> However, this also leads to many context switches, as each {{GetNext()}} call
> yields to the sender to produce the rows. If the sender were to fill a buffer
> asynchronously, the consumer could pull out of that buffer without taking a
> context switch in many cases (and the extra buffering might smooth out any
> performance spikes due to client delays, which currently directly affect plan
> execution).
> The tricky part is managing the mismatch between the size of the row batches
> processed in {{Send()}} and the size of the fetch result asked for by the
> client. The sender materializes output rows in a {{QueryResultSet}} that is
> owned by the coordinator. That is not, currently, a splittable object -
> instead it contains the actual RPC response struct that will hit the wire
> when the RPC completes. An asynchronous sender cannot know the batch size,
> which may change on every fetch call. So the {{GetNext()}} implementation
> would need to be able to split out the {{QueryResultSet}} to match the
> correct fetch size, and handle stitching together other {{QueryResultSets}} -
> without doing extra copies.
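
A minimal sketch of the split-and-stitch problem described above, under stated assumptions: `BatchQueue` and `Row` are hypothetical names, rows are modelled as plain strings rather than a real {{QueryResultSet}}, and all thread synchronisation between sender and consumer is left out. It only shows how a fetch of arbitrary size could be served from buffered batches of other sizes by moving rows rather than copying them.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one materialised output row.
using Row = std::string;

// Single-threaded sketch: buffers whole batches from the sender and serves
// fetches of any size. A real implementation would need locking and a
// bound on buffered memory.
class BatchQueue {
 public:
  // Sender side: enqueue a batch of whatever size the plan produced.
  void Send(std::vector<Row> batch) {
    if (!batch.empty()) batches_.push_back(std::move(batch));
  }

  // Consumer side: return up to fetch_size rows. The front batch is split
  // when it holds more rows than requested, and later batches are stitched
  // on when it holds fewer. Rows are moved out, not copied.
  std::vector<Row> GetNext(std::size_t fetch_size) {
    std::vector<Row> out;
    while (out.size() < fetch_size && !batches_.empty()) {
      std::vector<Row>& front = batches_.front();
      std::size_t take =
          std::min(fetch_size - out.size(), front.size() - offset_);
      for (std::size_t i = 0; i < take; ++i) {
        out.push_back(std::move(front[offset_ + i]));
      }
      offset_ += take;
      if (offset_ == front.size()) {
        // Front batch fully consumed; drop it and reset the read offset.
        batches_.pop_front();
        offset_ = 0;
      }
    }
    return out;
  }

 private:
  std::deque<std::vector<Row>> batches_;
  std::size_t offset_ = 0;  // Next unread row in the front batch.
};
```

With batches of three and two rows buffered, successive fetches of size two would return two rows from the first batch, then one row from each batch, then the single remaining row.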