[
https://issues.apache.org/jira/browse/DRILL-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099404#comment-17099404
]
ASF GitHub Bot commented on DRILL-7730:
---------------------------------------
paul-rogers opened a new pull request #2075:
URL: https://github.com/apache/drill/pull/2075
# [DRILL-7730](https://issues.apache.org/jira/browse/DRILL-7730): Improve web query efficiency
## Description
Drill provides a REST API to run queries: `http://<host>:8047/query` and
`/query.json`. This PR improves the memory efficiency of these queries.
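
For readers unfamiliar with the endpoint, here is a minimal sketch of how a client might build a request to `/query.json`. The class and method names (`RestQuerySketch`, `buildQuery`) are hypothetical and not part of Drill. The JSON body shape (`queryType`, `query`) follows Drill's documented REST usage as best I recall it, so treat it as an assumption. The request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of Drill: builds a REST query request.
public class RestQuerySketch {

  /** Builds (but does not send) a POST to the /query.json endpoint. */
  static HttpRequest buildQuery(String host, String sql) {
    // Minimal JSON escaping for the SQL text: backslashes and double quotes only.
    String escaped = sql.replace("\\", "\\\\").replace("\"", "\\\"");
    // Assumed body shape: {"queryType": "SQL", "query": "<sql>"}
    String body = "{\"queryType\": \"SQL\", \"query\": \"" + escaped + "\"}";
    return HttpRequest.newBuilder(URI.create("http://" + host + ":8047/query.json"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }
}
```

Sending the result through `java.net.http.HttpClient` would return the query results as JSON. That response is what the server-side code discussed in this PR produces.
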
Drill runs queries as a DAG of operators, rooted on the "Screen" operator.
The Screen operator takes each output batch of the query and hands it over to a
`UserClientConnection` object. In the original design,
`UserClientConnection` corresponded to an RPC connection. So, the Screen
operator converted the vectors in the outgoing batch into a
`QueryWritableBatch` which is an ordered list of buffers ready to send via
Netty.
When the REST API was added, the simplest thing was to add a new
REST-specific version of `UserClientConnection`, called `WebUserConnection`.
Rather than sending our list of buffers off to the network, the web version
converts the buffers back into a set of value vectors using the same
deserialization code used in the Drill client. However, that deserialization
code needs the data in the form of a single large buffer. So, the REST code
copies the entire batch from the list of buffers into one large direct memory
buffer. Then it converts that back into vectors.
Clearly, all this work simply gets us back where we started: the Screen
operator has a batch of vectors, the `WebUserConnection` recreates them,
consuming lots of memory and CPU in the process. All of this work occurs in the
query thread (not the REST request thread), making the query more costly than
necessary.
So, the major part of this PR is to avoid the copy: allow the REST code to
work with the batch given to Screen.
This is done by creating a new level of indirection, the `QueryDataPackage`
class. Now, Screen simply wraps the outgoing batch of vectors in a data package
and hands that off to the `UserClientConnection`. The RPC version calls a
method which does the conversion from vectors into a list of buffers. But, the
REST version calls a different method which returns the original batch of
vectors. Voila, no more copying and no more extra direct memory overhead.
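
The indirection described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Batch`, `DataPackage`, `Connection`, `RpcConnection`, and `WebConnection` are hypothetical stand-ins for the record batch, `QueryDataPackage`, `UserClientConnection`, and its two implementations. A "vector" here is just an `int[]`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the data-package indirection.
public class DataPackageSketch {

  /** Stand-in for a batch of value vectors: one int array per column. */
  static final class Batch {
    final List<int[]> columns;
    Batch(List<int[]> columns) { this.columns = columns; }
  }

  /** Wraps the outgoing batch; conversion to buffers happens only on request. */
  static final class DataPackage {
    private final Batch batch;
    int serializations = 0; // counts how often the copying path ran

    DataPackage(Batch batch) { this.batch = batch; }

    /** RPC path: copy each vector into its own buffer, in column order. */
    List<byte[]> toBuffers() {
      serializations++;
      List<byte[]> buffers = new ArrayList<>();
      for (int[] col : batch.columns) {
        ByteBuffer buf = ByteBuffer.allocate(col.length * Integer.BYTES);
        for (int v : col) {
          buf.putInt(v);
        }
        buffers.add(buf.array());
      }
      return buffers;
    }

    /** REST path: return the original batch, with no copy. */
    Batch batch() { return batch; }
  }

  /** Stand-in for the connection that Screen hands each package to. */
  interface Connection {
    void sendData(DataPackage data);
  }

  static final class RpcConnection implements Connection {
    List<byte[]> sent;
    @Override public void sendData(DataPackage data) { sent = data.toBuffers(); }
  }

  static final class WebConnection implements Connection {
    Batch received;
    @Override public void sendData(DataPackage data) { received = data.batch(); }
  }
}
```

In this sketch the producer always does the same thing: it wraps the batch and calls `sendData`. Each consumer then picks the representation it needs, so the copying cost is paid only on the RPC path.
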
The `WebUserConnection` uses the vectors to create three on-heap structures:
a list of column names, a list of column types, and a list of maps of rows. The
rows are particularly inefficient and will be addressed in a separate PR. As it
turns out, the code that handled the column and metadata list had a bug: every
incoming batch of data would append another copy to the in-memory list,
resulting in many redundant objects. That bug is fixed in this PR.
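
To illustrate the kind of bug described above, here is a minimal sketch of a collector that records column names and types once, instead of appending them for every batch. `MetadataCollectorSketch` and `onBatch` are hypothetical names. The sketch assumes the schema does not change between batches, which the actual fix may not rely on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of collecting column metadata once per query.
public class MetadataCollectorSketch {
  private final List<String> columns = new ArrayList<>();
  private final List<String> types = new ArrayList<>();

  /** Called once per incoming batch; schema holds {name, type} pairs. */
  void onBatch(String[][] schema) {
    // Without this guard, every batch would append another copy of the
    // same names and types, which is the redundancy described above.
    // Assumption: the schema is identical for every batch of the query.
    if (!columns.isEmpty()) {
      return;
    }
    for (String[] col : schema) {
      columns.add(col[0]);
      types.add(col[1]);
    }
  }

  List<String> columns() { return columns; }

  List<String> types() { return types; }
}
```
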
The work to understand all this resulted in a "grand tour" of parts of Drill.
Much code cleanup resulted. Also, `WebUserConnection` is split into two classes
as part of the next phase (removing the on-heap buffered results).
## Documentation
N/A: the user visible behavior of Drill is unchanged (though REST queries
might be a bit faster.)
## Testing
Reran all unit tests. Though, to be fair, the test suite includes basically
no tests of the REST API. The test run instead ensured that nothing was broken
in the main RPC pathway.
> Reduce overhead of web queries displayed in HTML
> ------------------------------------------------
>
> Key: DRILL-7730
> URL: https://issues.apache.org/jira/browse/DRILL-7730
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.17.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
> Fix For: 1.18.0
>
>
> Drill provides a web console to run queries. Query results appear as HTML
> pages. Drill buffers the query results in-memory to build the page. The
> current approach has two problems (in addition to the overhead of buffering):
> * To move each batch from Screen to the REST client, we serialize all vectors
> into a single large buffer, then recreate the individual vectors.
> * The code appends column names and metadata for each batch. For a
> multi-batch query, we end up with lists that contain many copies of the same
> data.
> This change modifies the internal plumbing to transfer a record batch from
> Screen to REST without copying.