[ 
https://issues.apache.org/jira/browse/DRILL-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269869#comment-17269869
 ] 

ASF GitHub Bot commented on DRILL-7733:
---------------------------------------

paul-rogers opened a new pull request #2149:
URL: https://github.com/apache/drill/pull/2149


   
   
   
   
   # [DRILL-7733](https://issues.apache.org/jira/browse/DRILL-7733): Use streaming for REST JSON queries
   
   ## Description
   
   Modifies the REST API to stream JSON query results rather than buffering the 
entire result set in memory, as was previously required. The buffering limited 
the size of the result set that could be returned through the REST API: large 
queries would run out of memory. With the streaming solution, data is fed 
directly from the query result to a JSON encoder and then back to the HTTP 
client with no buffering.
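
   As a rough illustration of the idea (not the code in this PR), the sketch 
below writes each row to the output as soon as it is available instead of 
collecting all rows first. The class name, the method names, the `Map`-based 
row and the `rows` wrapper are assumptions made for this example only; the PR 
itself reuses Drill's existing JSON writer on top of the result set mechanism.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and row representation are not Drill's.
public class StreamingJsonSketch {

  /** Writes rows one at a time; memory use does not grow with the row count. */
  static void streamRows(Iterator<Map<String, String>> rows, Writer out)
      throws IOException {
    out.write("{\"rows\":[");
    boolean first = true;
    while (rows.hasNext()) {
      if (!first) {
        out.write(",");
      }
      first = false;
      writeRow(rows.next(), out);
      out.flush(); // push this row toward the client before reading the next
    }
    out.write("]}");
    out.flush();
  }

  /** Writes a single row as a JSON object, preserving the map's key order. */
  static void writeRow(Map<String, String> row, Writer out) throws IOException {
    out.write("{");
    boolean first = true;
    for (Map.Entry<String, String> e : row.entrySet()) {
      if (!first) {
        out.write(",");
      }
      first = false;
      out.write(quote(e.getKey()));
      out.write(":");
      out.write(e.getValue() == null ? "null" : quote(e.getValue()));
    }
    out.write("}");
  }

  /** Minimal JSON string escaping for the sketch. */
  static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (char c : s.toCharArray()) {
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.append('"').toString();
  }

  public static void main(String[] args) throws IOException {
    Map<String, String> row = new LinkedHashMap<>();
    row.put("name", "drill");
    row.put("version", "1.19.0");
    StringWriter out = new StringWriter(); // stands in for the HTTP response
    streamRows(List.of(row).iterator(), out);
    System.out.println(out);
  }
}
```

   In a real server the `Writer` would wrap the HTTP response stream, so the 
only data held in memory at any time is the row currently being encoded.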
   
   Note that Drill has historically put the result schema *after* data. The 
reasoning was likely that the query schema can change many times during a query 
run (with different fragments returning batches with differing schemas.) The 
schema-at-end model allows the schemas to be merged.
   
   However, with streaming, the schema-at-end model forces the client to buffer 
the entire result set if the client needs the schema. A good improvement would 
be to send the (first batch) schema *before* the data. Drill would somehow have 
to deal with schema changes. As it turns out, ODBC and JDBC clients send the 
schema before data and thus suffer from the same schema-change problem 
described here. We've avoided having to address the ODBC/JDBC issue, so maybe 
it won't be a problem in practice for the REST API if we send the first batch 
schema before data. In any event, that would be a (simple) separate enhancement.
   
   Refactors the existing JSON writer to work with the result set mechanism, 
which is then used as the implementation for streaming.
   
   Refactors the internals of the REST API to allow for traditional "batch" 
responses and the new streaming responses.
   
   Revises the date/time methods for the row set API to use Java classes rather 
than Joda. This is required to integrate properly with the JSON writer. The 
Joda Period class remains, as there is no Java equivalent. Most of the changed 
files, in fact, are for this date/time change.
   
   A recent PR added get/set float methods to the row set API. This change was 
redundant and added a large volume of code to avoid a single-instruction cast 
and so is questionable. However, since we made it, we need to make it work. 
This PR fixes a few holes found during this work.
   
   ## Documentation
   
   The streaming form of JSON output is used only for REST queries: 
`query.json`. It is not used for HTML. The change is invisible to the user 
except that there is no longer a limit to the size of query results that the 
REST API can return.
   
   The Joda-to-Java time implementation change should be transparent to users 
except in one very specific case: if users have created a provided schema that 
includes a date/time format string. Such strings must be updated to Java 
date/time format. Provided schema is, however, an obscure feature, so it is 
unlikely that any users are affected.
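
   To give a feel for the kind of update involved, here is a small sketch of 
one pattern letter that differs between the two libraries. In Joda-Time, `ZZ` 
denotes a zone offset with a colon (such as `+01:00`); in `java.time` the 
pattern letters `xxx` produce that form. The pattern strings and class name 
below are made up for this example and are not taken from any Drill schema.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: the patterns are illustrative, not from Drill.
public class JavaTimeFormatSketch {

  // A Joda-Time style pattern: ZZ means an offset with a colon.
  static final String JODA_PATTERN = "yyyy-MM-dd'T'HH:mm:ssZZ";

  // A java.time pattern for the same text: xxx means an offset with a colon.
  static final String JAVA_PATTERN = "yyyy-MM-dd'T'HH:mm:ssxxx";

  /** Parses a timestamp using the java.time form of the pattern. */
  static OffsetDateTime parse(String text) {
    return OffsetDateTime.parse(text, DateTimeFormatter.ofPattern(JAVA_PATTERN));
  }

  public static void main(String[] args) {
    OffsetDateTime t = parse("2021-01-22T10:15:30+01:00");
    System.out.println(t.getOffset());
  }
}
```

   Many common letters (`yyyy`, `MM`, `dd`, `HH`, `mm`, `ss`) read the same in 
both libraries, so only patterns that use the differing letters would need 
attention.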
   
   ## Testing
   
   Most changes are for the Joda replacement. All tests were rerun and updated 
as needed. Drill previously had no unit tests for the REST API. This PR adds a 
few simple tests, and instructions for quickly using them to run ad-hoc 
tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use streaming for REST JSON queries
> -----------------------------------
>
>                 Key: DRILL-7733
>                 URL: https://issues.apache.org/jira/browse/DRILL-7733
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.17.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: 1.19.0
>
>
> Several users on the user and dev mail lists have complained about the memory 
> overhead when running a REST JSON query: {{http://node:8047/query.json}}. 
> The current implementation buffers the entire result set in memory, then lets 
> Jersey/Jetty convert the results to JSON. The result is very heavy heap use 
> for larger query result sets.
> This ticket requests a change to use streaming. As each batch arrives at the 
> Screen operator, convert that batch to JSON and directly stream the results 
> to the client network connection, much as is done for the native client 
> connection.
> For backward compatibility, the form of the JSON must be the same as the 
> current API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
