juliuszsompolski opened a new pull request #26014: [SPARK-29349] Support 
FETCH_PRIOR in Thriftserver fetch request
URL: https://github.com/apache/spark/pull/26014
 
 
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: 
https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: 
https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., 
'[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a 
faster review.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section 
is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster 
reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class 
hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other 
DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce any user-facing change?
   <!--
   If yes, please clarify the previous behavior and the change this PR proposes 
- provide the console output, description and/or an example to show the 
behavior difference if possible.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some 
test cases that check the changes thoroughly including negative and positive 
cases if possible.
   If it was tested in a way different from regular unit tests, please clarify 
how you tested step by step, ideally copy and paste-able, so that other 
reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why 
it was difficult to add.
   -->
   
   ## What changes were proposed in this pull request?
   
   Support FETCH_PRIOR fetching in Thriftserver, and report correct fetch start 
offset it TFetchResultsResp.results.startRowOffset
   
   The semantics of FETCH_PRIOR are as follow: Assuming the previous fetch 
returned a block of rows from offsets [10, 20)
   * calling FETCH_PRIOR(maxRows=5) will scroll back and return rows [5, 10)
   * calling FETCH_PRIOR(maxRows=10) again, will scroll back, but can't go 
earlier than 0. It will nevertheless return 10 rows, returning rows [0, 10) 
(overlapping with the previous fetch)
   * calling FETCH_PRIOR(maxRows=4) again will again return rows starting from 
offset 0 - [0, 4)
   * calling FETCH_NEXT(maxRows=6) after that will move the cursor forward and 
return rows [4, 10)
   
   It's intended to be used to recover after connection errors. If a client 
lost connection during fetching (e.g. of rows [10, 20)), and wants to reconnect 
and continue, it could not know whether the request  got lost before reaching 
the server, or on the response back. When it issued another FETCH_NEXT(10) 
request after reconnecting, because TFetchResultsResp.results.startRowOffset 
was not set, it could not know if the server will return rows [10,20) (because 
the previous request didn't reach it) or rows [20, 30) (because it returned 
data from the previous request but the connection got broken on the way back). 
Now, with TFetchResultsResp.results.startRowOffset the client can know after 
reconnecting which rows it is getting, and use FETCH_PRIOR to scroll back if a 
fetch block was lost in transmission.
   
   ### Backwards/forwards compatibility
   
   Old driver with new server:
   * Drivers that don't support FETCH_PRIOR will not attempt to use it
   * Field TFetchResultsResp.results.startRowOffset was not set, old drivers 
don't depend on it.
   
   New driver with old server
   * Using an older thriftserver with FETCH_PRIOR will make the thriftserver 
return unsupported operation error. The driver can then recognize that it's an 
old server.
   * Older thriftserver will return TFetchResultsResp.results.startRowOffset=0. 
If the client driver receives 0, it can know that it can not rely on it as 
correct offset. If the client driver intentionally wants to fetch from 0, it 
can use FETCH_FIRST.
   
   Driver should always use FETCH_PRIOR after a broken connection.
   * If the Thriftserver returns unsuported operation error, the driver knows 
that it's an old server that doesn't support it. The driver then must error the 
query, as it will also not support returning the correct startRowOffset, so the 
driver cannot reliably guarantee if it hadn't lost any rows on the fetch cursor.
   * If the driver gets a response to FETCH_PRIOR, it should also have a 
correctly set startRowOffset, which the driver can use to position itself back 
where it left off before the connection broke.
   * If FETCH_NEXT was used after a broken connection on the first fetch, and 
returned with an startRowOffset=0, then the client driver can't know if it's 0 
because it's the older server version, or if it's genuinely 0. Better to call 
FETCH_PRIOR, as scrolling back may anyway be possibly required after a broken 
connection.
   
   This way it is implemented in a backwards/forwards compatible way, and 
doesn't require bumping the protocol version. FETCH_ABSOLUTE might have been 
better, but that would require a bigger protocol change, as there is currently 
no field to specify the requested absolute offset.
   
   ## How was this patch tested?
   
   Added test.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to