[ 
https://issues.apache.org/jira/browse/PHOENIX-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467137#comment-15467137
 ] 

ASF GitHub Bot commented on PHOENIX-2606:
-----------------------------------------

Github user ankitsinghal commented on the issue:

    https://github.com/apache/phoenix/pull/192
  
    @anirudha, thanks for the pull request. Is it possible for you to change 
the subject to include PHOENIX-2606, or link it somehow, so that we can see 
the comments logged on JIRA as well?
    
    
    With the current implementation, every query is modified to include the 
primary key columns and tries to use a row value constructor (RVC). This may 
produce incorrect results in the following cases, and sometimes there is no 
advantage in using it because it does not limit the scan:
    
    - Duplicate values for a column.
    - ORDER BY on a non-primary-key axis.
    - Primary key columns having null values.
    - Aggregate queries.
    
    
    @samarthjain, can you please confirm, based on your observations, which 
queries RVC cannot be used for?
    
    @anirudha 
    To abstract away the query complexities, and for initial support of 
cursors, we should not modify any query. Instead, we can keep the ResultSet 
object open for the corresponding cursor (with a timeout) and cache rows as we 
proceed with next() calls (FETCH NEXT FROM cursor). The cache will then serve 
previous() calls (FETCH PRIOR FROM cursor) on the ResultSet, and also next() 
calls made after we have moved back in the cache with previous().
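
    A minimal sketch of the caching idea described above, not the pull 
request's implementation. The class name CachingCursor is hypothetical, and a 
plain java.util.Iterator stands in for the open, forward-only ResultSet; the 
timeout and cache size limit are left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical client-side cursor that caches rows read from a forward-only source. */
final class CachingCursor<T> {
    // Assumption: this Iterator stands in for the ResultSet kept open for the cursor.
    private final Iterator<T> source;
    // Rows already fetched, in fetch order (unbounded in this sketch).
    private final List<T> cache = new ArrayList<>();
    // Index in the cache of the current row; -1 means "before the first row".
    private int position = -1;

    CachingCursor(Iterator<T> source) {
        this.source = source;
    }

    /** FETCH NEXT: replay from the cache if we moved back, otherwise read the source. */
    T next() {
        if (position + 1 < cache.size()) {
            return cache.get(++position);
        }
        if (!source.hasNext()) {
            throw new NoSuchElementException("cursor exhausted");
        }
        T row = source.next();
        cache.add(row);
        position++;
        return row;
    }

    /** FETCH PRIOR: served from the cache only, since the source cannot go backwards. */
    T previous() {
        if (position <= 0) {
            throw new NoSuchElementException("no prior row");
        }
        return cache.get(--position);
    }
}
```

    With rows a, b, c in the source, the call sequence next(), next(), 
previous(), next(), next() touches the source only three times; the second 
read of b is answered from the cache.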
    
    Pros/Cons of above approach:-
    
    Pros:-
    
    - Highly abstracted: we don’t need to understand each and every query and 
develop logic separately for it.
    - Unlike the current implementation, we don’t need to expand ORDER BY (on 
a non-primary-key axis) every time to include the primary key columns for 
uniqueness. Doing so causes a problem at the server, because it has more keys 
(almost all keys) to sort every time, and RVC will not restrict the scan in 
this case.
    - The cache size can be limited by the user. If we exhaust the cache, it 
can be refreshed by re-running the query with LIMIT+OFFSET only.
    - As an optimization for flat queries (including INDEX queries), the cache 
can be updated using the last and peeked SCAN keys instead of LIMIT+OFFSET 
(scan keys rather than RVC, so that nulls and duplicates are handled properly).
    - Snapshot/static queries can be provided by storing the compile time of 
OPEN CURSOR, which can then be used as the upper bound on timestamps for the 
scan.
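
    The LIMIT+OFFSET refresh mentioned in the list above could look roughly 
like the sketch below. The names CacheRefill, windowQuery and windowStart are 
hypothetical, and the sketch assumes the cursor's base query does not already 
end in its own LIMIT or OFFSET clause.

```java
/** Hypothetical helper for refilling a bounded cursor cache by re-running the query. */
final class CacheRefill {
    private CacheRefill() {
    }

    /** First absolute row index (0-based) of the cache window that contains {@code row}. */
    static long windowStart(long row, int cacheSize) {
        if (row < 0 || cacheSize <= 0) {
            throw new IllegalArgumentException("row must be >= 0 and cacheSize > 0");
        }
        return (row / cacheSize) * cacheSize;
    }

    /**
     * Builds the query to re-run for one cache window.
     * Assumption: baseQuery carries no LIMIT/OFFSET of its own.
     */
    static String windowQuery(String baseQuery, int cacheSize, long firstRow) {
        if (cacheSize <= 0 || firstRow < 0) {
            throw new IllegalArgumentException("cacheSize must be > 0 and firstRow >= 0");
        }
        return baseQuery + " LIMIT " + cacheSize + " OFFSET " + firstRow;
    }
}
```

    Paging by offset returns consistent windows only if the query's row order 
is deterministic between runs, which is one reason the scan-key optimization 
is listed separately for flat queries.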
    
    Cons:-
    
    - previous() is not supported on the server (HBase), so there is cache 
overhead at the client to retain results in order to support previous().
    - Results must be recalculated after we reach the cache limit or the 
scanner times out.
    
    
    @JamesRTaylor, WDYT? Any suggestions?


> Cursor support in Phoenix
> -------------------------
>
>                 Key: PHOENIX-2606
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2606
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Sudarshan Kadambi
>
> Phoenix should look to support a cursor model where the user could set the 
> fetch size to limit the number of rows that are fetched in each batch. Each 
> batch of result rows would be accompanied by a flag indicating if there are 
> more rows to be fetched for a given query or not. 
> The state management for the cursor could be done in the client side or 
> server side (i.e. HBase, not the Query Server). The client side state 
> management could involve capturing the last key in the batch and using that 
> as the start key for the subsequent scan operation. The downside of this 
> model is that if there were any intervening inserts or deletes in the result 
> set of the query, backtracking on the cursor would reflect these additional 
> rows (consider a page down, followed by a page up showing a different set of 
> result rows). Similarly, if the cursor is defined over the results of a join 
> or an aggregation, these operations would need to be performed again when the 
> next batch of result rows are to be fetched. 
> So an alternate approach could be to manage the state server side, wherein 
> there is a query context area in the Regionservers (or, maybe just a 
> temporary table) and the cursor results are fetched from there. This ensures 
> that the cursor has snapshot isolation semantics. I think both models make 
> sense but it might make sense to start with the state management completely 
> on the client side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
