siddharthteotia commented on issue #5246:
URL: https://github.com/apache/pinot/issues/5246#issuecomment-1183488539

   At LinkedIn, we have started working on pagination as a priority, given the 
multiple requests we have received internally.
   
   At a high level, the customer requirement is this: they want to run a query 
in Pinot that can potentially return a large response, and the user app does 
not want to hold the entire response in memory at once; it wants the ability to 
consume the response as multiple result sets (with the size of each result set 
dictated by the user app).
   
   The current pagination implementation in Pinot (even though it is only for 
selection queries) is sub-optimal: it treats each page request as a fresh 
query, re-executes the query for every pagination window, discards the results 
outside the window, and returns only the rows within the M, N window that the 
user asked for.
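
   To make the cost concrete, here is a minimal sketch (not Pinot's actual code; 
`page` is a hypothetical stand-in for query execution) of what re-execution per 
window means: every page call still processes all rows before the offset and 
then throws them away.

   ```java
   import java.util.List;
   import java.util.stream.Collectors;
   import java.util.stream.IntStream;

   public class NaiveOffsetPagination {
       // Simulates the current behavior: every page request re-executes the
       // full selection and discards everything outside [offset, offset + limit).
       static List<Integer> page(List<Integer> fullResult, int offset, int limit) {
           return fullResult.stream()
                   .skip(offset)   // rows before the window are produced, then thrown away
                   .limit(limit)
                   .collect(Collectors.toList());
       }

       public static void main(String[] args) {
           List<Integer> allRows =
                   IntStream.range(0, 1000).boxed().collect(Collectors.toList());
           // Fetching the window at offset 200 still walks over rows 0..199 first.
           List<Integer> window = page(allRows, 200, 100);
           System.out.println(window.get(0) + ".." + window.get(window.size() - 1)); // 200..299
       }
   }
   ```

   The deeper the page (larger M), the more work is repeated and discarded, 
which is why a stateful, single-query treatment is preferable.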
   
   The main thing to note about pagination is that it has to be treated as a 
single query. 
   
   Our customers don't want to run a one-off pagination query (OFFSET M, FETCH 
N) where M and N are essentially arbitrary; in that case it is not possible to 
reason about the results, and it is even hard for the user to pick M as a 
one-off starting point.
   
   The semantics we want to provide are: "I want to fetch 10 million records 
from Pinot for a query, 100K at a time."
   
   So the customer will typically start with M = 0, keep N fixed (say at 10,000 
or so), and page through the results over multiple calls from their app, simply 
advancing M on every call (potentially refreshing the results returned by Pinot 
in their UI on each call).
   
   I think we should look at the pagination problem from this perspective 
rather than as a random one-off pagination query, because arbitrary M and N 
don't make sense in isolation. The result of a random pagination query doesn't 
add any value for the user, since they want to consume the entire result as a 
continuous stream with the option to stop at any time.
   
   We are trying to tackle the problem from the above perspective when 
designing the pagination semantics. A detailed design discussion is in progress.
   
   Some more thoughts slightly related to this -- 
   
   Now, one problem is that users who run such queries may assume that support 
for pagination means they can run "any" query in Pinot, however long running, 
and that Pinot is guaranteed to finish it and return results. This can easily 
cause OOM (out of memory) errors and bring down the cluster.
   
   Pinot is unlikely to enter the territory of running very long-running 
queries and returning the entire 100% accurate result by spilling to disk and 
avoiding OOM at all costs; Presto should be used for those cases.
   
   However, for some of our users (who are OK with multi-second latency and 
prefer more accurate responses for GROUP BY queries), as a follow-up / next 
phase we want to consider enhancing Pinot's support for queries that return 
large responses and/or process and aggregate more data than usual. We want to 
do this by moving some of the memory-intensive query execution operations into 
off-heap (direct) memory.
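
   As a minimal illustration of the off-heap idea (a sketch, not Pinot's 
design): in Java, direct `ByteBuffer`s are allocated outside the JVM heap, so 
large intermediate state kept there does not contribute to heap pressure or 
heap OOM.

   ```java
   import java.nio.ByteBuffer;

   public class OffHeapSketch {
       public static void main(String[] args) {
           // A direct buffer lives outside the JVM heap; large intermediate
           // results (e.g. aggregation state) held here avoid GC pressure.
           ByteBuffer offHeap = ByteBuffer.allocateDirect(64 * 1024 * 1024); // 64 MB
           offHeap.putLong(0, 42L); // read/write works the same as a heap buffer
           System.out.println(offHeap.isDirect() + " " + offHeap.getLong(0)); // true 42
       }
   }
   ```

   Off-heap memory still needs explicit accounting and limits, but it moves the 
failure mode away from JVM heap OOM.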


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

