I have come upon an interesting problem with pagination that I was 
wondering if anyone else has solved elegantly.  The problem is best 
described by Twitter's dev docs: 
 https://dev.twitter.com/rest/public/timelines.

Essentially, using the from and size parameters 
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-from-size.html)
makes it very hard to get the correct documents for page two of the results 
if one or more documents have been added since page one was loaded and the 
index is sorted from newest to oldest.  Twitter suggests summing the offset 
(or from) parameter with the number of documents added since the previous 
request; however, that solution relies on the client having the correct 
count of documents added since the first page was loaded.
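
As a point of reference, here is a minimal sketch of the naive from/size 
pagination I'm describing, using the Python elasticsearch client; the index 
name ("tweets") and sort field ("created_at") are placeholders I've made up 
for illustration:

from elasticsearch import Elasticsearch

es = Elasticsearch()
PAGE_SIZE = 10  # documents per page

def fetch_page(page_number):
    # Naive pagination: page N simply skips page_number * PAGE_SIZE documents.
    # If new documents arrive between requests, every later page's window
    # shifts and can repeat documents the client has already seen.
    return es.search(
        index="tweets",  # placeholder index name
        body={
            "sort": [{"created_at": "desc"}],  # newest to oldest
            "from": page_number * PAGE_SIZE,
            "size": PAGE_SIZE,
        },
    )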

For example, the following index contains documents sorted from newest to 
oldest:

E (newest)
D
C
B
A (oldest)

If each page has a single document, the first page will contain document E, 
and the offset (from) parameter for the next page will be 1, with the 
expectation of getting document D on the second page (since there is one 
document per page).  However, since the first page was loaded, document G 
has been added to the index.

Now the index looks like this:

G (newest)
E
D
C
B
A (oldest)

Using the offset or from parameter of 1 in this case will return document 
E...  Again.  This is NOT the intended behavior and leads to duplicate 
documents being returned.
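
To make that concrete, here is roughly what the two requests look like with 
the Python client (same placeholder index and field names as above), showing 
how the second page hands back E a second time once G has been indexed:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Page one: from=0, size=1 returns E, the newest document at that moment.
page_one = es.search(
    index="tweets",
    body={"sort": [{"created_at": "desc"}], "from": 0, "size": 1},
)

# ... document G is indexed in the meantime ...

# Page two: from=1, size=1 now skips G (the new newest document) and
# returns E again instead of the D we were expecting.
page_two = es.search(
    index="tweets",
    body={"sort": [{"created_at": "desc"}], "from": 1, "size": 1},
)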


The only solution I've come up with doesn't seem ideal.

For the first page I'll perform the same actions as in the example above, 
except that in addition to returning document E, the response will also 
include the total number of documents in the index.  For the index E through 
A that would be 5 documents.  Accessing any page after the first will then 
require providing the total count obtained with the first request; let's 
call that startSize.  We'll also still pass the offset of 1, as before.  On 
the second and all subsequent requests we'll invert the sort order of the 
documents to oldest to newest.

The inverted index will look like this:

A (oldest)
B
C
D
E
G (newest)

The number of documents per page will be referred to as pageSize (the size 
param in ES).  The from parameter will be calculated using the following 
formula:

from = startSize - offset - pageSize
        = 5 - 1 - 1
        = 3

while

size = pageSize
       = 1

Using the inverted index and the calculated parameters gives us document D, 
which is the expected result for page two prior to document G being added 
to the index.  On page 3 we'll get document C, and so on.
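
A rough sketch of the whole scheme, using the same placeholder names as 
above and assuming hits.total comes back as a plain count (as it does in 
the current 1.x API):

from elasticsearch import Elasticsearch

es = Elasticsearch()
PAGE_SIZE = 1  # one document per page, as in the example

def fetch_first_page():
    # Page one: normal newest-to-oldest query, but also capture the total
    # number of documents at this moment; the client holds on to it and
    # sends it back as startSize with every later request.
    result = es.search(
        index="tweets",
        body={"sort": [{"created_at": "desc"}], "from": 0, "size": PAGE_SIZE},
    )
    start_size = result["hits"]["total"]  # 5 for the index E through A
    return result, start_size

def fetch_later_page(start_size, offset):
    # Pages two and up: invert the sort to oldest-to-newest and count back
    # from startSize, so documents added after the first request can't
    # shift the window.  For page two: from = 5 - 1 - 1 = 3 -> document D.
    from_param = start_size - offset - PAGE_SIZE
    return es.search(
        index="tweets",
        body={"sort": [{"created_at": "asc"}], "from": from_param, "size": PAGE_SIZE},
    )

(With PAGE_SIZE greater than 1, the hits on later pages come back 
oldest-first, so the client would also need to reverse each page before 
displaying it.)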

That formula gives us the expected results when working with indices that 
are sorted from newest to oldest, constantly growing, and accessed with 
pagination.  I don't see this algorithm significantly increasing the cost 
of accessing the API, but with that said, I can't help but think I've let 
the early hours of the morning get the best of me.

Is there a better solution, or something built into Elasticsearch to handle 
this use case?

Thanks in advance!
