I haven't read RFC-5005, but we're building a data service that sounds
somewhat similar to what you're doing, so I'll chime in on pagination...
Along the same lines as what David said, what we've done to address
pagination is a solution based on the way Google has integrated
OpenSearch into its querying APIs for GData.
implementation-wise, the trick is to add a monotonically increasing
sequence number to every entry that we store, which we then use to get
very stable pagination when we pull feeds.
here's how it works:
- as each entry is written to the store, it gets the next sequence
number to be assigned.
- if an entry is later updated, its previous sequence number is
overwritten with the next number in the sequence (see the sketch just
after this list).
- we've modified our feed urls to accept an optional "start-index"
request parameter, which is an inclusive lower bound on the sequence
numbers to return.
- when there are more results than the requested page-size, we add a
"next" link to the feed we return, which points to the same URI as the
requested feed, but with start-index set to one higher than the
highest sequence number on this page.
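here's a minimal sketch of that write path in Java (the class and
method names are mine for illustration, and the in-memory map stands
in for our real store -- these are not Abdera APIs):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// every write (POST or PUT) stamps the entry with the next value from a
// single monotonically increasing counter, so an updated entry moves to
// the "end" of the feed's paging order
class SequencedStore {
    private final AtomicLong counter = new AtomicLong(0);
    // entry id -> sequence number (the meta_data table in the example below)
    private final Map<String, Long> sequenceNums = new ConcurrentHashMap<>();

    // POST: a brand-new entry gets the next sequence number
    void insert(String id) {
        sequenceNums.put(id, counter.incrementAndGet());
    }

    // PUT: the entry's previous sequence number is overwritten with a
    // fresh one -- deliberately the very same operation as insert
    void update(String id) {
        sequenceNums.put(id, counter.incrementAndGet());
    }
}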
what this guarantees is:
- you will never miss any data because of data changing while you
read the feed (very important to us, since, like you, we are using
this as a back-end data service to keep data synced across systems)
- you will only ever see the same entry twice during a feed pull if
it was in fact updated while you were paginating through the data (in
which case, you would want to get it twice to be maximally in sync!)
here's Google's reference on the subject ==>
http://code.google.com/apis/gdata/reference.html
In case my explanation above is a bit fuzzy, I'll give a simple
example of how this all works:
let's say we start off with a totally empty feed store of "colors",
and someone POSTS the entries RED, BLUE, GREEN, and YELLOW to the feed
- our DB table that stores entry meta data would look like:
id      sequence_num
RED     1
BLUE    2
GREEN   3
YELLOW  4
and the query we use to get the page of results, in pseudo-SQL, is
something like:
SELECT TOP page_size id
FROM meta_data
WHERE "it matches my feed request URI"
AND sequence_num >= start_index
ORDER BY sequence_num
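for concreteness, here's roughly what that query might look like
through plain JDBC (a sketch under assumptions: the table and column
names from the example above, a feed_uri column standing in for "it
matches my feed request URI", and the TOP clause swapped for a
MySQL/PostgreSQL-style LIMIT):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class FeedQuery {
    // returns up to pageSize entry ids for one page of the feed, starting
    // at startIndex (an inclusive lower bound on sequence_num)
    static List<String> fetchPage(Connection conn, String feedUri,
                                  long startIndex, int pageSize) throws SQLException {
        String sql = "SELECT id FROM meta_data"
                   + " WHERE feed_uri = ? AND sequence_num >= ?"
                   + " ORDER BY sequence_num"
                   + " LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, feedUri);
            ps.setLong(2, startIndex);
            ps.setInt(3, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                List<String> ids = new ArrayList<>();
                while (rs.next()) {
                    ids.add(rs.getString("id"));
                }
                return ids;
            }
        }
    }
}

in practice you'd probably ask for pageSize + 1 rows, so you can tell
whether a "next" link is needed without issuing a second query.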
so, if someone came along to pull the feed
http://my.server/data/colors?page-size=3, they would get (in
pseudo-feed-xml-ish):
<feed>
<entry id="RED">...</entry>
<entry id="BLUE">...</entry>
<entry id="GREEN">...</entry>
<link rel="next" href="http://my.server/data/colors?page-size=3&amp;start-index=4"/>
</feed>
then, if they followed the link
http://my.server/data/colors?page-size=3&start-index=4, they would get:
<feed>
<entry id="YELLOW">...</entry>
</feed>
that's it for the simplest case -- if, however, in the time between
when the user pulled the first page of the feed and the second,
someone had PUT an update to GREEN, and POSTED a new entry PURPLE, the
DB table would now look like:
id      sequence_num
RED     1
BLUE    2
GREEN   5    -- GREEN is updated to 5 by the PUT
YELLOW  4
PURPLE  6    -- PURPLE is then inserted with sequence_num 6
and when the user follows the link
http://my.server/data/colors?page-size=3&start-index=4, they would
instead get:
<feed>
<entry id="YELLOW">...</entry>
<entry id="GREEN">...</entry>
<entry id="PURPLE">...</entry>
</feed>
note that they have now gotten GREEN twice during the "same" feed
pull, but that's as it should be, because GREEN changed between page 1
and page 2 of the feed.
it's a nice side effect of this solution that it works just fine no
matter how long you, as the client, take to process a page of results,
or how long you choose to wait between pages - whenever you request
the next page, it is guaranteed to simply be "the next N significant
changes after the page I previously pulled", where N is your page
size. (a minimal client-side loop is sketched just below.)
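to make that concrete, here's what such a client loop might look like
in Java (pullFeed, process, and findNextLink are hypothetical names,
and the regex is only for the sketch -- a real client would use an
Atom parser such as Abdera to find the rel="next" link):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FeedPuller {
    // pull every page of a feed by following rel="next" links until none
    // remain; however long we pause between iterations, the next page is
    // always just "the next N significant changes after the last one"
    static void pullFeed(String firstPageUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Optional<String> next = Optional.of(firstPageUrl);
        while (next.isPresent()) {
            HttpRequest req = HttpRequest.newBuilder(URI.create(next.get())).GET().build();
            String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
            process(body); // apply this page's entries to the local replica
            next = findNextLink(body);
        }
    }

    // naive extraction of the rel="next" href, good enough for the sketch
    static Optional<String> findNextLink(String feedXml) {
        Matcher m = Pattern.compile("<link rel=\"next\" href=\"([^\"]+)\"").matcher(feedXml);
        return m.find() ? Optional.of(m.group(1).replace("&amp;", "&")) : Optional.empty();
    }

    static void process(String feedXml) { /* application-specific */ }
}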
one more thing that's maybe worth noting is this - notice I said
"significant" changes -- this solution hides from you any changes that
were superseded before you got around to them -- which is almost
certainly what you want, but it's worth being aware of. if, after our
client had pulled the first page of results back, the following things
had happened:
PUT GREEN
POST PURPLE
PUT GREEN (again)
the DB would look like:
id      sequence_num
RED     1
BLUE    2
GREEN   7    -- GREEN is updated to 5 by the first PUT, then to 7 by the second
YELLOW  4
PURPLE  6    -- PURPLE is inserted with sequence_num 6
and when the user follows the link
http://my.server/data/colors?page-size=3&start-index=4, they would
instead get:
<feed>
<entry id="YELLOW">...</entry>
<entry id="PURPLE">...</entry>
<entry id="GREEN">...</entry>
</feed>
note that on page one, we saw the initial revision of GREEN, and on
page two we see the THIRD revision of GREEN -- we never saw the second.
again, unless you're doing something pretty unusual with Atom feeds,
you probably don't care, because if you HAD gotten the second, you
would have just overwritten it with the third - but it's worth being
aware of what's really happening.
Hope this helps - if you have any more questions (or critiques!) about
this strategy, we'd love to hear them. thanks!
- Bryon
On Dec 14, 2007, at 2:40 PM, David Calavera wrote:
why don't you use the openSearch format?
On Dec 14, 2007 8:57 PM, Remy Gendron <[EMAIL PROTECTED]> wrote:
My APP server isn't used in the context of a standard feed provider.
It will be more of a web interface to a backend data server. We are
leveraging Atom/APP/REST as a generic data provider interface for our
web services.
That's why your suggestion, although pretty good for feeds, is not
applicable here. I really want to chunk large datasets/search results.
I am also willing to live with some infrequent inconsistencies while
scanning the pages following concurrent create/delete ops.
My question was really about naming conventions when providing the
page size and page index as URL parameters.
Thanks again,
- Remy
-----Original Message-----
From: James M Snell [mailto:[EMAIL PROTECTED]
Sent: 14 December 2007 13:57
To: [email protected]
Subject: Re: RFC-5005 best practices
I've implemented paging a number of times. The easiest approach has
always been to use page and pagesize. Doing so, however, has its
disadvantages. For one, the pages are unstable -- that is, as new
entries are added to the collection, the entries slide through the
pages, making it difficult for a client to completely and consistently
sync up the changes. An alternative approach would be to base paging
on date ranges, where each page could represent all entries modified
within a given period of time. Such pages will generally be much less
volatile over time.
- James
Remy Gendron wrote:
Hello all,
I'm implementing paging in my Abdera server. FeedPagingHelper covers
the spec…
But do you recommend any best practices on passing in the parameters?
(pageSize, pageIndex)
I haven't seen any recommendations from Abdera… Do you recommend
Google's GData query extensions?
Thanks a lot for the great implementation!
Rémy
[EMAIL PROTECTED]
418 809-8585
http://www.arrova.ca
--
David Calavera
http://www.thinkincode.net