Paginating Content

Stefano Mazzocchi Thu, 06 Jun 2002 05:55:19 -0700

Konstantin Piroumian wrote:

> Hm... Does anybody have an idea on how to paginate the content?


Ok, damn it, I don't have time to mark this up, but since it's the
content that is useful, here's a small tutorial for the Paginator.

                                   - 0 -

Paginator Transformer
=====================

classname: org.apache.cocoon.transformation.pagination.Paginator
location: scratchpad (available in both cocoon 2.1-dev and 2.0.3-dev)

Design idea
-----------

The paginator is a 'FilterTransformer' on pagination steroids. It works
by filtering out SAX events and counting pages.

The design isn't very efficient since it has to process the entire file
to extract a single page. It works nicely with a few tens of pages, but I
would seriously suggest *against* using it for books or very big
documents.

The good news is that it's cacheable, so if the document doesn't change
and the same page is requested, there is no need to reprocess the
document.

Anyway, for static generation, all this doesn't really matter.
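
To make the cost concrete, here is a minimal Python sketch of the
counting idea. It is not the Paginator's code: the CountingHandler class
and the scan() function are hypothetical names, and the sketch only
counts elements instead of filtering them out. What it illustrates is
that the whole document has to be read even when one page is wanted.

```python
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """Counts start-element events for one element name (hypothetical helper)."""

    def __init__(self, name, per_page, wanted_page):
        super().__init__()
        self.name = name
        self.per_page = per_page
        self.wanted_page = wanted_page
        self.seen = 0        # matching elements seen so far
        self.on_wanted = 0   # matching elements that fall on the wanted page

    def startElement(self, tag, attrs):
        if tag != self.name:
            return
        page = self.seen // self.per_page + 1  # pages are numbered from 1
        self.seen += 1
        if page == self.wanted_page:
            self.on_wanted += 1

    @property
    def total_pages(self):
        # Ceiling division: a partially filled last page still counts.
        return -(-self.seen // self.per_page)


def scan(xml_text, name, per_page, wanted_page):
    """Parse the entire document; return (elements on page, total pages)."""
    handler = CountingHandler(name, per_page, wanted_page)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.on_wanted, handler.total_pages


DOC = "<a>" + "<b/>" * 7 + "</a>"
print(scan(DOC, "b", 3, 3))
```

The total is only known once the last event has been seen, which is why
a single page costs a full pass over the document.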

A simple example of use
-----------------------

Suppose you have an XML file like this

 <a>
  <b/>
  <b/>
  <b/>
  <b/>
  <b/>
  <b/>
  <b/>
 </a>

and you want to paginate this having 3 <b> elements per page. In order
to achieve this, you write a simple "pagesheet" (which contains the
instructions for the filter, much like a stylesheet gives instructions
to an xslt processor) like this:

<?xml version="1.0"?>
<pagesheet xmlns="http://apache.org/cocoon/paginate/1.0">
 <rules>
  <count type="element" name="b" num="3"/>
 </rules>
</pagesheet>

then you connect the two with a sitemap snippet like this:

   <map:match pattern="page(*)">
    <map:generate src="document.xml"/>
    <map:transform type="paginate" src="pagesheets/images.xml">
      <map:parameter name="page" value="{1}"/>
    </map:transform>
    <map:serialize type="xml"/>
   </map:match>

and accessing the URI page(1) yields

 <a>
  <b/>
  <b/>
  <b/>
  <page:page xmlns:page="http://apache.org/cocoon/paginate/1.0"
     current="1"
     total="3"
     current-uri="page(1)"
     clean-uri="page"
  />
 </a>

which can be easily transformed into something more meaningful.

Note that the transformer processes all the pages to obtain the 'total'.
There is no way around this.

Adding navigation
-----------------

The problem with XSLT-based pagination is that the logic is very complex
to define in XSLT and is rarely reusable across different pagination
needs. This was the main reason for creating a custom component for
this.

But since we have a full-blown pagesheet language, there are a few other
things that we can make the Paginator do, most importantly navigation.

For example, this other pagesheet

<?xml version="1.0"?>
<pagesheet xmlns="http://apache.org/cocoon/paginate/1.0">
 <rules>
  <count type="element" name="b" num="3"/>
  <link type="unit" num="1"/>
 </rules>
</pagesheet>

indicates that the transformer must understand how the page was encoded
in the given URI and provide a link to the pages +/- 1 position, if they
are available.

So, using the same environment as before we get

 <a>
  <b/>
  <b/>
  <b/>
  <page:page xmlns:page="http://apache.org/cocoon/paginate/1.0"
     current="1"
     total="3"
     current-uri="page(1)"
     clean-uri="page">
   <page:link page="2" type="next" uri="page(2)"/>
  </page:page>
 </a>

which indicates that

 1) there is no page 0, so no 'prev' link is created;
 2) the only link goes to page 2, its type is 'next' (useful for
    visualization) and its URI is page(2) (useful for linking without
    XSLT-specific logic).

NOTE: the URI is re-encoded using the same pattern; this paginator
assumes that round brackets are used to identify page numbering.
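
As an illustration of that convention, here is a small Python sketch
that splits a URI such as page(2) into its 'clean' part and its page
number and then re-encodes it. The function names and the regular
expression are assumptions made for this example, not the Paginator's
actual implementation.

```python
import re

# Assumed convention: the page number is the trailing "(digits)" group.
PAGE_RE = re.compile(r"^(?P<clean>.*)\((?P<page>\d+)\)$")


def decode_page_uri(uri):
    """Split 'page(2)' into ('page', 2)."""
    match = PAGE_RE.match(uri)
    if match is None:
        raise ValueError(f"no page number in round brackets: {uri!r}")
    return match.group("clean"), int(match.group("page"))


def encode_page_uri(clean_uri, page):
    """Rebuild a URI for another page using the same pattern."""
    return f"{clean_uri}({page})"


clean, current = decode_page_uri("page(2)")
print(clean, current, encode_page_uri(clean, current + 1))
```

The 'clean' part corresponds to the clean-uri attribute shown above, and
re-encoding it with a neighbouring page number gives the uri of a link.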

Now, without changing anything, requesting page(2) would yield

 <a>
  <b/>
  <b/>
  <b/>
  <page:page xmlns:page="http://apache.org/cocoon/paginate/1.0"
     current="2"
     total="3"
     current-uri="page(2)"
     clean-uri="page">
   <page:link page="1" type="prev" uri="page(1)"/>
   <page:link page="3" type="next" uri="page(3)"/>
  </page:page>
 </a>

while page(3) would yield:

 <a>
  <b/>
  <page:page xmlns:page="http://apache.org/cocoon/paginate/1.0"
     current="3"
     total="3"
     current-uri="page(3)"
     clean-uri="page">
   <page:link page="2" type="prev" uri="page(2)"/>
  </page:page>
 </a>

NOTE: here there is only one <b> because the original document doesn't
contain enough elements to fill the page entirely. It's the remainder of
the division: 7 elements at 3 per page leave 1 for the last page.
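
The arithmetic can be checked with a few lines of Python. The
page_sizes() helper is a hypothetical name used only for this
illustration.

```python
def page_sizes(total_items, per_page):
    """Return how many counted elements land on each page."""
    full_pages, remainder = divmod(total_items, per_page)
    sizes = [per_page] * full_pages
    if remainder:
        sizes.append(remainder)  # the partially filled last page
    return sizes


print(page_sizes(7, 3))
```

The length of the returned list is the 'total' attribute, and its last
entry is the number of elements left over for the final page.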

A real-life example
-------------------

Here are a few pagesheets which are a little more complex:

Paginating the results from DirectoryGenerator:

<?xml version="1.0"?>
<pagesheet xmlns="http://apache.org/cocoon/paginate/1.0">
 <rules>
  <count type="element" name="file"
         namespace="http://apache.org/cocoon/directory/2.0" num="16"/>
  <link type="unit" num="2"/>
  <link type="range" value="5"/>
 </rules>
</pagesheet>

This says:

 1) paginate 16 files per page
 2) provide me with links to the pages at +/- 1 and +/- 2 (when available)
 3) provide me with links to the pages at +/- 5 (when available)

So, suppose we have a directory with 300 files and we request page 10,
the generated page will be

 <dir:directory>
  <dir:file ...>

  [other 15 dir:file]

  <page:page xmlns:page="http://apache.org/cocoon/paginate/1.0"
     current="10"
     total="19"
     current-uri="dir(10)"
     clean-uri="dir">
   <page:range-link page="5" type="prev" uri="dir(5)"/>
   <page:link page="8" type="prev" uri="dir(8)"/>
   <page:link page="9" type="prev" uri="dir(9)"/>
   <page:link page="11" type="next" uri="dir(11)"/>
   <page:link page="12" type="next" uri="dir(12)"/>
   <page:range-link page="15" type="next" uri="dir(15)"/>
  </page:page>
 </dir:directory>
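
The link selection in this example can be sketched in Python as
follows. The nav_links() function is hypothetical and the ordering
simply mirrors the output above; the real transformer may compute and
order its links differently.

```python
def nav_links(current, total, unit=0, rng=None):
    """List (element, page, type) tuples for the pages that exist."""
    links = []
    if rng and current - rng >= 1:
        links.append(("range-link", current - rng, "prev"))
    for step in range(unit, 0, -1):          # farthest 'prev' first
        if current - step >= 1:
            links.append(("link", current - step, "prev"))
    for step in range(1, unit + 1):          # nearest 'next' first
        if current + step <= total:
            links.append(("link", current + step, "next"))
    if rng and current + rng <= total:
        links.append(("range-link", current + rng, "next"))
    return links


for element, page, kind in nav_links(10, 19, unit=2, rng=5):
    print(element, page, kind)
```

Pages below 1 or above the total are simply skipped, which is why the
first and last pages carry fewer links than the ones in the middle.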

Asymmetric pagination
---------------------

We also have the ability to specify different rules for each page, so:

<pagesheet xmlns="http://apache.org/cocoon/paginate/1.0">
 <rules page="1">
  <count type="element" name="b" num="5"/>
  <link type="unit" num="1"/>
 </rules>
 <rules>
  <count type="element" name="b" num="10"/>
  <link type="unit" num="2"/>
 </rules>
</pagesheet>
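
A rough Python sketch of what such a pagesheet would produce, under the
assumption that the rules marked page="1" apply to the first page only
and the unmarked rules apply to every other page. The helper name is
hypothetical and this is a reading of the pagesheet above, not the
transformer's code.

```python
def asymmetric_page_sizes(total_items, first_page, other_pages):
    """Fill page 1 with first_page items, later pages with other_pages."""
    sizes = []
    remaining = total_items
    size = first_page
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size = other_pages  # every page after the first uses the default
    return sizes


print(asymmetric_page_sizes(27, 5, 10))
```

With 27 <b> elements this gives a short first page followed by full
pages and a final partial one.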

Count types
-----------

The paginator works by counting stuff. It's up to you to define what you
want to use for counting and you do so with the attributes of the
<count> element in the pagesheet.

This element supports 2 required attributes:

 num="" -> a number indicating how many times the thing to count must be
present in this page.

 type="" -> the type of counting that the paginator must perform. Two
types are defined, but only one is currently implemented.

    type="element" -> makes the paginator count the startElement() SAX
events

    type="chars" -> (not currently implemented!) makes the paginator
count the chars included in the page.

In case type="element" is used, two other attributes become useful:

 name="" -> the name of the element (without namespace prefix!)

 namespace="" -> the URI of the namespace (if not specified, the default
NS is used)
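
Here is a short Python sketch of matching by local name plus namespace
URI, which is why the prefix never appears in the pagesheet. The
count_elements() function is hypothetical, and treating a missing
namespace attribute as 'no namespace' is an assumption made for the
example.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(xml_text, name, namespace=""):
    """Count start events whose local name and namespace URI both match."""
    # ElementTree spells a qualified name as "{namespace-uri}local-name".
    wanted = f"{{{namespace}}}{name}" if namespace else name
    source = io.BytesIO(xml_text.encode("utf-8"))
    count = 0
    for _event, elem in ET.iterparse(source, events=("start",)):
        if elem.tag == wanted:
            count += 1
    return count


DIR_NS = "http://apache.org/cocoon/directory/2.0"
LISTING = (
    f'<dir:directory xmlns:dir="{DIR_NS}">'
    '<dir:file name="a"/><dir:file name="b"/><file/>'
    "</dir:directory>"
)
print(count_elements(LISTING, "file", DIR_NS))
```

The unprefixed <file/> in the sample is not counted when the directory
namespace is requested, because its namespace URI differs.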

                                      - o -

Ok, from here on, some RT (random thoughts) on the future of this transformer:

Using the paginator for docs
----------------------------

I originally wrote the paginator to paginate a directory listing and it
works great for paginating by counting elements. For docs, it would be
possible to paginate by counting sections or subsections, but this
doesn't necessarily yield visually balanced pages (which is the reason
for web pagination).

This is why I allowed for a way to count by chars, even though I didn't
go as far as implementing it. Paginating by counting elements is ok
(sounds trivial, but it's not! think of nesting!), but paginating by
counting chars is a real pain, due to the algorithms that must perform
'chunking'.

I mean, assume you have a document like this:

 <p>this is some <strong>text</strong> that happens 
 to be <em>chunked</em></p>
             ^
             |
                                               
and suppose that counting the chars leads you to the chunking point
indicated by the arrow above. Cutting the page there results in XML
which is not well-formed. Simply 're-well-forming' the XML at that point
truncates words. So, we must 're-well-form' the XML only once the end of
the first enclosing 'block-level' element is reached (p in this case).
But this means that the pagesheet must contain at least the list of
'block-delimiting' elements (and the current pagesheet parser and
object model don't support this notion).

Result: pagination at the char level is not trivial and requires a
little bit of work on the transformer.
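
To show what breaking only at block boundaries could look like, here is
a hypothetical Python sketch. Nothing like this exists in the Paginator;
it assumes the document has already been reduced to a list of
block-level elements with known character counts, and it closes a page
at the end of the block in which the character budget is reached.

```python
def paginate_blocks(block_lengths, chars_per_page):
    """Group block indexes into pages without ever splitting a block."""
    pages = []
    current = []
    used = 0
    for index, length in enumerate(block_lengths):
        current.append(index)
        used += length
        if used >= chars_per_page:
            # The budget ran out inside this block: finish the block,
            # then start a new page.
            pages.append(current)
            current = []
            used = 0
    if current:
        pages.append(current)  # leftover blocks form the last page
    return pages


print(paginate_blocks([40, 70, 30, 20, 90], 100))
```

Pages produced this way can overshoot the budget by up to one block,
which is the price of never cutting inside a <p>.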

Nesting behavior
----------------

If counting by chars is a pain, even counting elements is not easy.
Assume you have this:

 <a>
  <b>
   <a>
    <b>
     <a>
      <b/>
     </a>
    </b>
   </a>
  </b>
 </a>

and you want to paginate using one <b> per page, what do the pages look
like? Ok, I'll give you some space to think about it.





















Ok, here is my solution (but I'm not sure it's the best):

page 1:

 <a>
  <b>
   <a>
    <a/>
   </a>
  </b>
 </a>

page 2:

 <a>
  <a>
   <b>
    <a/>
   </b>
  </a>
 </a>

page 3:

 <a>
  <a>
   <a>
    <b/>
   </a>
  </a>
 </a>

I'm pretty sure the current code is buggy somewhere because, for deep
nesting like this, it loses some SAX events and ends up making the SAX
stream non-well-formed and choking the subsequent transformers which
are sensitive to well-formedness (such as XSLT).

Note: the above might look like a mental exercise to many, but if you
think about our Document DTD 1.1, you'll find nested <section> and
paginating those results in very similar problems. But I'm not sure if
the solution adopted above is meaningful for real-world pagination. I'm
open to suggestions on this.

Improving the concept
---------------------

One possible way to improve the concept is to count by XPath results,
that is, you might want to count by 'sections included in sections'.

Another way to improve the system is to provide booleans: you might
want to count 'sections AND chapters' (probably, XPath helps here as
well).
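
Both ideas can be tried out in Python with the limited XPath support of
xml.etree.ElementTree. This is only an illustration of the counting, not
a proposal for the pagesheet syntax; the helper names and the sample
document are made up, and 'sections AND chapters' is read here as
counting elements of either kind.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<document>"
    "<chapter><section><section/></section></chapter>"
    "<section/>"
    "</document>"
)


def count_matches(xml_text, path):
    """Count the elements selected by an ElementTree path expression."""
    root = ET.fromstring(xml_text)
    return len(root.findall(path))


def count_any(xml_text, names):
    """Count elements whose name is any of the given names."""
    root = ET.fromstring(xml_text)
    return sum(1 for elem in root.iter() if elem.tag in names)


# 'sections included in sections'
print(count_matches(DOC, ".//section/section"))
# 'sections AND chapters'
print(count_any(DOC, {"section", "chapter"}))
```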

Ok, anyway, hope this helps and sorry for taking so long to write this.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<[EMAIL PROTECTED]>                             Friedrich Nietzsche
--------------------------------------------------------------------


