This is a Random Thought. The ideas contained within are not fully
developed and are bound to have lots of holes. The idea is to promote
healthy discussion, so please, everyone, dive in and discuss.

The Problem
===========

Forrest is built on Cocoon, a web application framework, but "all" it
does is XML publishing. This means we have a monolithic web application
framework that is doing nothing more than managing a processing pipeline
and doing XSLT transformations.

Let me try to illustrate...

What Forrest Does
=================

Input -> Input Processing -> Internal Format -> Output Processing ->
Output Format

To do this we need to:

- locate the source document
- determine the format of the input document
- decide which input plugin to use
- generate the internal format using the input plugin
- decide what output plugin we need
- generate the output format using the output plugin
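
The steps above can be sketched as a trivial driver; all the names here
are illustrative, not real Forrest classes:

```java
/**
 * A minimal sketch of the pipeline described above:
 * Input -> Internal Format -> Output.
 * The interfaces and names are illustrative only.
 */
public class Pipeline {

  /** Hypothetical contract: turn a source document into the internal format. */
  interface InputPlugin { String toInternal(String source); }

  /** Hypothetical contract: turn the internal format into an output format. */
  interface OutputPlugin { String fromInternal(String internal); }

  /** Run one request through the pipeline: input -> internal -> output. */
  public static String process(String source, InputPlugin in, OutputPlugin out) {
    String internal = in.toInternal(source);  // input processing
    return out.fromInternal(internal);        // output processing
  }
}
```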

Let's look at each of these in turn.

Locate the source document
--------------------------

To do this we use the locationmap, which is Forrest technology.

Determine the Format of the Input Document
------------------------------------------

This is either done by:

a) guessing the source format based on the file extension
b) reading the source format from the document itself (SourceTypeResolver)

a) is a "standard" way of doing things, while b) is Forrest technology.

Decide which input plugin to use
---------------------------------

This is done by resolving the processing request via the Cocoon sitemap.
But why?

Each input type should only be processed by a single input plugin. There
should be no need for complex pipeline semantics to discover which
plugin to apply to a document; all we should need to do is look up the
type of document in a plugins table.

Generate the internal document
------------------------------

This is typically done by an XSLT transformation, but may be done by
calling the services of a third party library (e.g. Chaperon).

Both of these actions are easy to achieve through simple Java code.
However, we currently "benefit" from the fact that Cocoon transformers
are already implemented to do these transformations for us. It is true
that Cocoon provides a load of such transformers for us, but how many do
we actually use? How complex are they to write as a POJO? How complex
are they to write as a Cocoon transformer?

My point is that in this instance the Cocoon complexities are making it
harder for developers to get involved with Forrest and so they simply
don't get involved.

Decide what output plugin to use
--------------------------------

This is done by examining the requested URL. The actual selection of the
output plugin is done within the Cocoon sitemap. I have all the same
arguments here as I do for input plugins: this only needs to be a simple
lookup, not a complex pipeline operation.

Generate the output format
--------------------------

This is typically done by an XSLT transformation and/or by a third party
library (e.g. FOP). I have the same arguments here as I do for the
generation of internal format documents; in fact, the parts of Cocoon we
use are identical in both cases.

So why do we use Cocoon?
========================

We can see that we use Cocoon for:

- selecting the correct plugin to apply
- convenience of transformation from one format to another
- a nice pipeline implementation that allows the processing to be
streamed as SAX events rather than DOM processing
- An efficient caching mechanism

Let's look at each of these uses in reverse order:

Caching
-------

Cocoon's caching mechanism is pretty good, but it has its limitations
within Forrest. In particular, we have discovered that the locationmap
cannot be cached efficiently using the Cocoon mechanisms. This is now
one of the key bottlenecks in Forrest.

We could work with Cocoon on their caching mechanism, but there seems
little interest in this since our use case here is quite unusual. Of
course, we can do the work ourselves and add it to Cocoon. But why not
use a caching mechanism more suited to our needs?

SAX Events
----------
Although Cocoon was one of the first web frameworks to use this
technique, there are now many implementations of such pipeline
processing. We should therefore not consider ourselves tied to this
implementation. However, we do need to stick to streaming SAX events for
performance reasons.
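
For illustration, such a SAX pipeline needs nothing beyond the JDK. Here
is a hedged sketch (the class names are made up for this example) of a
one-stage filter chain that streams events from a parser, through a
filter, into a serializer:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class SaxPipelineDemo {

  /** A filter stage that renames every element, streaming all other events through. */
  static class RenameFilter extends XMLFilterImpl {
    RenameFilter(XMLReader parent) { super(parent); }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
      super.startElement(uri, "renamed", "renamed", atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
        throws SAXException {
      super.endElement(uri, "renamed", "renamed");
    }
  }

  /** Parse xml, stream it through the filter, and serialize the result. */
  public static String run(String xml) throws Exception {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setNamespaceAware(true);
    XMLReader reader = spf.newSAXParser().getXMLReader();
    RenameFilter filter = new RenameFilter(reader);

    StringWriter out = new StringWriter();
    // An identity transformer acts as the serializer at the end of the chain.
    TransformerFactory.newInstance().newTransformer().transform(
        new SAXSource(filter, new InputSource(new StringReader(xml))),
        new StreamResult(out));
    return out.toString();
  }
}
```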

Ready Made Transformations
--------------------------

The vast majority of our transformations are standard XSLT; there is no
magic in the Java code that does this. The remaining transformations
are handled by third party code that we can reuse in any context.

The *small* amount of code that we get to reuse by using Cocoon
Transformers is offset by the internal complexity of building new
transformers. Cocoon is designed as a web application framework and as
such it tries to be all things to all users. This has resulted in a
really complex internal structure to Cocoon.

This complexity makes it difficult for newcomers to get started in using
Forrest for anything other than basic XSLT transformations.

The end result is that we have only one type of user - those doing XSLT
transformations.

Plugin Selection
----------------

This is done through the sitemap. This is perhaps where the biggest
advantage of Cocoon in our context can be found. The sitemap is a really
flexible way of describing a processing path.

However, it also provides loads of stuff we simply don't need when all
we are doing is transforming from one document structure to another. This
makes it complex for new users (although having our own sitemap
documentation would help here).

Finally, as discussed in the previous section, we don't need a complex
pipeline definition for our processing; we just need to connect an input
plugin to an output plugin via our internal format, and that is it. We
have no need for all the sitemap bells and whistles.

Conclusion
----------

Cocoon does not, IMHO, bring enough benefits to outweigh the overhead of
using it.

That overhead is:

- bloat (all those jars we don't need)
- complex code (think of your first attempt to write a transformer)
- complex configuration (sitemap, locationmap, xconf)
- based on Avalon which is pretty much dead as a project

So Should We Re-Implement Forrest without Cocoon?
=================================================

In order to find an answer to this question, let's consider how we might
re-implement Forrest without Cocoon:

Locate the source document
--------------------------

We do this through the locationmap and can continue to do so. We would
need to write a new locationmap parser though. This would simply do the
following (note, no consideration of caching at this stage, but there
are a number of potential cache points in the pseudo code below):

import java.net.URL;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

/**
 * An entry in a locationmap that is used to resolve the location of a
 * resource. A Location is one or more possible locations, each
 * represented by a URL.
 */
public class Location {
  private List<URL> urls;

  /**
   * Create a location for a given match pattern that has multiple
   * possible source locations.
   * Each location will be tried in turn until a successful match is found.
   */
  public Location(Pattern matchPattern, SelectNode node) { ... }

  /**
   * Create a location for a given match pattern with a single
   * possible source location.
   */
  public Location(Pattern matchPattern, LocationNode node) { ... }

  /**
   * Look through the possible locations for a requested resource
   * and return the first matching location we have.
   * Returns null if no appropriate location is found.
   */
  public URL findURL(String request) { ... }
}

public class Locationmap {
  private List<Location> locations;

  /**
   * Record all match nodes in the locations list. Each location group
   * is keyed by the match pattern for that location match.
   */
  private void init() { ... }

  /**
   * Find the first valid location for a given request string.
   */
  public URL findURL(String request) {
    Iterator<Location> it = locations.iterator();
    while (it.hasNext()) {
      URL url = it.next().findURL(request);
      if (url != null) return url;
    }
    return null;
  }
}

Determine the Format of the Input Document
------------------------------------------

Determining the input format from the extension is bad. URLs are
supposed to be independent of the document source. It would be better to
use the MIME type, but this is not always configured correctly on
servers. Even when it is, it doesn't always give enough
information, for example with XML files. In this case, determining the
input format from the XML doctype is good, and we should continue to do
this.

I therefore propose that the non-XML resources and XML resources without
a schema definition should be resolved by an extension to the
locationmap syntax:

<map match="bar/**">
  <location src="http://someserver.com/foo/{1}" mime-type="bar"/>
</map>

In the absence of a mime-type attribute we will use the mime-type
returned by the request. In the event of an XML resource we will use the
schema definition as before. Of course, we can always fall back to the
file extension if nothing else tells us the correct format.

This means that in the vast majority of cases we will not need to define
the type of document.
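
The fallback chain could be as simple as this sketch (the class and
method names are assumptions, not existing Forrest code): an explicit
locationmap mime-type first, then the mime-type reported by the request,
then the XML doctype, and finally the file extension.

```java
public class SourceTypeResolver {

  /**
   * Resolve the input format using the fallback chain described above.
   * Any argument may be null if that information is unavailable.
   */
  public String resolve(String locationmapType, String requestMimeType,
                        String xmlDoctype, String fileExtension) {
    // 1. An explicit mime-type attribute in the locationmap wins.
    if (locationmapType != null) return locationmapType;
    // 2. Next, trust the mime-type returned by the request, if useful.
    if (requestMimeType != null
        && !requestMimeType.equals("application/octet-stream"))
      return requestMimeType;
    // 3. For XML resources, use the schema/doctype definition as before.
    if (xmlDoctype != null) return xmlDoctype;
    // 4. Fall back to the file extension if nothing else tells us.
    return fileExtension;
  }
}
```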


Decide which input plugin to use
---------------------------------

This is a simple lookup of the input format against the available
plugins. Therefore, a PluginFactory would do just fine here. This would
be configured by some external configuration system and plugins would be
loaded by a component manager such as Spring.

It is worth noting that the component manager configuration file is
likely to be sufficient for the plugin configuration file as well. So we
need not create yet another config file.
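
A minimal sketch of such a factory, assuming the plugin map is injected
by the component manager (the InputPlugin contract and all names here
are hypothetical):

```java
import java.util.Map;

/**
 * Sketch of the proposed PluginFactory. In practice the map would be
 * injected by a component manager such as Spring; here it is passed in
 * directly. The InputPlugin contract is hypothetical.
 */
public class PluginFactory {

  /** Hypothetical plugin contract: transform source into the internal format. */
  public interface InputPlugin {
    String toInternalFormat(String source);
  }

  private final Map<String, InputPlugin> pluginsByFormat;

  public PluginFactory(Map<String, InputPlugin> pluginsByFormat) {
    this.pluginsByFormat = pluginsByFormat;
  }

  /** A simple lookup: no pipeline semantics, just format -> plugin. */
  public InputPlugin forFormat(String inputFormat) {
    InputPlugin plugin = pluginsByFormat.get(inputFormat);
    if (plugin == null)
      throw new IllegalArgumentException("No input plugin for " + inputFormat);
    return plugin;
  }
}
```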

Generate the internal document
------------------------------

Since the plugins are now loaded via a component manager, our
transformation classes are POJOs that are independent of any particular
execution environment; there is no need to do anything
clever here.
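
For example, an input plugin that is just a standard JAXP XSLT
transformation could be a plain POJO along these lines (a sketch; the
class name and contract are assumptions):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/**
 * Sketch of an input plugin as a plain POJO: a standard JAXP XSLT
 * transformation with no framework involved.
 */
public class XsltInputPlugin {
  private final Transformer transformer;

  public XsltInputPlugin(String stylesheet) throws Exception {
    this.transformer = TransformerFactory.newInstance()
        .newTransformer(new StreamSource(new StringReader(stylesheet)));
  }

  /** Transform a source document into the internal format. */
  public String toInternalFormat(String source) throws Exception {
    StringWriter out = new StringWriter();
    transformer.transform(new StreamSource(new StringReader(source)),
                          new StreamResult(out));
    return out.toString();
  }
}
```

Note that a javax.xml.transform.Transformer is not thread-safe, so a
real implementation would pool transformers or compile a thread-safe
Templates object instead.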

Decide what output plugin to use
--------------------------------

This is handled exactly the same as the input plugins. That is, we
provide a factory that provides the relevant plugin for any given
request. But how do we know which plugin to use?

At the moment we base this on the request URL. This is convenient and
easy, but it limits the URL space unnecessarily. I don't propose we
remove this convenient way of doing things, but we can take the
opportunity to add something better. For example, when defining the
plugins in the container we can provide a set of match patterns that
should be applied to a request. The first plugin that matches is the one
selected. For example:

<bean id="fop" class="org.apache.forrest.plugin.output.fop">
  <property name="matchPattern" value="*.pdf|*/pdf|pdf.*"/>
</bean>

The above plugin would be applied to requests such as:

http://localhost:8888/foo/bar.pdf
http://localhost:8888/foo/bar/pdf
http://localhost:8888/pdf/foo/bar

Using this approach, in conjunction with the locationmap on the input
side, gives us full control over both the input and output URL space.
That is, we will no longer have to reserve specific URLs for specific
plugins, so conflicts between plugins from different providers can
be resolved by the user.
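
A sketch of that first-match selection (the class is hypothetical, and
treating '*' as a wildcard matching any characters is an assumption
about the pattern syntax, not a definition of it):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Sketch of first-match output plugin selection. Glob patterns like
 * those in the bean definition above are converted to regular
 * expressions; the first registered plugin whose pattern matches the
 * request wins.
 */
public class OutputPluginSelector {
  // Insertion order matters: the first matching plugin is selected.
  private final Map<String, Pattern> patternsByPlugin = new LinkedHashMap<>();

  /** Register a plugin with a '|'-separated list of glob patterns. */
  public void register(String pluginId, String matchPattern) {
    StringBuilder regex = new StringBuilder();
    for (String glob : matchPattern.split("\\|")) {
      if (regex.length() > 0) regex.append('|');
      // Escape literal dots, then turn '*' into "match anything".
      regex.append(glob.replace(".", "\\.").replace("*", ".*"));
    }
    patternsByPlugin.put(pluginId, Pattern.compile(regex.toString()));
  }

  /** Return the id of the first plugin matching the request, or null. */
  public String select(String requestPath) {
    for (Map.Entry<String, Pattern> entry : patternsByPlugin.entrySet()) {
      if (entry.getValue().matcher(requestPath).matches()) return entry.getKey();
    }
    return null;
  }
}
```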

Generate the output format
--------------------------

Again, our plugins are now POJOs, so there is no magic here: just load
the plugin controller and execute it.


But what about internal plugins?
--------------------------------

Internal plugins currently work by overriding existing matches in the internal sitemaps. This works just fine, as long as you only want to modify each match with a single plugin. It is possible to have conditional execution, but the sitemap was not designed for this and it gets really clunky. Furthermore, each extending plugin needs to be aware of the functionality of the one it is extending, so we end up with hard-wired dependencies between plugins.

In this new implementation there would be two ways of creating internal plugins. One would be to extend the Java classes that represent the core. This would be OK, but would suffer some of the same problems as the existing mechanism, although conditional execution would be easier to manage.

The second way to implement internal plugins would be through AOP, where new functionality can be inserted at certain points in the execution stack of the core. In this instance an unlimited number of plugins can be applied, each handling its own specific cases. There is no need for complex execution logic in each implementation, and each plugin would be completely independent of the others. [CAVEAT: I'm not too experienced with AOP, so I can't promise this last paragraph is 100% correct.]

What do we gain?
================

By getting rid of Cocoon, Forrest becomes much more lightweight and has considerably fewer dependencies (I estimate the core would be just a couple of megabytes, if that). As a result it becomes more embeddable.

We could still build a webapp allowing for live rendering of content; this would be a simple servlet wrapped around the core. We could embed Forrest in other web applications (Daisy, CMS, Wicket, Struts etc.) by providing a servlet filter or using AOP, and we could embed Forrest in desktop applications as well. (I'm sure there are other potential uses I have not considered.)

All this makes Forrest attractive to a much wider audience. We would no longer hear our users saying "it's just too big for a site generation tool". Possibly because it wouldn't be a site generation tool, it would be the publishing framework we already claim it is. Of course, it would be able to generate sites, just like it does today, but that would just be one of its uses.

As a result of all this we may find ourselves attracting more users who are willing to dive in and do something useful with the internals. This can only be a good thing; too few users are being converted to truly active developers for the project to survive. In my opinion this is because doing anything other than generating a website is just too damned difficult right now.

Of course, a CLI that produces a static rendering of a complete document object would be easy enough to implement, although wget or similar pointed at the servlet described above would suffice.

All in all, this makes Forrest much more useful in a single source publishing environment in which many users, with many different applications, see a document (which can be in many formats) in many different forms.

Note that this last sentence is very like our current 50 word project description, but it adds the "many different applications" part. Something I believe is a necessity in a framework (which is what Forrest is intended to be, even if our users see it as something different).

Finally, we get a much cleaner implementation that is easy to understand. Debugging is simplified and, since everything is now a POJO, we can use unit testing much more effectively.

The Drawbacks
=============

What are the drawbacks of getting rid of Cocoon?

Probably the biggest drawback would be that we have to code it. There's
not a huge amount of work to be done, but there are some neat things in
Cocoon that we would have to reimplement or find elsewhere. We may be
able to extract some of the code from Cocoon, but I'm not convinced this
is a good approach.

Because of this need to write the code, Forrest would initially take a
step backwards in terms of its functionality.

If we do it how do we manage it?
================================

Since the vast majority of our users do nothing more than generate a web site at present, I think that we can do this without any major changes to existing projects. However, I do not think it is a good idea to insist on backward compatibility.

I think the best approach would be to get the next release of Forrest out as a 1.0. Yes, that's right, *1.0*.

Forrest is stable, and there are almost no major changes in the next release that are backward incompatible with 0.7. So let's bite the bullet and get it out there.

Then we scale down development on the 1.0 branch, moving it into maintenance only. This new development would therefore be a new 2.0 trunk (yes, a new trunk, as in we have none of the existing code in there by default).

----

So is this interesting or not?

Ross