Re: [RT] A new Forrest implementation?

Ross Gardler Tue, 15 Aug 2006 03:47:12 -0700

Tim Williams wrote:

On 8/14/06, Ross Gardler <[EMAIL PROTECTED]> wrote:

This is a Random Thought. The ideas contained within are not fully
developed and are bound to have lots of holes. The idea is to promote
healthy discussion, so please, everyone, dive in and discuss.

...

I think the Cocoon community has recognized the monolithic-ness of the
framework.  Stefano brought it up[1] and I think the responses are
encouraging - though the maven promises leave *very* much to be
desired as it has effectively stopped me from even attempting to build
their trunk.

It has been discussed a great many times. Some progress has been made,
but I very much doubt it will happen in a time frame sufficient to help
Forrest. The thread you link to is certainly not the first that
highlighted this issue.

What Forrest Does
=================

Input -> Input Processing -> Internal Format -> Output Processing ->
Output Format

To do this we need to:

- locate the source document
- determine the format of the input document
- decide which input plugin to use
- generate the internal format using the input plugin
- decide what output plugin we need
- generate the output format using the output plugin

Let's look at each of these in turn



Oversimplified but we'll see where you go with this...

Please expand. Please add in the complexities that you see so that we
can examine them.

Locate the source document
--------------------------

To do this we use the locationmap, this is Forrest technology.



A lot of Avalon and Excalibur + a very little Cocoon for context, and
(all things considered) wrapped up by a very little bit of Forrest
code.  I'm just suggesting that we've done nothing but wrap some
stuff here - "forrest technology" is a stretch.  To recreate it, we
could get context elsewhere but we'd need an equivalent to
avalon/excalibur I think.

Come on, are you really claiming that we need Avalon+Excalibur+Cocoon to
create a hashmap of possible matches to any given string?

All we need is pattern matching followed by a lookup then a lookup. See
my pseudo code later in the original post. The *concept* of the
Locationmap is Forrest technology and it can be reproduced without any
of the baggage Cocoon requires us to bring along.
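
As a rough illustration of "pattern matching followed by a lookup", here
is a minimal Java sketch. It is not the pseudo code from the original
post and not the Forrest locationmap code; the class name, the method
names and the {1}-style placeholder syntax are all assumptions made up
for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Forrest locationmap code.
class LocationMapSketch {
    // Insertion order is kept so the first registered pattern that matches wins.
    private final Map<Pattern, String> entries = new LinkedHashMap<>();

    // Register a regular expression and a location template; {1}, {2}, ...
    // in the template are replaced by the corresponding capture groups.
    void addMatch(String regex, String locationTemplate) {
        entries.put(Pattern.compile(regex), locationTemplate);
    }

    // Step 1: pattern matching. Step 2: look up the template and fill it in.
    Optional<String> resolve(String request) {
        for (Map.Entry<Pattern, String> entry : entries.entrySet()) {
            Matcher m = entry.getKey().matcher(request);
            if (!m.matches()) {
                continue;
            }
            String location = entry.getValue();
            for (int i = 1; i <= m.groupCount(); i++) {
                // A group that did not take part in the match is treated as empty.
                String group = m.group(i) == null ? "" : m.group(i);
                location = location.replace("{" + i + "}", group);
            }
            return Optional.of(location);
        }
        return Optional.empty();
    }
}
```

With a made-up entry mapping docs/(.+)\.xml to
file:///content/xdocs/{1}.xml, a request for docs/index.xml would
resolve to file:///content/xdocs/index.xml, and a request that matches
no pattern would come back empty.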

Decide which input plugin to use
---------------------------------

This is done by resolving the processing request via the Cocoon sitemap.
But why?

Each input type should only be processed by a single input plugin, there
should be no need for complex pipeline semantics to discover which
plugin to apply to a document, all we should need to do is look up the
type of document in a plugins table.
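
The "plugins table" idea can be sketched in a few lines of Java. This is
only an illustration under stated assumptions: the class and method
names are invented, and both the source document and the internal
format are modelled as plain strings to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a plugins table keyed by input document type.
class InputPluginTable {
    // Each plugin turns source text into the internal format
    // (both are modelled as String here, which is a simplification).
    private final Map<String, Function<String, String>> plugins = new HashMap<>();

    // Enforces "one input type, one input plugin" by rejecting duplicates.
    void register(String documentType, Function<String, String> plugin) {
        if (plugins.putIfAbsent(documentType, plugin) != null) {
            throw new IllegalStateException("Type already handled: " + documentType);
        }
    }

    // A single table lookup decides which plugin processes the document.
    String toInternalFormat(String documentType, String source) {
        Function<String, String> plugin = plugins.get(documentType);
        if (plugin == null) {
            throw new IllegalArgumentException("No input plugin for: " + documentType);
        }
        return plugin.apply(source);
    }
}
```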



And aggregates?  The end result isn't from a single document but an
aggregate of multiple data URIs - at least that's the dispatcher
plan as I understand it.

All aggregates are about requesting multiple input sources and merging
them together. Therefore aggregates do not belong here, they belong in
the output plugin stage (so I'll come back to this later).

A cocoon transformer levies pretty minimal requirement: an
XMLConsumer/XMLProducer (easy and natural, sax event handlers and a
single method respectively) and some simple lifecycle contract methods
needed for being a part of the managed environment.

I really should have been talking about the complexities of writing a
generator, as we very rarely need to write transformers. Try writing a
generator that, for example, uses Hibernate to communicate with a
relational database.

I think being in some sort of managed environment (e.g.
Spring) is likely needed in any real-world approach.  So I'd turn this
around and ask where is the complexity?


First complexity: building Cocoon

Second complexity: building any component that has additional dependencies

Third complexity: deploying a new (non-trivial) component within a plugin

Fourth complexity: a community that is pulling in many different directions

There are many more but I will leave it at that. If you don't agree then
I suggest you actually try it before arguing the case. You can then tell
me where I am going wrong.

Of course, it can be argued that 1-3 are because Forrest was built
against a much older version of Cocoon and has failed to keep up (for
example, why are plugins not Cocoon blocks?). I would respond that this
is because of the fourth complexity.

So, then it can be argued that we should be contributing to Cocoon and
helping resolve the fourth complexity. That may be the outcome of this
RT, it may not.

Decide what output plugin to use
--------------------------------

This is done by examining the requested URL. The actual selection of the
output plugin is done within the Cocoon sitemap. I have all the same
arguments here as I do for input plugins, this only needs to be a simple
lookup, not a complex pipeline operation.
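
A minimal Java sketch of selecting an output plugin by examining the
requested URL might look like the following. It assumes that the file
extension alone identifies the output format; the class name and the
plugin labels used in the example are hypothetical.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: pick an output plugin from the requested URL.
class OutputPluginSelector {
    private final Map<String, String> pluginsByExtension = new HashMap<>();

    // Associate a file extension (without the dot) with a plugin name.
    void register(String extension, String pluginName) {
        pluginsByExtension.put(extension.toLowerCase(Locale.ROOT), pluginName);
    }

    // Examine the path of the URL and look the extension up in the table.
    Optional<String> select(String requestedUrl) {
        String path = URI.create(requestedUrl).getPath();
        if (path == null) {
            return Optional.empty();
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot <= slash) {
            // The last path segment has no extension.
            return Optional.empty();
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(pluginsByExtension.get(extension));
    }
}
```

Anything beyond the extension (user profiles, content negotiation and
so on) is deliberately left out of this sketch.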



I get the feeling you're basing this on the simplest use-case
imaginable.  The output plugin is about the format of the output not
the content of the output.  The sitemap benefits here allow for more
complex processing (e.g. user profiling, smart content delivery, etc.)

I disagree. The sitemap is a way of *configuring* this complex
processing, it is not the processing itself. The sitemap has become an
XML programming language and I hate it for that reason.

Have you ever dived in to the implementation and tried to do anything
useful in there?

The fact that the sitemap had become a programming language is one
reason why Cocoon came up with the flow engine (e.g. to get rid of
actions). But if you use the flow engine then you are programming with
JavaScript, it's only a small step from there to Java. So are there any
benefits in using JavaScript over Java?


In my opinion the answer is a resounding no, at least for our use case.

Generate the output format
--------------------------

This is typically done by an XSLT transformation and/or by a third party
library (e.g. FOP). I have the same arguments here as I do for the
generation of internal format documents, in fact the parts of Cocoon we
use are identical in both cases.



Yeah, output is just a transformer.  Same thoughts as above.


OK, back to aggregation since I argued earlier that it belongs here.

Aggregation is nothing more than the collation of a number of resources
in response to a single request. It turns a single request into a number
of requests. Each individual request is handled just like any other
request. So what you have is a locationmap something like this:


<map match="foo/bar/**">
  <aggregate>
    <location src="..." required="true"/>
    <location src="..." required="false"/>
  </aggregate>
</map>
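
To make the "one request becomes several requests" idea concrete, here
is a small Java sketch of how an aggregate entry like the one above
could be processed. The types and names are invented for illustration,
and the meaning given to the required attribute (a missing required
source is an error, a missing optional source is skipped) is an
assumption, since the thread does not spell it out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of processing an <aggregate> entry.
class AggregateSketch {
    // One <location> element: a source plus its "required" flag.
    static final class Location {
        final String src;
        final boolean required;

        Location(String src, boolean required) {
            this.src = src;
            this.required = required;
        }
    }

    // Each location is fetched through the same function that would
    // handle an ordinary, non-aggregated request.
    static List<String> aggregate(List<Location> locations,
                                  Function<String, Optional<String>> fetch) {
        List<String> parts = new ArrayList<>();
        for (Location location : locations) {
            Optional<String> content = fetch.apply(location.src);
            if (content.isPresent()) {
                parts.add(content.get());
            } else if (location.required) {
                // Assumed semantics: a missing required source aborts the request.
                throw new IllegalStateException("Missing required source: " + location.src);
            }
            // Assumed semantics: a missing optional source is silently skipped.
        }
        return parts;
    }
}
```

How the collected parts are merged into one document is left open here;
that would be the job of the output stage.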

Caching
-------

Cocoon's caching mechanism is pretty good, but it has its limitations
within Forrest. In particular, we have discovered that the Locationmap
cannot be cached efficiently using the Cocoon mechanisms.



This may be true. We had a novice working on LM caching at the time
and I've learned quite a bit since then.  I'd like to re-evaluate this
before I'm willing to agree with such a bold statement.

This illustrates my point exactly. I looked at this too and also failed
to get a better solution.

The reason I failed (and I guess the same for you) is that the code is
just so complex and jumbled that it's next to impossible to find one's
way around once one gets past the API.

This is now
one of the key bottlenecks in Forrest.



Based on?  I'd like to see this profiling data.  Knowing that the LM
is our way ahead I've been worried about squeezing every ounce where
we could but I was still under the impression that it isn't a
consequential performance bottleneck.

Try building the Cocoon docs. It's set up on a Forrestbot in our zone.
Even when co-located on the same physical machine as the source for the
content it takes over 30 minutes to build. It really is a horrible
solution.

If you want to profile it then you can get the forrest site from the
Cocoon-Whiteboard.

This is an extreme example case, but one that is quite common in my
experience using Forrest to do real document processing (as opposed to
web site generation).

We could work with Cocoon on their caching mechanism but there seems
little interest in this since our use case here is quite unusual. Of
course, we can do the work ourselves and add it to Cocoon. But why not
use a caching mechanism more suited to our needs?



So it's not 100% suitable so it's worthless?  It fits in 98% of our
needs so I don't see this as a compelling argument.

That's unfair. I'm saying it is not perfect, therefore it is not
necessary to use it. I did not say it is not perfect so let's get rid of
it. Please take this in the context of all the other problems I am
highlighting rather than considering it as a single point.

Besides, it doesn't work for the locationmap, so in fact it is not used
in some of the processing of every single request we make. That's
considerably more than "2%".
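
For comparison, a cache aimed only at locationmap lookups could be as
small as the Java sketch below. This is an illustration of the general
idea of memoising resolved locations, not a proposal tested against
Forrest and not a statement about why the Cocoon mechanism falls short
here; the class name and the invalidate-everything policy are
assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: memoise locationmap resolution per request string.
class CachedResolver {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> resolver;

    // 'resolver' is whatever performs the uncached lookup.
    CachedResolver(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    // The underlying resolver runs only on the first request for a key.
    String resolve(String request) {
        return cache.computeIfAbsent(request, resolver);
    }

    // Assumed policy: drop everything whenever the locationmap changes.
    void invalidate() {
        cache.clear();
    }
}
```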

Ready Made Transformations
--------------------------

...

You seem to be
suggesting that Cocoon requires some big overhead to do transforms and
that's simply not the case.

That's right, I call 40MB of bloat a fairly big overhead for doing XSLT
transformations.

This time I really am oversimplifying, but I hope you see my point -
certainly that is how my customers see it. As a result I ended up, in
most cases, writing a series of Java components that I wired together
manually and plugged directly into whatever framework they were using.
This RT is about doing this in a more flexible and reusable way.

This complexity makes it difficult for newcomers to get started in using
Forrest for anything other than basic XSLT transformations.

...

 My point is that newcomers are
going to find it difficult to deal with any framework that attempts to
achieve anything beyond the simplistic.

Yes, but if the framework is designed to do one job (publishing in our
case) then it is simpler to understand than if it is designed to do
every job (as with Cocoon).

The end result is that we have only one type of user - those doing XSLT
transformations.

Plugin Selection
----------------

This is done through the sitemap. This is perhaps where the biggest
advantage of Cocoon in our context can be found. The sitemap is a really
flexible way of describing a processing path.

However, it also provides loads of stuff we simply don't need when all
we are doing is transforming from one document structure to another. This
makes it complex to new users (although having our own sitemap
documentation would help here).

Finally, as discussed in the previous section, we don't need a complex
pipeline definition for our processing, we just need to plug an input
plugin to an output plugin via our internal format and that is it. We
have no need for all the sitemap bells and whistles.



I'm struggling to figure out what you think is forcing us into our
current apparently overly complex solution.  Is it the sitemap grammar
that is complex?

Not the grammar itself (although I do hate the fact that we are now
programming using the sitemap). The complexity is in the processing of
that grammar, which results in the selection of the processing path to
take.

All we need to do is select the right plugins and make them work
together. Look at how many internal pipeline requests there are to do
this in Forrest now (it's even worse if we use the dispatcher).


This is overly complex for what is ultimately a couple of lookups.

Learning curves aside, I'd rather sit on top of a framework that
supports a more complex solution than is my current problem because
experience has shown me that the initial requirements grow and I don't
want to have to port when that growth happens.

This is exactly why I hate "catch all" frameworks. They try to be all
things to all people. I prefer to use what I need now and look at
expanding things when I find a use case that requires it. How can you
know in advance that the framework you choose is going to be adequate
for the job in hand? How do you know you won't need Struts, or Ruby on
Rails, or Wicket or SpringMVC or whatever?

This is personal opinion and we should really leave it at the door.
Different people for different things. Our job is to decide what is best
for the project, not for us as individuals. I'll just leave you with one
thought...

If I'm going hiking I do not struggle carrying a family tent on my back
just because I may have some more children at some point in the future.

Conclusion
----------

Cocoon does not, IMHO, bring enough benefits to outweigh the overhead of
using it.

That overhead is:

- bloat (all those jars we don't need)



this is going to be addressed with maven (argghhh) and/or osgi someday
- it's a recognized issue by many cocooners.


"someday" is the operative word there. I've been waiting too long.

If we reject this RT based on this argument then I want to see Forrest
developers helping Cocoon sort this out rather than standing by waiting
for it to happen.

- complex code (think of your first attempt to write a transformer)



I've never written a transformer.  I suspect that I could do it in a
day or less though depending upon the requirements.  It's simply
implementing XMLConsumer by handling SAX events, not that
extraordinary for a SAX-stream-based framework.  How do the many other
pipeline frameworks do transforms if not by handling SAX events?

Yes, transformers are simple. I should have picked non-trivial
generators as discussed above, especially since this is a more common
requirement in the real world. That is, we need input plugins to
interface with existing corporate legacy code.

- complex configuration (sitemap, locationmap, xconf)



Like component managers nowadays, we've failed to strike a good
balance between flexibility (configurability) and ease of use.

I really can't agree with the "like component managers nowadays" part.
Have you actually worked with something like Spring? It is unbelievably
simple.

- based on Avalon which is pretty much dead as a project



They are at least partially migrated to Spring for management
purposes.   I understood that as a move to eventually migrate fully
from Avalon to Spring.

Don't be fooled by the "headlines". Look into the code. Until the Avalon
jars are gone then my point stands. Until someone here gets into the
Cocoon code and starts trying to disentangle things then my point
stands.

Why don't I do that? I have other things to do, I need Forrest to be
useful, I don't use, and have never used, Cocoon independently of
Forrest (at least not commercially).

So Should We Re-Implement Forrest without Cocoon?
=================================================

In order to find an answer to this question let's consider how we might
re-implement Forrest without Cocoon:

Locate the source document
--------------------------

We do this through the locationmap and can continue to do so. We would
need to write a new locationmap parser though.  This would simply do the
following (note, no consideration of caching at this stage, but there
are a number of potential cache points in the pseudo code below):

Assumes that matching and selection have already been implemented
somewhere?

Yes, the way I see it, regular expressions are pretty standard and well
supported.

...

Generate the internal document
------------------------------

Since the plugins are now loaded via a component manager our
transformation classes are POJOs that are independent of any particular
execution environment, therefore, there is no need to do anything
clever here.



I don't understand.  They need input/output contracts, right?  There
aren't standards defined for such things so it is execution
environment dependent.  The concept of a POJO is honestly really gray
to me.  I view Cocoon's transformation classes as POJO's.  I've tried
to grasp this POJO concept before and gotten lost. The Java community
certainly has a knack for the creation of buzzwords with blurry
meaning.

I'm not really using POJO in the correct context here. All a plugin
needs is a method to do its stuff. This could be called "execute". The
input would be a SAX stream (for which there are multiple standard
implementations), the output would also be a SAX stream.

There is no dependency on anything else. Even the container manager in
use would be independent of the plugins and could be replaced at any
time.
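
One possible shape for such a plugin contract, using only the SAX
classes that ship with the JDK, is sketched below. The interface name,
the exact signature of "execute", and the example plugin are all
assumptions made for illustration; nothing here is existing Forrest or
Cocoon API.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical contract: one method, SAX events in, SAX events out.
interface Plugin {
    void execute(XMLReader input, InputSource source, ContentHandler output) throws Exception;
}

// Example plugin: forwards every event, upper-casing element names on the way.
class UpperCasePlugin implements Plugin {
    @Override
    public void execute(XMLReader input, InputSource source, ContentHandler output) throws Exception {
        XMLFilterImpl filter = new XMLFilterImpl(input) {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts)
                    throws SAXException {
                super.startElement(uri, local.toUpperCase(Locale.ROOT),
                        qName.toUpperCase(Locale.ROOT), atts);
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                super.endElement(uri, local.toUpperCase(Locale.ROOT),
                        qName.toUpperCase(Locale.ROOT));
            }
        };
        filter.setContentHandler(output);
        filter.parse(source);
    }
}

// Small driver: runs a plugin over an XML string and records the element names it emits.
class PluginDemo {
    static String run(Plugin plugin, String xml) {
        StringBuilder seen = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                seen.append(qName).append(' ');
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            plugin.execute(reader, new InputSource(new StringReader(xml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("Plugin failed", e);
        }
        return seen.toString().trim();
    }
}
```

The plugin depends only on the standard SAX interfaces, so whatever
loads and wires it (Spring or anything else) stays outside the plugin
itself.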

So is this interesting or not?



Not so far...  I'm not convinced.  I think you're implicitly
describing an oversimplified use-case, overstating the complexity of
Cocoon, and glossing over what we get from Cocoon.  More to come...

Tim, you have argued against my points, are there none that you see
merit in? It would be helpful if you could highlight any points that you
feel are valid, even just by saying "yes, OK". This will enable us to
pull the good stuff out of this thread and to let the bad stuff just rot
away.

I would also like you to provide examples of why I am oversimplifying
things here. I do not believe I am, but I have the benefit of knowing
everything I am trying to say. If you can highlight some specific
problem use cases I can address them directly and/or work on a solution
for those cases (just like I have done with aggregation above).


Thanks for your feedback.

Ross
