Andreas Hochsteger wrote:
Hi Cocooners!

Sorry for this (very) long proposal below, but I think it's definitely worth a read. If not, at least you can give me some feedback about your opinion ;-)
Andreas,

thanks for taking the time to write this. It is very much appreciated. See my personal comments inline. NOTE: they are 'personal' comments and must be treated as such; they never represent the Cocoon development community, only my personal vision of things.

[snip]

WARNING:
I have to say that this proposal is intended for open-minded people only, who aren't afraid to take a look beyond the limits.
I think I can state that I'm not afraid to look beyond limits, especially my own, especially those I can't see until others point them out to me. At the same time, I prefer not to turn off my 'critical mode' while I do so. Please don't misinterpret this as fear of going forward, but as caution while doing so.

[snip]

3 Introduction
==============

I like the Cocoon pipeline processing concept very much.
I like it so much that I think it is a pity to limit it to XML processing only (although I agree that this is the most interesting application).
These two sentences are antithetical and/or imprecise.

The Cocoon pipeline model is different from the more general Pipe&Filters design pattern because it deals with structured data, unlike P&F, which deals with unstructured data.

The Cocoon pipeline is *not* literally limited to XML. It is entirely possible to have non-well-formed XML content flow through the pipeline (even if this is avoided as a general pattern).

It is correct to say that cocoon pipelines are limited to SAX events and SAX events are a particular kind of structured data.

With these corrections, you are basically stating that limiting pipelines to a particular type of structured data is limiting.

While I understand your concept, I strongly disagree: SAX provides a multidimensional structured data space which is suitable for *any* kind of data structure.

True, maybe not as efficiently as other formats, but removing a fixed contract between pipeline components would require a pluggable and metadata-driven parsing/serialization stage between each pair of components.

I don't see any value of this compared to the current approach of SAX adaptation of external data to the internal model.

I'm sure some of you wanted to be able to build applications the same way Unix shell pipes work. Cocoon was a big step in this direction, but it was only applicable to processing XML data.
*only XML* is misleading; *based on SAX* is the accurate phrasing. I've never perceived this as a limitation, but as a paradigm shift.

Topologically speaking, the solution space is rotated, but its size is not reduced.

There are so many cases where pipeline processing of data (no matter whether it is XML, plain text or binary data) is done today, but we lack a generic and declarative way to unify these processing steps. Cocoon is best suited for this task through its clean, easy to understand, yet powerful pipeline concept.
If you want to create pipelines for general data, why use Cocoon? Just use Unix pipes, or servlet filters, or Apache 2.0 modules, or any type of 'byte-oriented' (thus unstructured-data) pipe&filters module.

If you remove the structure from the data that flows through the pipeline, Cocoon will no longer be Cocoon. This is not evolution, it is extinction.

4 Pipeline Types
================

I tried to design several pipeline variants, but after thinking a while they were all still too limited for the way I wanted them to work.

So here's another try by giving some hypotheses first:
1. A pipeline can produce data
2. A pipeline can consume data
3. A pipeline can convert data
4. A pipeline can filter data
5. A pipeline can accept a certain data format as input
6. A pipeline can produce a certain data format as output
7. Pipeline components follow the same hypotheses (1-6)
8. Only pipeline components with compatible data formats can be arranged next to each other
Ah, here you hint that you don't want to remove the structured-ness of the pipeline data; you just want to add *other* data structures besides SAX events.

Ok, this is worth investigating.

Based on these hypotheses you can construct pipelines, which just consume data, just produce data, both consume and produce data or even neither consume nor produce data (even this can make sense, as you'll see in section "9.5 Action Pipelines").
I think these hypotheses are simple enough to understand and flexible enough to base this further proposal on. So let's try ...
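The eight hypotheses above can be sketched in a few lines. This is only a minimal illustration (the names `Component` and `can_follow` are mine, not part of any proposed Cocoon API), shown here in Python for brevity:

```python
class Component:
    """A pipeline component declaring the data format it accepts and emits.
    'none' means the component consumes no input / produces no output
    (hypotheses 1, 2, 5 and 6)."""
    def __init__(self, name, input_format, output_format):
        self.name = name
        self.input_format = input_format
        self.output_format = output_format

def can_follow(first, second):
    """Hypothesis 8: only components with compatible data formats can be
    arranged next to each other (exact match here; inheritance between
    formats is refined in section 5.2)."""
    return first.output_format == second.input_format

# A producer, a converter and a consumer wired as in the proposal:
producer  = Component("uri-producer", "none", "/text/xml")
converter = Component("xhtml-converter", "/text/xml", "/text/xml/xhtml")
consumer  = Component("file-consumer", "/text/xml/xhtml", "none")

assert can_follow(producer, converter)
assert can_follow(converter, consumer)
assert not can_follow(producer, consumer)  # incompatible formats
```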

To define a pipeline we need to be able to specify the input and output format.
We can do this by the help of these two attributes:
- input-format="..."
- output-format="..."

They additionally specify the default input format for the first processing component and the default output format for the last processing component.

Example:
<map:pipeline input-format="format1" output-format="format2">
...
</map:pipeline>

This pipeline consumes the data format "format1" and produces the data format "format2". Which data formats are possible and how they are specified is shown in the next section.


5 Data Formats
==============

With "data format" I mean something like XML, plain text, png, mp3, ...
I'm not yet really sure here, how we should specify data formats, so I'll try to start with some requirements:
1. They should be easy to remember and to specify ;-)
2. It should be possible to create derived data formats (-> inheritance)
3. It should be possible to specify additional information (e.g. MIME type, DTD/Schema for XML, ...)
4. Pipelines which accept a certain data format as input can be fed with derived data formats
5. We should not reinvent standards, which are already suited for this task (but I fear, there does not yet exist something suitable)
You are asking for a very abstract parsing grammar. Note, however, that it is pretty easy to point to examples where these grammars would have to be so complex that maintaining them would be a nightmare.

Think of a BNF-like grammar that is able to explain concepts like XML namespacing or HyTime Architectural Forms.

To make it easier for us to begin with the task of defining data formats, let's assume we have three basic data formats called "abstract", "binary" and "text". The format "abstract" will be explained later, but "binary" and "text" should be clear to everyone.
Binary and text are unstructured data streams. You are falling back.

5.1 Data Format Definition
--------------------------

Here's a try to specify a hierarchy of data formats:

<data:formats>
<!-- #### Super data format #### -->
<!--
The following format is the base for all other formats (-> compare to java.lang.Object)
Although it is called the 'any' data format, this name is not prepended to derived data formats, as is the case for all other formats -->
<data:format name="any" impl="org.apache.cocoon.data.handler.text.DefaultHandler">
<data:param-def name="mime-type" default="application/octet-stream"/>
<data:param-def name="spec" default=""/> <!-- URL to the specification of this data format -->
</data:format>

<!-- #### Abstract data formats #### -->
<data:format name="abstract" impl="org.apache.cocoon.data.handler.abstract.DefaultHandler"/>
<data:format name="image" extends="/abstract" impl="org.apache.cocoon.data.handler.abstract.ImageHandler">
<data:param-def name="depth" default=""/>
<data:param-def name="width" default=""/>
<data:param-def name="height" default=""/>
</data:format>
<data:format name="music" extends="/abstract" impl="org.apache.cocoon.data.handler.abstract.MusicHandler">
<data:param-def name="channels" default=""/>
</data:format>
<data:format name="sound" extends="/abstract" impl="org.apache.cocoon.data.handler.abstract.SoundHandler">
<data:param-def name="samplesize" default=""/>
<data:param-def name="samplerate" default=""/>
<data:param-def name="channels" default=""/>
</data:format>
<!--
Multiple inheritance is used for video, which extends image and sound.
Is there a better way to specify multiple base formats? -->
<data:format name="video" extends="/abstract/image /abstract/sound" impl="org.apache.cocoon.data.handler.abstract.VideoHandler">
<data:param-def name="framerate" default=""/>
</data:format>
<data:format name="vector" extends="/abstract" impl="org.apache.cocoon.data.handler.abstract.VectorHandler">
<data:param-def name="unit" default=""/>
<data:param-def name="width" default=""/>
<data:param-def name="height" default=""/>
</data:format>
<data:format name="3d" extends="/abstract/vector" impl="org.apache.cocoon.data.handler.abstract.3DHandler">
<data:param-def name="depth" default=""/>
</data:format>

<!-- #### Binary based data formats #### -->
<data:format name="binary" impl="org.apache.cocoon.data.handler.binary.DefaultHandler">
<data:param-def name="endian" default="little"/>
</data:format>

<!-- MS OLE based data formats -->
<data:format name="ole" extends="/binary" impl="org.apache.cocoon.data.handler.binary.ole.DefaultHandler"/>
<data:format name="msword" extends="/binary/ole" impl="org.apache.cocoon.data.handler.binary.ole.MSWordHandler"/>
<data:format name="msexcel" extends="/binary/ole" impl="org.apache.cocoon.data.handler.binary.ole.MSExcelHandler"/>

<!-- Linux ELF based data formats -->
<data:format name="elf" extends="/binary" impl="org.apache.cocoon.data.handler.binary.elf.DefaultHandler">
<data:param-def name="architecture" default="x86"/>
</data:format>
<data:format name="executable" extends="/binary/elf" impl="org.apache.cocoon.data.handler.binary.elf.ExecutableHandler"/>
<data:format name="shared" extends="/binary/elf" impl="org.apache.cocoon.data.handler.binary.elf.SharedLibraryHandler"/>

<!-- #### Text based data formats #### -->
<data:format name="text" impl="org.apache.cocoon.data.handler.text.DefaultHandler">
<data:param-def name="encoding" default="UTF-8"/>
<data:parameter name="mime-type" value="text/plain"/>
</data:format>
<data:format name="xml" extends="/text" impl="org.apache.cocoon.data.handler.xml.DefaultHandler">
<!-- this handler deals with SAX events inside the pipeline -->
<data:param-def name="schema-type" default="xsd"/> <!-- other possible values: dtd, rng, schematron, ... -->
<data:param-def name="schema" default=""/>
<data:parameter name="mime-type" value="text/xml"/>
</data:format>
<data:format name="xhtml" extends="/text/xml" impl="org.apache.cocoon.data.handler.xml.XHTMLHandler">
<data:parameter name="mime-type" value="text/html"/>
<data:parameter name="schema" value="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
</data:format>
</data:formats>

It's just a first sketch, but I think you get the idea.

Above you can see the super data format 'any', some abstract, text and binary data formats, which show you how to specify inherited data formats. If no extends="..." attribute is given, it is automatically derived from the data format 'any'.

References to data formats are made using a path which identifies the respective data format. This path is built by appending the data format's name to the path of its parent format, separated by a slash. The super data format is an exception to this rule: it is just called 'any' and is not part of the path of derived data formats, to keep them shorter. It is also possible to use relative data format paths. E.g. if a pipeline consumes /text/xml and a converter generates XHTML from it, the converter can use output-format="xhtml" instead of output-format="/text/xml/xhtml". The name 'any' is reserved for the super data format only, and it is not allowed to name derived data formats after it.
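The path scheme just described can be made concrete with a small resolver. This is only an illustrative sketch of the rules stated above (absolute paths start with '/', 'any' stands alone, other names are relative to the format currently in the pipeline); the function name is hypothetical:

```python
def resolve(format_ref, context_path):
    """Resolve a data-format reference against the format currently
    flowing through the pipeline. Absolute references and the reserved
    name 'any' are returned as-is; anything else is appended to the
    context path."""
    if format_ref == "any" or format_ref.startswith("/"):
        return format_ref
    return context_path.rstrip("/") + "/" + format_ref

# The XHTML example from the text: the converter may say just "xhtml".
assert resolve("xhtml", "/text/xml") == "/text/xml/xhtml"
assert resolve("/text/xml", "/text") == "/text/xml"
assert resolve("any", "/binary/elf") == "any"
```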

'none' is another reserved name, used when a pipeline does not consume data (input-format="none") or does not produce data (output-format="none"). It is the default for all pipelines, if not overridden by the pipelines or their components.


The definitions from above can be referenced by using the following strings to specify data formats:

- any
- /abstract/image
- /abstract/music
- /abstract/sound
- /abstract/video
- /abstract/vector
- /abstract/vector/3d
- /binary
- /binary/ole
- /binary/ole/msword
- /binary/ole/msexcel
- /binary/elf
- /binary/elf/executable
- /binary/elf/shared
- /text
- /text/xml
- /text/xml/xhtml

See section "16.1 Data Formats" for more examples.

One enhancement of this scheme might be useful: the specification of version numbers or format variants.
One way would be to append the version number to the end, separated by a slash, but I think this would mix different concerns. My suggestion is to append the version information in brackets, as follows:

- /text/xml/xhtml[1.0]
- /text/xml/xhtml[1.1]

Instead of:

- /text/xml/xhtml/1.0
- /text/xml/xhtml/1.1
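The bracketed form keeps the version concern out of the path itself, and splitting the two apart is trivial. A sketch (the function name is mine, not part of the proposal):

```python
import re

def split_version(ref):
    """Split an optional bracketed version off a data-format path,
    e.g. '/text/xml/xhtml[1.1]' -> ('/text/xml/xhtml', '1.1').
    Returns None for the version when no brackets are present."""
    match = re.fullmatch(r"(.+?)(?:\[([^\]]+)\])?", ref)
    return match.group(1), match.group(2)

assert split_version("/text/xml/xhtml[1.0]") == ("/text/xml/xhtml", "1.0")
assert split_version("/text/xml/xhtml") == ("/text/xml/xhtml", None)
```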


5.2 Inheritance
---------------

A pipeline which consumes a certain data format can be fed with derived data formats too.
Take the following pipeline as example:

<map:pipeline input-format="/text/xml">
...
</map:pipeline>

This pipeline would consume the data format "/text/xml/xhtml" without problems, but leads to an exception if you feed it with the data format "/text".
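Because format paths encode the inheritance chain, this acceptance rule reduces to a path-prefix check, at least for single inheritance (multiple inheritance, as with 'video' above, would need a real format graph). A sketch with a hypothetical function name:

```python
def accepts(declared_input, offered):
    """A pipeline declaring `declared_input` also accepts any format
    derived from it, i.e. any path extending the declared path.
    The root format 'any' accepts everything."""
    if declared_input == "any":
        return True
    return offered == declared_input or offered.startswith(declared_input + "/")

assert accepts("/text/xml", "/text/xml/xhtml")  # derived format: consumed
assert accepts("/text/xml", "/text/xml")        # exact match: consumed
assert not accepts("/text/xml", "/text")        # base format: exception
assert accepts("any", "/binary/elf")
```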


5.3 A word about MIME Types
---------------------------

If you ask why I don't use the standardized MIME types (see [2]) to specify data formats, I can give you the following reasons:
MIME types fulfill the requirements above only partly. They support just two levels of classification and they are purpose-oriented, whereas the data formats I suggest are content-oriented (/text/xml/svg vs. image/svg-xml). So both serve different purposes.

I know the importance of supporting the MIME type standard, and so the parameter 'mime-type' is part of the super data format 'any' and thus is available for every other data format too. By specifying a certain data format, you always have a MIME type associated, in the worst case the MIME type from the super data format 'any' (application/octet-stream) is used.
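This fallback to 'any' can be modelled as a walk up the format path. The registry below is a hypothetical stand-in for the mime-type parameters in the definitions of section 5.1:

```python
# Formats that pin a mime-type; every other format inherits from its
# nearest ancestor, falling back to the root 'any' in the worst case.
MIME = {
    "any": "application/octet-stream",
    "/text": "text/plain",
    "/text/xml": "text/xml",
    "/text/xml/xhtml": "text/html",
}

def mime_type(path):
    """Return the MIME type of a format, inherited from the nearest
    ancestor format that defines one."""
    while path and path != "any":
        if path in MIME:
            return MIME[path]
        path = path.rsplit("/", 1)[0]  # step up to the parent format
    return MIME["any"]

assert mime_type("/text/xml/xhtml") == "text/html"
assert mime_type("/binary/ole/msword") == "application/octet-stream"
```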
From what I see so far, you are describing nothing different (from an architectural point of view) from what we already have.

5.4 Data Handlers
-----------------

I'm not very sure what the data handlers actually do, but I can think of two options: either define an interface which must be implemented by the pipeline components that operate on a certain data format (do we need two handlers here, an input-handler and an output-handler?), or make them concrete components which pipeline components can use to consume or produce this data format. I think some discussion on this topic would not be bad.
Here you hit the nerve.

If you plan on having a different interface of data-handling for each data-type (or data-type family), the permutation of components will kill you.

5.5 Data Format Determination
-----------------------------

In many cases I've written the input- and output-format along with the pipeline components, but it is also possible to specify them in the <map:components/> section, or implicitly by implementing a certain component interface and therefore omitting them in the pipeline.

Here's a suggested order of data format determination:

1. Input-/output-Format specified directly with a pipeline component
<map:produce type="uri" ref="docs/file.xml" output-format="/text/xml"/>
2. Input-/output-Format specified by the component declaration
<map:filters>
<map:filter name="prettyxml" input-format="/text/xml" output-format="/text/xml" ... />
</map:filters>
3. Output-/input-Format specified by the previous or following pipeline component
<map:produce type="uri" ref="docs/file.xhtml" output-format="/text/xml/xhtml"/>
<!-- input- and output-format="/text/xml/xhtml" from previous pipeline component -->
<map:filter type="prettyxml"/>
4. Input-/output-Format specified directly with a pipeline
<map:pipeline input-format="/text/xml" output-format="/text/xml">
<map:filter type="prettyxml"/>
...
</map:pipeline>
5. If nothing from above matches then assume "none".
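The suggested order is a plain precedence chain. Sketched below with hypothetical argument names, one per source in the list above (None meaning "not specified at that level"):

```python
def effective_format(on_component, on_declaration, from_neighbour, on_pipeline):
    """Determine a component's input or output format by the suggested
    precedence: component attribute, then component declaration, then
    the adjacent component, then the pipeline, then 'none'."""
    for candidate in (on_component, on_declaration, from_neighbour, on_pipeline):
        if candidate is not None:
            return candidate
    return "none"

# E.g. a filter with nothing set on the component or its declaration
# takes the format of the preceding producer (case 3):
assert effective_format(None, None, "/text/xml/xhtml", None) == "/text/xml/xhtml"
# A format set directly on the component wins over everything else (case 1):
assert effective_format("/text/xml", None, "/text/xml/xhtml", None) == "/text/xml"
assert effective_format(None, None, None, None) == "none"
```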
eheh, I wish it was that easy ;-)

Suppose you have a component that operates only on the svg: namespace of a SAX stream: what is its input type?

if data types were monodimensional, the above would be feasible, but Cocoon pipelines are *already* multi-dimensional, and the above can't possibly work (this has been discussed extensively before, for pipeline validation)

6 Pipeline Components
=====================
[snip]

Assuming you have several structured pipelines:

- SAX -> all xml/sgml content
- output/input streams -> unstructured text/binary
- OLE -> all OLE-based files (word, excel, blah blah)
- MPEG -> all MPEG-based framed multimedia (MPEG1/2, mp3)

why would you want to mix them into the same system?

I mean, if you want to apply structured-pipeline architectures to, say, audio editing, you are welcome to do so, but why in hell should Cocoon have to deal with this?

You are very close to win the prize for the FS-award of the year :)

It *would* make sense to add these complexities only if processing performed in different realms could be interoperated. But I can't see how.

what does it mean to perform an XSLT transformation on a video stream?

what does it mean to perform audio mixing on an email?

It would not make any sense to add functionality inside Cocoon that does not belong in the realm of its problem space. It would only dilute the effort with additional complexity for the sake of flexibility alone.

7 Protocol Independence
=======================

Currently Cocoon is tightly bound to certain protocols, by running an instance of it in a certain environment (servlet, CLI), and it is not (easily) possible to handle different invocation protocols from the same instance. To abstract the transport protocols (through the use of certain consumers or producers) we already have a good working base. What is missing is binding a protocol to a certain port, but we should not duplicate work here which is better left to other software like Apache or Tomcat. We just need to find a way (which I'm sure already exists somewhere) to serve different ports with different protocols. I think the Servlet specification is general enough to support more than HTTP/HTTPS and can help us here.
The servlet API is bound to the request/response paradigm and implicitly assumes that the response goes back to the same address as the request. This is not even close to being general enough for protocol abstraction.

Assuming we have solved the port binding issue, we need some abstraction of the transport protocol. What I mean is that I'd like to use pipelines independently of the way the request has been sent to Cocoon and of how the response has to be sent back to the client.

To solve this we need something like a protocol handler, which maps requests from certain protocols to certain pipelines. The mapping itself is a very abstract thing and heavily depends on the used protocol.
This will make cocoon overlap with protocol-handling concerns.

Let's assume we even solved the protocol handler issue; I'd like to sketch some possible use cases below before we continue.


7.1 Web Services
----------------

As many of you know, there exist two popular styles of using Web Services: SOAP and REST.
Both have their own advantages and disadvantages, but I'd like to concentrate on SOAP and on its transport protocol independence, because REST-style Web Services are already possible with Cocoon.

SOAP allows us to use any transport protocol to deliver SOAP messages. Mostly HTTP(S) is used for this, but there are many cases where you have to use other protocols (like SMTP, FTP, ...).
Whatever protocol you choose to invoke your Web Services, the result should always be the same, and the response should be delivered back through (mostly) the same protocol. Herein lies one of the greatest advantages of protocol independence.
No, this is not protocol independence. This is transport independence; you are still dependent on SOAP as a protocol.

What you want to do now is to implement the Web Service as a bunch of pipelines and let the protocol handler be responsible for invoking the same pipeline no matter which protocol has been used.


7.2 Mail Server
---------------

Nothing prevents you from implementing a mail server which has the ability to integrate various data sources and to expose its functionality via the traditional protocols (SMTP, POP, IMAP), but also via HTTP, WAP, as a Web Service, and whatever else you want.


7.3 Mailing List Manager
------------------------

Mailing list managers typically provide several functions (subscribe, unsubscribe, deliver mail, suspend, archive, search, ...) and manage a list of subscribed users. Once again, you can write such a service once and expose its functionality through traditional protocols (HTTP, SMTP, ...) but also as a Web Service.


7.4 What else?
--------------

Perhaps you realize that this way you are free to implement any application you want using the easy declarative pipeline processing concept. How to connect your application to the outside world is a separate issue which you can decide later and specify independently of the application.


8 Protocol Handler
==================
I don't think Cocoon should implement protocol handlers. Cocoon is a data producer; it should not deal with transport.

We already have enough problems trying to come up with an Environment that could work with both email and web (which have orthogonal client/server paradigms); I don't want to further increase the complexity down this road.

[snip]

11 Converting old sitemaps to new sitemaps
==========================================

Some of you might be interested in whether this new concept is flexible enough to provide at least the same functionality as Cocoon does today.
Yes, I agree that the architecture you describe can be seen as an 'extension' of what Cocoon has today; therefore it is possible to rewrite current sitemaps in the model you propose.

yet, I fail to see the advantage of doing so, since you don't gain any functionality in the problem space where Cocoon lives.

12 Use Cases
============
you provide fancy use cases, but they show me the power of the structured pipe&filter design pattern; they don't tell me why we should do this in Cocoon.

'because it's cool' or 'because it's doable' are not very good arguments around here.

13 Conclusion
=============

You might ask, why should we change so much from Cocoon?
exactly.

First, I think the new components are much more flexible and at least as easy to understand as the old ones: if you want to produce a data stream you use a producer, if you want to consume it you use a consumer, if you want to convert it you use a converter, and if you want to filter it you use a filter.
that is your personal view and can't stand as an objective argument.

To control the data flow you can use the <map:branch/> component.

A possible migration path could be to support both sitemap versions, since the pipeline components either have different names or provide the same functionality. So a new sitemap implementation could be backward compatible with older sitemap versions. This could make the transition for users as easy as possible.

Additionally, it might be possible to provide a migration script (e.g. via XSL) which reads an old sitemap and converts it to the new format. Since everything from the old sitemap can be expressed in the new sitemap and can be formally translated (see section "11 Converting old sitemaps to new sitemaps"), this should not be a big issue.
You don't say *why* we should do this. What do we gain? why should we do audio/video processing on the server side? why should we introduce components that work on just one pipeline model and can't be shared with others?

Oh, you definitely win my vote for the FS of the year award :)

--
Stefano Mazzocchi <[EMAIL PROTECTED]>
Pluralitas non est ponenda sine necessitate [William of Ockham]
--------------------------------------------------------------------


