Hi Cocooners!

Sorry for this (very) long proposal below, but I think it's definitely worth a 
read. If not, at least you can give me some feedback about your opinion ;-)

Bye,

        Andreas Hochsteger

1 Contents
==========

1       Contents
2       Prologue
3       Introduction
4       Pipeline Types
5       Data Formats
5.1     Data Format Definition
5.2     Inheritance
5.3     A word about MIME Types
5.4     Data Handlers
5.5     Data Format Determination
6       Pipeline Components
6.1     Producers
6.2     Consumers
6.3     Converters
6.4     Filters
6.5     Aggregators
6.6     Actions
6.7     Redirectors
6.8     Matchers
6.9     Branches
6.10    Exceptions
7       Protocol Independence
7.1     Web Services
7.2     Mail Server
7.3     Mailing List Manager
7.4     What else?
8       Protocol Handler
8.1     Component Definition
8.2     Protocol Binding
8.3     The Handler's Task
8.4     Mapping to Pipelines
9       Pipelines as Pipeline Components
9.1     Producer Pipelines
9.2     Consumer Pipelines
9.3     Converter Pipelines
9.4     Filter Pipelines
9.5     Action Pipelines
10      Configuration Files
10.1    cocoon.xconf
10.2    components.xconf
10.3    protocols.xconf
10.4    bindings.xconf
10.5    protocol-mappings.xconf
10.6    data-formats.xconf
10.7    sitemap.xmap
10.8    Config File Hierarchy
11      Converting old sitemaps to new sitemaps
11.1    Generators
11.2    Transformers
11.3    Readers
11.4    Serializers
11.5    Selectors
12      Use Cases
12.1    File Upload
12.2    Combining several pipelines
12.3    Unix Pipes
12.4    Image Processing
12.5    PDF decompiling
12.6    Music Processing
13      Conclusion
14      TODO
15      References
16      Appendix
16.1    Data Formats
16.2    Pipeline Components


2 Prologue
==========

I wrote most of this proposal, and another unfinished one, while I had to 
stay in hospital for two weeks at the end of November 2002. Luckily I was 
armed with my notebook loaded with a CVS snapshot of Cocoon and the great 
Cocoon book by Matthew Langham and Carsten Ziegeler.
So I could finally do something productive ;-)

After returning home I had no time to finish it and submit it to the public. 
In the meantime some discussion on similar topics came up on the cocoon-dev 
mailing list (see [1]), so I forced myself to find some time to work on 
this proposal again and finally publish it on the cocoon-dev mailing list.

Perhaps I'll find some time to convert it to an XML format (e.g. Docbook) and 
write a converter to publish it on the Cocoon Documentation Wiki, but first 
let's discuss a bit on the mailing list.

WARNING:
I have to say that this proposal is intended for open-minded people only, 
who aren't afraid to take a look beyond the limits. Anything I'm writing 
here might be total crap to you, so feel free to ignore it, or send your 
flames to /dev/null ;-)
If you are still interested, please join this journey to a world where no man 
has gone before ...


3 Introduction
==============

I like the Cocoon pipeline processing concept very much.
I like it so much that I think it is a pity to limit it to XML 
processing only (although I agree that this is the most interesting 
application).

I'm sure some of you have wanted to build applications the same way Unix 
shell pipes work. Cocoon was a big step in this direction, but it is only 
applicable to processing XML data. There are so many cases where pipeline 
processing of data (no matter if it is XML, plain text or binary data) is 
done today, but we are lacking a generic and declarative way to unify these 
processing steps. Cocoon is well suited for this task thanks to its clean, 
easy to understand yet powerful pipeline concept.


4 Pipeline Types
================

I tried to design several pipeline variants, but after thinking for a while 
they were all still too limited for the way I wanted them to work.

So here's another try, starting with some hypotheses:
1. A pipeline can produce data
2. A pipeline can consume data
3. A pipeline can convert data
4. A pipeline can filter data
5. A pipeline can accept a certain data format as input
6. A pipeline can produce a certain data format as output
7. Pipeline components follow the same hypotheses (1-6)
8. Only pipeline components with compatible data formats can be arranged next 
to each other

Based on these hypotheses you can construct pipelines which just consume 
data, just produce data, both consume and produce data, or neither consume 
nor produce data (even this can make sense, as you'll see in section 
"9.5 Action Pipelines").
I think these hypotheses are simple enough to understand and flexible enough 
to base the rest of this proposal on. So let's try ...

To define a pipeline we need to be able to specify its input and output 
format.
We can do this with the help of these two attributes:
 - input-format="..."
 - output-format="..."

They additionally specify the default input format for the first processing 
component and the default output format for the last processing component.

Example:
        <map:pipeline input-format="format1" output-format="format2">
                ...
        </map:pipeline>

This pipeline consumes the data format "format1" and produces the data format 
"format2". Which data formats are possible and how they are specified is 
shown in the next section.


5 Data Formats
==============

By "data format" I mean something like XML, plain text, png, mp3, ...
I'm not yet really sure how we should specify data formats, so I'll start 
with some requirements:
1. They should be easy to remember and to specify ;-)
2. It should be possible to create derived data formats (-> inheritance)
3. It should be possible to specify additional information (e.g. MIME type, 
DTD/Schema for XML, ...)
4. Pipelines which accept a certain data format as input can be fed with 
derived data formats
5. We should not reinvent standards which are already suited for this task 
(but I fear nothing suitable exists yet)

To make it easier for us to begin with the task of defining data formats, 
let's assume we have three basic data formats called "abstract", "binary" 
and "text". The format "abstract" will be explained later, but "binary" and 
"text" should be clear to everyone.


5.1 Data Format Definition
--------------------------

Here's a try to specify a hierarchy of data formats:

        <data:formats>
                <!-- #### Super data format #### -->
                <!--
                        The following format is the base for all other formats
                        (compare to java.lang.Object).
                        Although it is called the 'any' data format, this name is
                        not prepended to the derived data formats, as is the case
                        for all other data formats.
                -->
                <data:format name="any"
                    impl="org.apache.cocoon.data.handler.text.DefaultHandler">
                        <data:param-def name="mime-type"
                            default="application/octet-stream"/>
                        <!-- URL to the specification of this data format -->
                        <data:param-def name="spec" default=""/>
                </data:format>

                <!-- #### Abstract data formats #### -->
                <data:format name="abstract"
                    impl="org.apache.cocoon.data.handler.abstract.DefaultHandler"/>
                <data:format name="image" extends="/abstract"
                    impl="org.apache.cocoon.data.handler.abstract.ImageHandler">
                        <data:param-def name="depth" default=""/>
                        <data:param-def name="width" default=""/>
                        <data:param-def name="height" default=""/>
                </data:format>
                <data:format name="music" extends="/abstract"
                    impl="org.apache.cocoon.data.handler.abstract.MusicHandler">
                        <data:param-def name="channels" default=""/>
                </data:format>
                <data:format name="sound" extends="/abstract"
                    impl="org.apache.cocoon.data.handler.abstract.SoundHandler">
                        <data:param-def name="samplesize" default=""/>
                        <data:param-def name="samplerate" default=""/>
                        <data:param-def name="channels" default=""/>
                </data:format>
                <!--
                        Multiple inheritance is used for video, which extends image
                        and sound.
                        Is there a better way to specify multiple base formats?
                -->
                <data:format name="video" extends="/abstract/image /abstract/sound"
                    impl="org.apache.cocoon.data.handler.abstract.VideoHandler">
                        <data:param-def name="framerate" default=""/>
                </data:format>
                <data:format name="vector" extends="/abstract"
                    impl="org.apache.cocoon.data.handler.abstract.VectorHandler">
                        <data:param-def name="unit" default=""/>
                        <data:param-def name="width" default=""/>
                        <data:param-def name="height" default=""/>
                </data:format>
                <data:format name="3d" extends="/abstract/vector"
                    impl="org.apache.cocoon.data.handler.abstract.3DHandler">
                        <data:param-def name="depth" default=""/>
                </data:format>
                
                <!-- #### Binary based data formats #### -->
                <data:format name="binary"
                    impl="org.apache.cocoon.data.handler.binary.DefaultHandler">
                        <data:param-def name="endian" default="little"/>
                </data:format>

                <!-- MS OLE based data formats -->
                <data:format name="ole" extends="/binary"
                    impl="org.apache.cocoon.data.handler.binary.ole.DefaultHandler"/>
                <data:format name="msword" extends="/binary/ole"
                    impl="org.apache.cocoon.data.handler.binary.ole.MSWordHandler"/>
                <data:format name="msexcel" extends="/binary/ole"
                    impl="org.apache.cocoon.data.handler.binary.ole.MSExcelHandler"/>

                <!-- Linux ELF based data formats -->
                <data:format name="elf" extends="/binary"
                    impl="org.apache.cocoon.data.handler.binary.elf.DefaultHandler">
                        <data:param-def name="architecture" default="x86"/>
                </data:format>
                <data:format name="executable" extends="/binary/elf"
                    impl="org.apache.cocoon.data.handler.binary.elf.ExecutableHandler"/>
                <data:format name="shared" extends="/binary/elf"
                    impl="org.apache.cocoon.data.handler.binary.elf.SharedLibraryHandler"/>

                <!-- #### Text based data formats #### -->
                <data:format name="text"
                    impl="org.apache.cocoon.data.handler.text.DefaultHandler">
                        <data:param-def name="encoding" default="UTF-8"/>
                        <data:parameter name="mime-type" value="text/plain"/>
                </data:format>
                <data:format name="xml" extends="/text"
                    impl="org.apache.cocoon.data.handler.xml.DefaultHandler">
                        <!-- this handler deals with SAX events inside the pipeline -->
                        <!-- other possible values: dtd, rng, schematron, ... -->
                        <data:param-def name="schema-type" default="xsd"/>
                        <data:param-def name="schema" default=""/>
                        <data:parameter name="mime-type" value="text/xml"/>
                </data:format>
                <data:format name="xhtml" extends="/text/xml"
                    impl="org.apache.cocoon.data.handler.xml.XHTMLHandler">
                        <data:parameter name="mime-type" value="text/html"/>
                        <data:parameter name="schema"
                            value="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
                </data:format>
        </data:formats>

It's just a first sketch, but I think you get the idea.

Above you can see the super data format 'any' and some abstract, text and 
binary data formats, which show you how to specify inherited data formats. 
If no extends="..." attribute is given, the format is automatically derived 
from the data format 'any'.

References to data formats are made by using a path which identifies the 
respective data format. This path is built by appending the data format name 
to the path of the parent data format, separated by a slash. The super data 
format is an exception to this rule and is just called 'any'. It is not part 
of the path of derived data formats, to keep them shorter. It is possible to 
use relative data format paths too. E.g. if a pipeline consumes /text/xml and 
a converter generates XHTML from it, the converter can use 
output-format="xhtml" instead of output-format="/text/xml/xhtml". The name 
'any' is reserved for the super data format and it is not allowed to name 
derived data formats after it.

'none' is another reserved name which is used if a pipeline does not consume 
data (input-format="none") or produce data (output-format="none"). It is the 
default for all pipelines if it is not overridden by pipelines or their 
components.
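
To make the relative path rule a bit more concrete, here is a minimal Java 
sketch of how such a reference could be resolved. The class FormatPaths and 
the method resolve() are made up for illustration and exist nowhere in 
Cocoon. How a relative name should be resolved when the context is 'any' or 
'none' is my assumption too, since the text above does not define it.

```java
final class FormatPaths {
    /**
     * Resolves a data format reference against the format of the enclosing
     * context (e.g. the input-format of the pipeline). Hypothetical helper.
     */
    static String resolve(String contextFormat, String reference) {
        // Absolute paths and the two reserved names are returned unchanged.
        if (reference.startsWith("/")
                || reference.equals("any")
                || reference.equals("none")) {
            return reference;
        }
        // Assumption: without a usable context a relative name is resolved
        // against the root, because 'any' is never part of a path.
        if (contextFormat == null
                || contextFormat.equals("any")
                || contextFormat.equals("none")) {
            return "/" + reference;
        }
        // Relative name: append it to the context path with a slash.
        return contextFormat + "/" + reference;
    }

    public static void main(String[] args) {
        System.out.println(resolve("/text/xml", "xhtml"));       // /text/xml/xhtml
        System.out.println(resolve("/text/xml", "/binary/ole")); // /binary/ole
    }
}
```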


With the definitions from above, the following strings can be used to 
specify data formats:

 - any
 - /abstract/image
 - /abstract/music
 - /abstract/sound
 - /abstract/video
 - /abstract/vector
 - /abstract/vector/3d
 - /binary
 - /binary/ole
 - /binary/ole/msword
 - /binary/ole/msexcel
 - /binary/elf
 - /binary/elf/executable
 - /binary/elf/shared
 - /text
 - /text/xml
 - /text/xml/xhtml

See section "16.1 Data Formats" for more examples.

One enhancement of this scheme might be useful: specifying version numbers 
or format variants.
One way might be to append the version number to the end, separated by a 
slash, but I think this would mix different concerns. My suggestion would be 
to append the version information in brackets, as the following shows:

 - /text/xml/xhtml[1.0]
 - /text/xml/xhtml[1.1]

Instead of:

 - /text/xml/xhtml/1.0
 - /text/xml/xhtml/1.1
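
The bracket notation keeps the path and the version apart, which also makes 
it easy to split them again. Here is a small Java sketch of such a split. 
The class VersionedFormat is purely hypothetical, and it assumes that a 
version, if present, is the last thing in the string.

```java
final class VersionedFormat {
    final String path;    // e.g. "/text/xml/xhtml"
    final String version; // e.g. "1.1", or null if no version was given

    VersionedFormat(String path, String version) {
        this.path = path;
        this.version = version;
    }

    /** Splits "/text/xml/xhtml[1.1]" into its path and its version. */
    static VersionedFormat parse(String spec) {
        int open = spec.indexOf('[');
        if (open < 0) {
            return new VersionedFormat(spec, null); // no version given
        }
        if (!spec.endsWith("]")) {
            throw new IllegalArgumentException("Unterminated version in: " + spec);
        }
        String version = spec.substring(open + 1, spec.length() - 1);
        return new VersionedFormat(spec.substring(0, open), version);
    }

    public static void main(String[] args) {
        VersionedFormat f = parse("/text/xml/xhtml[1.1]");
        System.out.println(f.path + " version " + f.version);
    }
}
```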


5.2 Inheritance
---------------

A pipeline which consumes a certain data format can be fed with derived data 
formats too.
Take the following pipeline as an example:

        <map:pipeline input-format="/text/xml">
                ...
        </map:pipeline>

This pipeline would consume the data format "/text/xml/xhtml" without 
problems, but would throw an exception if you fed it the data format 
"/text".
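
With path-based names this check boils down to a prefix comparison on path 
segments. The following Java sketch shows the idea. The class 
FormatCompatibility and its methods are made up for illustration, and the 
sketch only covers single inheritance: a format with several base formats 
(like 'video' above) would need a lookup in the format definitions instead.

```java
import java.util.Arrays;
import java.util.List;

final class FormatCompatibility {
    /** Splits "/text/xml/xhtml" into the segments [text, xml, xhtml]. */
    static List<String> segments(String path) {
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        return Arrays.asList(trimmed.split("/"));
    }

    /** True if a consumer declared with 'accepted' can be fed 'actual'. */
    static boolean accepts(String accepted, String actual) {
        if (accepted.equals("any")) {
            return true; // the super data format accepts everything
        }
        if (actual.equals("any")) {
            return false; // too general for a more specific consumer
        }
        List<String> want = segments(accepted);
        List<String> got = segments(actual);
        // 'actual' is derived from 'accepted' if its path starts with all
        // segments of 'accepted'. Comparing whole segments avoids treating
        // "/text/xmlfoo" as derived from "/text/xml".
        return got.size() >= want.size()
                && got.subList(0, want.size()).equals(want);
    }

    public static void main(String[] args) {
        System.out.println(accepts("/text/xml", "/text/xml/xhtml")); // true
        System.out.println(accepts("/text/xml", "/text"));           // false
    }
}
```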


5.3 A word about MIME Types
---------------------------

If you ask me why I don't use the standardized MIME types (see [2]) to 
specify data formats, I can give you the following reasons:
MIME types fulfill the requirements from above only partly. They support 
just two levels of classification and they are purpose-oriented. The data 
formats I suggest are, in contrast, content-oriented (/text/xml/svg vs. 
image/svg+xml). So the two serve different purposes.

I know the importance of supporting the MIME type standard, and so the 
parameter 'mime-type' is part of the super data format 'any' and thus is 
available for every other data format too. By specifying a certain data 
format, you always have a MIME type associated with it. In the worst case 
the MIME type from the super data format 'any' (application/octet-stream) 
is used.
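
As a rough illustration of this fallback, here is a Java sketch which walks 
up the format path until it finds a 'mime-type' value. The class MimeLookup 
is hypothetical. The table only contains the values from the sketch in 
section 5.1, and the sketch assumes single inheritance along the path.

```java
import java.util.HashMap;
import java.util.Map;

final class MimeLookup {
    // mime-type values taken from the format definitions in section 5.1
    static final Map<String, String> MIME = new HashMap<>();
    static {
        MIME.put("any", "application/octet-stream");
        MIME.put("/text", "text/plain");
        MIME.put("/text/xml", "text/xml");
        MIME.put("/text/xml/xhtml", "text/html");
    }

    /** Returns the parent path, or "any" for a top-level format. */
    static String parent(String path) {
        int cut = path.lastIndexOf('/');
        return cut <= 0 ? "any" : path.substring(0, cut);
    }

    /** Walks up the hierarchy until some format defines a mime-type. */
    static String mimeType(String path) {
        String current = path;
        while (!MIME.containsKey(current)) {
            current = parent(current); // ends at "any", which is always defined
        }
        return MIME.get(current);
    }

    public static void main(String[] args) {
        System.out.println(mimeType("/text/xml/xhtml"));    // text/html
        System.out.println(mimeType("/binary/ole/msword")); // application/octet-stream
    }
}
```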


5.4 Data Handlers
-----------------

I'm not very sure what the data handlers should actually do. I can think of 
two options: either they define an interface which must be implemented by 
the pipeline components that operate on a certain data format (do we need 
two handlers here: input-handler and output-handler?), or they are concrete 
components which can be used by the pipeline components to consume or 
produce this data format. I think some discussion on this topic would be 
useful.


5.5 Data Format Determination
-----------------------------

In many cases I've written the input- and output-format along with the 
pipeline components, but it is also possible to specify them in the 
<map:components/> section, or implicitly by implementing a certain component 
interface, and thus omit them in the pipeline.

Here's a suggested order of data format determination:

1. Input-/output-format specified directly with a pipeline component
        <map:produce type="uri" ref="docs/file.xml" output-format="/text/xml"/>
2. Input-/output-format specified by the component declaration
        <map:filters>
                <map:filter name="prettyxml" input-format="/text/xml"
                    output-format="/text/xml" ... />
        </map:filters>
3. Output-/input-format specified by the previous or following pipeline 
   component
        <map:produce type="uri" ref="docs/file.xhtml"
            output-format="/text/xml/xhtml"/>
        <!-- input- and output-format="/text/xml/xhtml" from previous pipeline
             component -->
        <map:filter type="prettyxml"/>
4. Input-/output-format specified directly with a pipeline
        <map:pipeline input-format="/text/xml" output-format="/text/xml">
                <map:filter type="prettyxml"/>
                ...
        </map:pipeline>
5. If nothing from above matches then assume "none".
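
The order above is a simple "first one that is set wins" chain. A minimal 
Java sketch of it could look like this. The class FormatDetermination and 
its parameter names are hypothetical, and each argument stands for the 
format found at the respective step (null or empty if nothing was 
specified there).

```java
final class FormatDetermination {
    /**
     * Returns the first format that is set, in the order of steps 1-4,
     * or "none" if no step yields a format (step 5).
     */
    static String determine(String onUsage,        // step 1
                            String onDeclaration,  // step 2
                            String fromNeighbour,  // step 3
                            String onPipeline) {   // step 4
        String[] candidates = { onUsage, onDeclaration, fromNeighbour, onPipeline };
        for (String candidate : candidates) {
            if (candidate != null && !candidate.isEmpty()) {
                return candidate;
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        // Nothing on the usage, but the declaration says "/text/xml".
        System.out.println(determine(null, "/text/xml", "/text/xml/xhtml", null));
    }
}
```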


6 Pipeline Components
=====================

Now that we have a big picture of the pipelines and a flexible way to specify 
data formats which flow through the pipelines we can move on to specify the 
pipeline components.

To allow a fresh and clean design, set aside all known pipeline components 
like generators, transformers, serializers, ... and what you know about 
their functionality. I'll use the same names where this makes sense, but 
keep in mind that we are not only talking about processing XML data, so 
their functionality may be different.

Currently Cocoon pipeline components all work with XML data. In this 
proposal the components are meant to process any data format available, and 
I'm sure you'll agree that great care has to be taken to manage the huge 
amount of possible pipeline components. One problem here is the flat 
namespace of component names. As a solution I'd suggest using hierarchical 
path names for components and grouping related components under the same 
path.


6.1 Producers
-------------

They simply produce a data stream, possibly by reading data from a data 
repository. Producers are used if no data is consumed by the pipeline and 
are usually placed at the beginning of a pipeline.

Component definition:
        <map:producers default="uri">
                ...
                <!--
                        The following producer is similar to the old file generator
                        but can produce any data format.
                        I renamed 'file' to 'uri' since it does not only read files,
                        but any resource which can be expressed by a URI and whose
                        protocol is known.
                -->
                <map:producer name="uri"
                    impl="org.apache.cocoon.pipeline.producer.URIProducer"
                    output-format="any"/>
                <!-- The next producer might be identical to the old file generator. -->
                <map:producer name="xml/uri"
                    impl="org.apache.cocoon.pipeline.producer.xml.URIProducer"
                    output-format="/text/xml"/>
                ...
        </map:producers>

Usage examples:
        <map:produce type="uri" output-format="/binary/ole/msword"
            ref="docs/{1}.doc"/>
        <map:produce type="xml/uri" ref="xmldb:xindice://localhost:4080/db/{1}"/>


6.2 Consumers
-------------

They consume a data stream, possibly by writing it to a data repository. 
Consumers are used if no data should be produced by the pipeline and are 
usually placed at the end of a pipeline.
For a typical use of consumers in a web environment, some result has to be 
sent back to the client. Here I'd suggest using <map:redirect-to/> to 
redirect to another pipeline (perhaps depending on the result of the 
consumer -> success/error).

Component definition:
        <map:consumers default="uri">
                ...
                <map:consumer name="uri"
                    impl="org.apache.cocoon.pipeline.consumer.URIConsumer"
                    input-format="any"/>
                <map:consumer name="xml/uri"
                    impl="org.apache.cocoon.pipeline.consumer.xml.URIConsumer"
                    input-format="/text/xml"/>
                <map:consumer name="http/response"
                    impl="org.apache.cocoon.pipeline.consumer.http.ResponseConsumer"
                    input-format="/text/xml"/>
                ...
        </map:consumers>

Usage example (with redirection):
        <map:consume type="xml/uri" ref="xmldb:xindice://localhost:4080/db/{1}"/>
        <!-- map:branch is explained below under "Branches" -->
        <map:branch type="status">
                <map:case match="success">
                        <map:redirect-to ref="success-page"/>
                </map:case>
                <map:default>
                        <map:redirect-to ref="error-page"/>
                </map:default>
        </map:branch>


6.3 Converters
--------------

They convert a data stream from one data format into another.

Component definition:
        <map:converters default="http/response">
                ...
                <map:converter name="http/response"
                    impl="org.apache.cocoon.pipeline.converter.http.ResponseConverter"
                    input-format="any" output-format="/text/http/response"/>
                <map:converter name="xhtml2html"
                    impl="org.apache.cocoon.pipeline.converter.xml.XHTML2HTMLConverter"
                    input-format="/text/xml/xhtml" output-format="/text/sgml/html"/>
                ...
        </map:converters>

This example converts XHTML to HTML:
        <map:convert type="xhtml2html"/>

This example converts any data format to an HTTP response (without 
delivering it; this is the task of the consumer "http/response"!):
        <map:convert type="http/response"/>


6.4 Filters
-----------

They modify a data stream while keeping the data format.

Component definition:
        <map:filters default="xml/xslt">
                ...
                <map:filter name="xml/xslt"
                    impl="org.apache.cocoon.pipeline.filter.XSLTFilter"
                    input-format="/text/xml" output-format="/text/xml"/>
                <!-- unix grep (regular expression filter) -->
                <map:filter name="text/grep"
                    impl="org.apache.cocoon.pipeline.filter.text.GrepFilter"
                    input-format="/text" output-format="/text"/>
                <!-- unix wc (word count) -->
                <map:filter name="text/wc"
                    impl="org.apache.cocoon.pipeline.filter.text.WordCount"
                    input-format="/text" output-format="/text"/>
                ...
        </map:filters>

Usage examples:
        <map:filter type="xml/xslt" ref="stylesheets/news2page.xsl"/>
        <map:filter type="xml/xslt" ref="stylesheets/page2xhtml.xsl"
            output-format="/text/xml/xhtml"/>
        <map:filter type="text/grep">
                <map:parameter name="pattern" value="my grep pattern"/>
        </map:filter>
        <map:filter type="text/wc">
                <map:parameter name="mode" value="linecount"/>
        </map:filter>

The second filter might look like a converter to you, but its output format 
is still compatible with "/text/xml" ("/text/xml/xhtml" is derived from 
"/text/xml") and thus it can be treated as a filter.

Theoretically you can do the work of a filter with a converter, but that's 
often not what people intend to do. Why should they use a converter when 
they want to filter the data? Practically, a filter is a special case of a 
converter where the input- and output-format are equivalent. So a filter 
with the data format "/text/xml" might be just an alias for 
<map:convert input-format="/text/xml" output-format="/text/xml" .../>, 
which keeps the sitemap simpler to understand.


6.5 Aggregators
---------------

They aggregate multiple data streams of the same format into one data 
stream. There can be multiple implementations of aggregators, just as is 
the case for producers.

Component definition:
        <map:aggregators default="append">
                ...
                <map:aggregator name="append"
                    impl="org.apache.cocoon.pipeline.aggregator.AppendAggregator"
                    input-format="any" output-format="any"/>
                <map:aggregator name="sound/mixer"
                    impl="org.apache.cocoon.pipeline.aggregator.sound.MixerAggregator"
                    input-format="/abstract/sound" output-format="/abstract/sound"/>
                ...
        </map:aggregators>

Here's an example of how to aggregate different sound tracks into one:
        <map:aggregate type="sound/mixer">
                <!-- All parts have the same output-format ("/abstract/sound") -->
                <map:part ref="song/drums">
                        <map:parameter name="volume" value="0.8"/>
                </map:part>
                <map:part ref="song/keyboard">
                        <map:parameter name="volume" value="0.7"/>
                </map:part>
                <map:part ref="song/guitar">
                        <map:parameter name="volume" value="0.8"/>
                </map:part>
                <map:part ref="song/bass">
                        <map:parameter name="volume" value="0.7"/>
                </map:part>
                <map:part ref="song/voice">
                        <map:parameter name="volume" value="1.0"/>
                </map:part>
        </map:aggregate>


6.6 Actions
-----------

They are somewhat similar to the actions already existing in Cocoon. They 
neither produce data nor consume data and therefore don't directly affect the 
data stream. They only affect the way the pipeline components work.


6.7 Redirectors
---------------

They are the same as those already existing in Cocoon, with the exception 
that the attribute 'uri' is renamed to 'ref' for consistency.

Example:
        <map:redirect-to ref="redirected-page"/>

 
6.8 Matchers
------------

They have practically the same functionality as the existing matchers. I'd 
suggest one extension though: a kind of polymorphism for URLs. This way it's 
possible to write pipelines for different input data formats while using 
identical URLs.

Component definition:
        <map:matchers default="wildcard">
                <map:matcher name="wildcard"
                    impl="org.apache.cocoon.pipeline.matcher.WildcardURIMatcher"/>
                ...
        </map:matchers>

Example with polymorphic URI matching:
        <map:pipeline input-format="/text/xml">
                <map:match pattern="upload/*">
                        <map:consume ref="xmldb:xindice://localhost:4080/db/{1}"/>
                </map:match>
        </map:pipeline>

        <map:pipeline input-format="/binary">
                <map:match pattern="upload/*">
                        <map:consume ref="files/binaries/{1}"/>
                </map:match>
        </map:pipeline>


6.9 Branches
------------

They affect the path the data stream takes through the pipeline. Branches 
are somewhat similar to selectors, but they are more like the control 
structures in Java (if, switch, ...). Matching works similarly to 
<map:match/> constructs.

The expression you want to test is represented by the attribute 'test'. The 
type of test is specified by the attribute 'type' where 'xpath' may be the 
most useful type and therefore the default. You can use other types like 
'browser' for browser-dependent branching.

The following example tests one value and compares it to different cases to 
determine the right choice. Matching cases are executed in order (how many 
of them depends on the attribute 'continue'). If no case matches, the 
<map:default/> path is taken, if available. The case matcher can be compared 
to the <map:match/> component, thus different pattern types are possible 
(pattern, regexp, ...).

The <map:branch> element uses several attributes which are explained below:
 - type: Type of branch to use
 - test: Information about what should be used for branching
 - data-type: XML Schema based data type (see [3]) for correct comparison 
   (esp. for dates)
 - continue: Specifies whether matching should be continued after a 
   successful match

Component definition:
        <map:branches default="value">
                <!-- This branch uses the value of the attribute 'test' for
                     branching -->
                <map:branch name="value"
                    impl="org.apache.cocoon.pipeline.branch.ValueBranch"/>
                <!-- This branch uses the user agent string for branching -->
                <map:branch name="browser"
                    impl="org.apache.cocoon.pipeline.branch.BrowserBranch"/>
                <!-- This branch uses an XPath expression for branching -->
                <map:branch name="xpath"
                    impl="org.apache.cocoon.pipeline.branch.XPathBranch"/>
                <!-- This branch uses the error status of the last called
                     component for branching -->
                <map:branch name="status"
                    impl="org.apache.cocoon.pipeline.branch.StatusBranch"/>
                ...
        </map:branches>

Example:
        <map:branch type="xpath" test="/document/metadata/status"
            data-type="xsd:string" continue="false">
                <map:case match="archive">
                        <map:consume type="uri"
                            ref="xmldb:xindice://localhost:4080/db/archive/{1}"/>
                </map:case>
                <map:case match="live">
                        <map:consume type="uri"
                            ref="xmldb:xindice://localhost:4080/db/live/{1}"/>
                </map:case>
                <map:default>
                        <map:consume type="uri"
                            ref="xmldb:xindice://localhost:4080/db/draft/{1}"/>
                </map:default>
        </map:branch>

The next example allows more flexible tests by specifying a different 
condition in the attribute 'test' for every case. Theoretically it's 
possible that multiple case statements match. You can control the behavior 
with the attribute 'continue', which is 'false' by default, meaning that the 
first matching case gets executed and the <map:branch>...</map:branch> block 
is left. If you set it to 'true', the <map:branch/> block is not left after 
executing this case, and the following case statements are evaluated too. 
The level of granularity is up to you: you can set 'continue' directly on 
the <map:branch> element, thus setting the default behavior for all 
<map:case> elements. Additionally you can set it on individual <map:case> 
statements which should be treated specially.

        <map:produce ... output-format="/text/xml"/>
        <map:branch>
                <map:case type="xpath"
                    test="/document/metadata/online-date &lt; date()"
                    continue="true" data-type="xsd:date">
                        <map:consume type="uri"
                            ref="xmldb:xindice://localhost:4080/db/live/{1}"/>
                </map:case>
                <map:case ...>
                        ...
                </map:case>
        </map:branch>
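
To pin down the 'continue' semantics described above, here is a small Java 
sketch of how a branch could pick the cases to execute. The class 
BranchSketch is made up for illustration. It assumes that the tests have 
already been evaluated to true or false, and that <map:default/> is only 
taken if no case matched at all.

```java
import java.util.ArrayList;
import java.util.List;

final class BranchSketch {
    static final class Case {
        final String name;
        final boolean matches;      // result of evaluating the test of this case
        final Boolean continueFlag; // null = inherit 'continue' from <map:branch>

        Case(String name, boolean matches, Boolean continueFlag) {
            this.name = name;
            this.matches = matches;
            this.continueFlag = continueFlag;
        }
    }

    /** Returns the names of the cases which get executed, in order. */
    static List<String> evaluate(List<Case> cases, boolean branchContinue,
                                 boolean hasDefault) {
        List<String> executed = new ArrayList<>();
        for (Case c : cases) {
            if (!c.matches) {
                continue; // try the next case
            }
            executed.add(c.name);
            // A 'continue' on the case overrides the one on the branch.
            boolean goOn = c.continueFlag != null ? c.continueFlag : branchContinue;
            if (!goOn) {
                break; // leave the <map:branch> block
            }
        }
        if (executed.isEmpty() && hasDefault) {
            executed.add("default");
        }
        return executed;
    }

    public static void main(String[] args) {
        List<Case> cases = new ArrayList<>();
        cases.add(new Case("live", true, Boolean.TRUE)); // continue="true"
        cases.add(new Case("archive", false, null));
        cases.add(new Case("review", true, null));       // inherits "false"
        cases.add(new Case("draft", true, null));        // never reached
        System.out.println(evaluate(cases, false, true)); // [live, review]
    }
}
```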


6.10 Exceptions
---------------

If some error occurs in the pipeline, you can throw and catch exceptions. 
This is necessary, since the introduction of data formats can cause problems 
when feeding a pipeline with the wrong data format. But there are many other 
cases where exception handling in the sitemap can be useful. To make them 
easier to understand, I'll base them on Java exceptions.

To throw an exception you can use <map:throw type="some type" message="some 
message"/>, where type stands for an exception type and message for an 
optional description of the exception. If you have to pass values to the 
exception you want to throw, you can use 
<map:parameter name="..." value="..."/> inside the 
<map:throw>...</map:throw> block. The exception can then be caught with 
<map:catch type="some type">...</map:catch>, which can be located in 
different scopes as you can see below.

Component definition:
        <map:exceptions>
                <map:exception name="data-format"
                    impl="org.apache.cocoon.pipeline.exception.DataFormatException"/>
                ...
        </map:exceptions>

The order in which the scopes of the exception handlers are searched can be 
seen from the following examples:

1. Local exception handlers
        <map:pipeline>
                <map:match pattern="exception-test">
                        ...
                        <map:throw type="sometype"
                            message="This is a message explaining the error."/>
                        ...
                        <map:catch type="sometype">
                                ...
                        </map:catch>
                </map:match>
        </map:pipeline>

2. Pipeline exception handlers
        <map:pipeline>
                <map:match pattern="exception-test">
                        ...
                        <map:throw type="sometype"
                            message="This is a message explaining the error."/>
                        ...
                </map:match>
                        ...
                <map:exception-handlers>
                        <map:catch type="sometype">
                                ...
                        </map:catch>
                </map:exception-handlers>
        </map:pipeline>

3. Global exception handlers
        <map:pipeline>
                <map:match pattern="exception-test">
                        ...
                        <map:throw type="sometype"
                            message="This is a message explaining the error."/>
                        ...
                </map:match>
        </map:pipeline>

        <map:exception-handlers>
                <map:catch type="sometype">
                        ...
                </map:catch>
        </map:exception-handlers>
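
The search order from the three examples above can be sketched in a few lines 
of Python (again not Cocoon code). It assumes that the innermost scope which 
declares a matching <map:catch> wins. The function name find_handler and the 
handler tables are hypothetical.

```python
from typing import Dict, List, Optional


def find_handler(exc_type: str, scopes: List[Dict[str, str]]) -> Optional[str]:
    """Search the scopes from innermost to outermost for a matching catch.

    'scopes' is ordered [local, pipeline, global]; each scope maps an
    exception type to a description of its <map:catch> block.
    """
    for scope in scopes:
        if exc_type in scope:
            return scope[exc_type]
    return None  # nobody catches this exception type


# Hypothetical handler tables for the three scopes.
local_handlers = {"data-format": "local catch"}
pipeline_handlers = {"data-format": "pipeline catch", "io": "pipeline catch"}
global_handlers = {"io": "global catch", "sometype": "global catch"}
scopes = [local_handlers, pipeline_handlers, global_handlers]

handler = find_handler("data-format", scopes)
```

Here "data-format" is handled locally, "io" by the pipeline handlers and 
"sometype" only by the global handlers.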


7 Protocol Independence
=======================

Currently Cocoon is tightly bound to certain protocols, because an instance 
runs in one particular environment (servlet, CLI), and it is not easily 
possible to handle different invocation protocols from the same instance. For 
abstracting the transport protocols (through the use of certain consumers or 
producers) we already have a good working base. What is missing is binding a 
protocol to a certain port, but we should not duplicate work here which is 
better left to other software like Apache or Tomcat. We just need to find a 
way (which I'm sure already exists somewhere) to serve different ports with 
different protocols. I think the Servlet specification is general enough to 
support more than just HTTP/HTTPS and can help us here.

Assuming that we have solved the port binding issue, we need some abstraction 
of the transport protocol. What I mean here is that I'd like to use pipelines 
independently of the way the request has been sent to Cocoon and how the 
response has to be sent back to the client.

To solve this we need something like a protocol handler, which maps requests 
from certain protocols to certain pipelines. The mapping itself is a very 
abstract thing and heavily depends on the protocol used.

Assuming that we have even solved the protocol handler issue, I'd like to 
sketch some possible use cases below before we continue.


7.1 Web Services
----------------

As many of you know, there are two popular styles of Web Services: SOAP and 
REST. Both have their own advantages and disadvantages, but I'd like to 
concentrate on SOAP and on its transport protocol independence, because 
REST-style Web Services are already possible with Cocoon.

SOAP allows us to use any transport protocol to deliver SOAP messages. HTTP(S) 
is used most of the time, but there are many cases where you have to use 
other protocols (like SMTP, FTP, ...).
Whichever protocol you choose to invoke your Web Services, the result should 
always be the same and the response should be delivered back through (mostly) 
the same protocol. This is one of the greatest advantages of protocol 
independence.

What you want to do now is to implement the Web Service as a bunch of 
pipelines and let the protocol handler be responsible for invoking the same 
pipeline no matter which protocol has been used.


7.2 Mail Server
---------------

Nothing prevents you from implementing a mail server which integrates various 
data sources and exposes its functionality via the traditional protocols 
(SMTP, POP, IMAP) but also via HTTP, WAP, as a Web Service, or whatever you 
want.


7.3 Mailing List Manager
------------------------

Mailing list managers typically provide several functions (subscribe, 
unsubscribe, deliver mail, suspend, archive, search, ...) and manage a list 
of subscribed users. Once again, you can write such a service once and expose 
its functionality through traditional protocols (HTTP, SMTP, ...) but also 
as a Web Service.


7.4 What else?
--------------

Perhaps you realize that this way you are free to implement any application 
you want using the easy declarative pipeline processing concept. How to 
connect your application to the outside world is a separate issue which you 
can decide later and specify independently of the application.


8 Protocol Handler
==================

This component has been mentioned several times now, so it is time to try to 
explain it in more detail.
Currently Cocoon pipelines are primarily written for HTTP communication. A 
request is sent from a client to the server and enters a certain pipeline via 
the <map:match/> statements. The end of a pipeline always generates the 
response which is sent back to the client. As you can see, even if Cocoon can 
theoretically run in several environments, the servlet environment with the 
HTTP(S) protocol is the one which is used in most cases. So most pipelines 
are dependent on the HTTP protocol.

I'd suggest introducing an abstraction layer between direct pipeline 
invocation and the request from the client through a certain protocol. I'll 
try my best to make this as clear as possible ...


8.1 Component Definition
------------------------

Let's begin by defining the protocol handlers in the <map:components/> 
section:
        <map:protocols default="http">
                <map:protocol name="http"
                        impl="org.apache.cocoon.protocol.HTTPProtocol"/>
                <map:protocol name="https"
                        impl="org.apache.cocoon.protocol.HTTPSProtocol"/>
                <map:protocol name="ftp"
                        impl="org.apache.cocoon.protocol.FTPProtocol"/>
                <map:protocol name="smtp"
                        impl="org.apache.cocoon.protocol.SMTPProtocol"/>
                <map:protocol name="pop3"
                        impl="org.apache.cocoon.protocol.POP3Protocol"/>
                <map:protocol name="imap"
                        impl="org.apache.cocoon.protocol.IMAPProtocol"/>
                ...
        </map:protocols>


8.2 Protocol Binding
--------------------

After we have all possible protocols defined, we have to bind them to certain 
ports.

Here I'd suggest the following:
        <map:bindings>
                <map:bind protocol="http"  port="80"/>
                <map:bind protocol="http"  port="8080"/>
                <map:bind protocol="ftp"   port="21"/>
                <map:bind protocol="https" port="443"/>
                <map:bind protocol="smtp"  port="25"/>
                <map:bind protocol="pop3"  port="110"/>
                <map:bind protocol="pop3s" port="995"/>
                <map:bind protocol="imap"  port="143"/>
                <map:bind protocol="imaps" port="993"/>
        </map:bindings>

Tomcat, for example, already does this kind of binding in the config file 
server.xml. Perhaps we don't really need this protocol binding in Cocoon, but 
we should first check whether we can get all the information we need from the 
servlet container in a portable way (without depending on Tomcat!).


8.3 The Handler's Task
----------------------

Well, what does a protocol handler actually do?

First, it knows how to communicate using a certain protocol. That's obviously 
the most important thing, but it's not enough for us.

Second, it determines which pipeline has to be invoked. It decides this by 
applying certain mapping rules to the information it gets from the request.

Third, it automatically provides a producer or consumer, depending on the 
request or response and the pipeline which has to be invoked.


8.4 Mapping to Pipelines
------------------------

Mapping a request from a certain protocol to a certain pipeline can be a 
difficult task and depends heavily on the protocol itself. So I can only give 
you an example of a possible solution.

        <map:mappings>
                <map:protocol name="http">
                        <!-- maps the URI of all http requests directly to all
                             pipelines -->
                        <map:map type="request-uri" from="**" to="**"/>
                        <!-- The components of this pipeline are executed before
                             the sitemap pipeline components -->
                        <map:pipeline type="request">
                                <map:produce type="http/request"
                                        output-format="/text/http/request"/>
                                <map:convert type="http/request2any"
                                        input-format="/text/http/request"
                                        output-format="any"/>
                        </map:pipeline>
                        <!-- The components of this pipeline are executed after
                             the sitemap pipeline components -->
                        <map:pipeline type="response">
                                <map:convert type="http/any2response"
                                        input-format="any"
                                        output-format="/text/http/response"/>
                                <map:consume type="http/response"
                                        input-format="/text/http/response"/>
                        </map:pipeline>
                </map:protocol>
                <map:protocol name="smtp">
                        <!-- maps content of the mail header "Cocoon-Pipeline"
                             directly to all pipelines -->
                        <map:map type="header" from="Cocoon-Pipeline: **"
                                to="post/**"/>
                        <!-- The components of this pipeline are executed after
                             the sitemap pipeline components -->
                        <map:pipeline type="post">
                                <map:convert type="smtp/any2post" input-format="any"
                                        output-format="/text/smtp"/>
                                <map:consume type="smtp" input-format="/text/smtp"/>
                        </map:pipeline>
                </map:protocol>
                <map:protocol name="pop3">
                        <!-- maps content of the mail header "Cocoon-Pipeline"
                             directly to all pipelines -->
                        <map:map type="header" from="Cocoon-Pipeline: **" to="**"/>
                        <!-- The components of this pipeline are executed before
                             the sitemap pipeline components -->
                        <map:pipeline type="deliver">
                                <map:produce type="pop3" output-format="/text/pop[3]"/>
                                <map:convert type="pop3/pop2any"
                                        input-format="/text/pop[3]"
                                        output-format="any"/>
                        </map:pipeline>
                </map:protocol>
                <map:protocol name="ftp">
                        <!-- maps the upload of a file under
                             /home/ftp-user/upload/ to the pipelines starting
                             with "upload/" -->
                        <map:map type="put" from="/home/ftp-user/upload/**"
                                to="upload/**"/>
                        <!-- maps the download of a file under /home/ftp-user/
                             directly to all pipelines -->
                        <map:map type="get" from="/home/ftp-user/**" to="**"/>
                        <!-- The components of this pipeline are executed before
                             the sitemap pipeline components -->
                        <map:pipeline type="put">
                                <map:produce type="ftp-put"
                                        output-format="/text/ftp/put"/>
                        </map:pipeline>
                        <!-- The components of this pipeline are executed after
                             the sitemap pipeline components -->
                        <map:pipeline type="get">
                                <map:consume type="ftp-get"
                                        input-format="/text/ftp/get"/>
                        </map:pipeline>
                </map:protocol>
        </map:mappings>

The only thing I don't like here is the use of <map:map/>, because I'm sure 
that this will cause misunderstandings. I'd suggest using another namespace 
prefix.
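
To illustrate what a <map:map/> rule could do, here is a minimal Python sketch 
(not Cocoon code). It assumes that "**" in the 'from' attribute matches any 
text and that the matched text replaces "**" in the 'to' attribute. The 
function name map_request is made up; the real matching rules would of course 
depend on the protocol handler.

```python
import re
from typing import Optional


def map_request(from_pattern: str, to_pattern: str, value: str) -> Optional[str]:
    """Map a request value (URI, header line, file path) to a pipeline name.

    Returns None if the value does not match the 'from' pattern.
    """
    prefix, wildcard, suffix = from_pattern.partition("**")
    if not wildcard:
        # No wildcard at all: only an exact match maps to the target.
        return to_pattern if value == from_pattern else None
    regex = re.escape(prefix) + "(.*)" + re.escape(suffix)
    match = re.fullmatch(regex, value)
    if match is None:
        return None
    # Put the text matched by "**" into the target pattern.
    return to_pattern.replace("**", match.group(1), 1)


# The FTP "put" rule from the example above.
pipeline = map_request("/home/ftp-user/upload/**", "upload/**",
                       "/home/ftp-user/upload/news/today.html")
```

For the FTP upload above this yields the pipeline "upload/news/today.html", 
while a path outside /home/ftp-user/upload/ is not mapped at all.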


9 Pipelines as Pipeline Components
==================================

Based on the assumptions made so far, we can define rules for pipelines which 
implicitly turn them into pipeline components themselves:


9.1 Producer Pipelines
----------------------

Pipelines which produce data and don't consume anything are called producer 
pipelines. The following example produces data in the format "/text/xml", but 
does not consume any data, so it must have a producer component at the 
beginning of the pipeline but no consumer at the end.

Example:
        <map:pipeline output-format="/text/xml">
                <map:match pattern="producer-pipeline">
                        <map:produce ... />
                        ...
                </map:match>
        </map:pipeline>

You can use this pipeline as a producer in other pipelines by writing:
        <map:produce ref="cocoon:/producer-pipeline"/>


9.2 Consumer Pipelines
----------------------

Pipelines which consume data and don't produce data are called consumer 
pipelines. The following example consumes data in the format "/text/xml", but 
does not produce any data, so it must have a consumer component at the end of 
the pipeline but no producer at the beginning.

Example:
        <map:pipeline input-format="/text/xml">
                <map:match pattern="consumer-pipeline">
                        ...
                        <map:consume ... />
                </map:match>
        </map:pipeline>

You can use this pipeline as a consumer in other pipelines by writing:
        <map:consume ref="cocoon:/consumer-pipeline"/>


9.3 Converter Pipelines
-----------------------

Pipelines which consume a certain data format and produce a certain 
(different) data format are called converter pipelines. The following example 
converts data from the format "/text/xml/xhtml" to "/text/sgml/html", so it 
neither has a producer at the beginning of the pipeline nor a consumer at the 
end of the pipeline.

Example:
        <map:pipeline input-format="/text/xml/xhtml" output-format="/text/sgml/html">
                <map:match pattern="converter-pipeline">
                        ...
                </map:match>
        </map:pipeline>

You can use this pipeline as a converter in other pipelines by writing:
        <map:convert ref="cocoon:/converter-pipeline"/>


9.4 Filter Pipelines
--------------------

Pipelines which consume a certain data format and produce the same (or a 
compatible) data format are called filter pipelines. The following example 
filters data with the format "/text/xml", so it neither has a producer at the 
beginning of the pipeline nor a consumer at the end of the pipeline.

Example:
        <map:pipeline input-format="/text/xml" output-format="/text/xml">
                <map:match pattern="filter-pipeline">
                        ...
                </map:match>
        </map:pipeline>

You can use this pipeline as a filter in other pipelines by writing:
        <map:filter ref="cocoon:/filter-pipeline"/>


9.5 Action Pipelines
--------------------

Pipelines which neither consume nor produce data are called action pipelines. 
They can produce data internally through a producer and consume it again with 
a consumer, but no data from outside of the pipeline is flowing in or out.

Example:
        <map:pipeline>
                <map:match pattern="action-pipeline">
                        <map:produce ... />
                        ...
                        <map:consume ... />
                </map:match>
        </map:pipeline>

You can use this pipeline as an action in other pipelines by writing:
        <map:act ref="cocoon:/action-pipeline"/>
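
The rules from sections 9.1 to 9.5 boil down to looking at the input-format 
and output-format attributes of a pipeline. Here is a minimal Python sketch of 
that decision (not Cocoon code). As a simplifying assumption, "compatible" 
means here that the output format is the input format itself or a format 
below it in the format path; the function names are made up.

```python
from typing import Optional


def is_compatible(input_format: str, output_format: str) -> bool:
    """Assumption: same format, or output is a sub-format of the input."""
    return (output_format == input_format
            or output_format.startswith(input_format + "/"))


def classify_pipeline(input_format: Optional[str],
                      output_format: Optional[str]) -> str:
    """Tell which kind of pipeline component a pipeline can be used as."""
    if input_format is None and output_format is None:
        return "action"      # nothing flows in or out
    if input_format is None:
        return "producer"    # only produces data
    if output_format is None:
        return "consumer"    # only consumes data
    if is_compatible(input_format, output_format):
        return "filter"      # data format stays the same (or compatible)
    return "converter"       # data format changes


kind = classify_pipeline("/text/xml/xhtml", "/text/sgml/html")
```

The pipeline from section 9.3 is classified as a "converter", the one from 
section 9.4 as a "filter".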


10 Configuration Files
======================

With so many new sitemap declarations it is hard to keep the sitemap 
manageable. To solve this problem I'd suggest splitting it up into different 
files, each of which deals with a separate concern.


10.1 cocoon.xconf
-----------------

This configuration file has the same functionality as in current Cocoon 
versions. Its main purpose is to register and configure Avalon components.


10.2 components.xconf
---------------------

In this file all the pipeline components are defined (see section "6 Pipeline 
Components").
It uses its own namespace (e.g. http://apache.org/cocoon/component/1.0).


10.3 protocols.xconf
--------------------

In this file all the protocols are defined (see section "8 Protocol Handler").
It uses its own namespace (e.g. http://apache.org/cocoon/protocol/1.0).


10.4 bindings.xconf
-------------------

In this file all the protocol port bindings are defined (see section "8 
Protocol Handler").
It uses its own namespace (e.g. http://apache.org/cocoon/binding/1.0).


10.5 protocol-mappings.xconf
----------------------------

In this file the mappings to sitemap pipelines are defined (see section "8 
Protocol Handler").
It uses its own namespace (e.g. http://apache.org/cocoon/mapping/1.0).


10.6 data-formats.xconf
-----------------------

In this file all the data formats are defined (see section "5 Data Formats").
It uses its own namespace (e.g. http://apache.org/cocoon/format/1.0).


10.7 sitemap.xmap
-----------------

This file holds all the pipelines (see section "6 Pipeline Components").
It uses its own namespace (e.g. http://apache.org/cocoon/sitemap/3.0).

To be more flexible, the content of the configuration files can also be placed 
inside the sitemap. This makes things easier for small sitemaps. For large 
sitemaps I'd suggest using references to those files instead, to keep the 
configuration manageable. This way you can even share the same files between 
different sitemaps just by referencing the same config file.

Here's a rough sketch of the structure from sitemap.xmap:

<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/3.0">
        ...
        <map:components> <!-- optional: ref="components.xconf" -->
                <map:protocols ref="protocols.xconf"/>
                
                <map:bindings ref="bindings.xconf" />
                
                <map:formats ref="formats.xconf" />
                
                <map:mappings ref="mappings.xconf" />
                
                <map:producers ... />
                
                <map:consumers ... />
                
                <map:converters ... />
                
                <map:filters ... />
                
                <map:exceptions ... />
        </map:components>
        ...
</map:sitemap>

All sub elements of <map:components> can place their configuration directly as 
sub elements inside the sitemap or can be swapped out to external files which 
are referenced by the ref="..." attribute.

I'm still unsure if we should really place everything below <map:components>, 
since there are some configurations involved which don't specify new 
components (e.g. bindings and mappings). Perhaps we can find a more 
meaningful element name or split it up into different sections. Let's see 
what some discussion on this topic will bring us ...


10.8 Config File Hierarchy
--------------------------

Here's an overview of the hierarchy of the config files as it looks for now:

cocoon.xconf (references the main sitemap.xmap with the treeprocessor 
declaration)
|
+-sitemap.xmap
  |
  +-components.xconf
    |
    +-protocols.xconf
    |
    +-bindings.xconf
    |
    +-mappings.xconf
    |
    +-formats.xconf
    |
    +-producers.xconf
    |
    +-consumers.xconf
    |
    +-converters.xconf
    |
    +-filters.xconf
    |
    +-exceptions.xconf
  

11 Converting old sitemaps to new sitemaps
==========================================

Some of you might be wondering whether this new concept is flexible enough to 
provide at least the same functionality as Cocoon does today. I'll give you 
some examples of how old pipeline components can be translated to the new 
pipeline components.

The most important thing to remember is that all of the old pipeline 
components (except the reader) work with the data format "/text/xml" or 
derived formats. So theoretically the implementation of the old components 
does not differ very much from that of their new counterparts.


11.1 Generators
---------------

This is simply a producer which takes no input data and produces the 
output-format "/text/xml".

Here's an example:
        <map:generate type="file" src="doc/{1}.xml"/>
Maps to:
        <map:produce type="uri" ref="doc/{1}.xml" output-format="/text/xml"/>

You can also think of an XMLProducer, where the output-format is implicitly 
set to "/text/xml", so you don't have to provide it every time you use the 
producer. Of course this applies to all other components too.


11.2 Transformers
-----------------

They simply consume XML and produce XML, so they are actually filters.

Here's an example:
        <map:transform type="xslt" src="stylesheets/news2xhtml.xsl"/>
Maps to:
        <map:filter type="xml/xslt" ref="stylesheets/news2xhtml.xsl"/>

Since filters don't change the data format, you don't need to specify the 
input- and output-format, because they are either specified implicitly in the 
component definition, or default to the input/output-format of the 
surrounding pipeline components.


11.3 Readers
------------

They simply read a file and deliver it, so they are actually producers.

Here's an example:
        <map:read src="welcome/cocoon.gif" mime-type="image/gif"/>
Maps to:
        <map:produce ref="welcome/cocoon.gif" output-format="/binary/gif"/>

NOTE 1:
The MIME type is implicitly contained in every data format. So the 
output-format "/binary/gif" results in the MIME type "image/gif".

NOTE 2:
There's one difference between the reader and the producer concerning the 
delivering of resources. The reader actually delivered them after reading, 
which is not the case with the producer. This is actually done automatically 
by the protocol handler which appends certain (configurable) pipeline 
components to consumer pipelines (see section "8 Protocol Handler").


11.4 Serializers
----------------

They definitely convert XML to another format and therefore behave like 
converters.

Here's an example:
        <map:serialize type="svg2png" mime-type="image/png"/>
Maps to:
        <map:convert type="svg2png" input-format="/text/xml/svg"
                output-format="/binary/png"/>

The other tasks of a serializer, like preparing the response of the pipeline 
(HTTP headers, mime-type, ...), are done by the respective protocol handlers, 
which for example append the following components to the end of the consumer 
pipeline (see section "8 Protocol Handler"):
        <map:convert type="http/any2response" input-format="any"
                output-format="/text/http/response"/>
        <map:consume type="http/response" input-format="/text/http/response"/>


11.5 Selectors
--------------

The functionality of <map:select>...</map:select> is fully supported by the 
more flexible <map:branch>...</map:branch> concept and can be easily 
converted.

Here's an example:
        <map:select type="browser">
                <map:when test="wap">
                        ...
                </map:when>
                <map:when test="netscape">
                        ...
                </map:when>
                <map:otherwise>
                        ...
                </map:otherwise>
        </map:select>
Maps to:
        <map:branch type="browser">
                <map:case match="wap">
                        ...
                </map:case>
                <map:case match="netscape">
                        ...
                </map:case>
                <map:default>
                        ...
                </map:default>
        </map:branch>


12 Use Cases
============

This section gives you some examples which show you the possibilities of this 
proposed architecture.

NOTE:
To make the examples easier to understand, I've included the 
input/output-format attributes on some of the pipeline components. Keep in 
mind that you don't need to specify them every time. Usually you'll only 
define them once per component in the components section, or they are 
implicitly set by surrounding components or the pipeline itself.


12.1 File Upload
----------------

This example uploads an HTML news file, extracts XML content and stores it in 
an XML database.

        <map:pipeline input-format="/text/sgml/html">
                <map:match pattern="upload/news/*.html">
                        <map:convert type="html2xhtml"
                                output-format="/text/xml/xhtml"/>
                        <map:filter type="xml/xslt" ref="xhtml2news.xsl"
                                output-format="/text/xml/newsml"/>
                        <map:consume type="uri"
                            ref="xmldb:xindice://localhost:4080/db/news/{1}.xml"/>
                </map:match>
        </map:pipeline>


12.2 Combining several pipelines
--------------------------------

In this example we are combining 3 pipelines:

1. This one generates data in a certain format:

        <map:pipeline output-format="/text/sgml/html">
                <map:match pattern="news/*.html">
                        <map:produce type="uri" ref="documents/news/{1}.xml"
                                output-format="/text/xml/newsml"/>
                        <map:filter type="xml/xslt"
                                ref="stylesheets/news2xhtml.xsl"
                                output-format="/text/xml/xhtml"/>
                        <map:convert type="xhtml2html"
                                output-format="/text/sgml/html"/>
                </map:match>
        </map:pipeline>

2. This one consumes data in a certain format:

        <map:pipeline input-format="/text/sgml/html">
                <map:match pattern="upload/news/*.html">
                        <map:convert type="html2xhtml"
                                output-format="/text/xml/xhtml"/>
                        <map:filter type="xml/xslt" ref="xhtml2news.xsl"
                                output-format="/text/xml/newsml"/>
                        <map:consume type="uri"
                            ref="xmldb:xindice://localhost:4080/db/news/{1}.xml"/>
                </map:match>
        </map:pipeline>

3. This one references both pipelines and combines them into a new one:

        <map:pipeline>
                <map:match pattern="replicate/news/*.html">
                        <map:produce type="uri" ref="cocoon:/news/{1}.html"/>
                        <map:consume type="uri" ref="cocoon:/upload/news/{1}.html"/>
                </map:match>
        </map:pipeline>


12.3 Unix Pipes
---------------

This is a universal filter pipeline, which counts the number of lines of text 
data flowing through the pipeline. The optional argument can be used to grep 
each line.

        <map:pipeline input-format="/text" output-format="/text">
                <map:match pattern="filter/count/lines/**">
                        <!-- unix grep (regular expression filter) -->
                        <map:filter type="text/grep">
                                <map:parameter name="pattern" value="{1}"/>
                        </map:filter>
                        <map:filter type="text/wc"> <!-- unix wc (word count) -->
                                <map:parameter name="mode" value="linecount"/>
                        </map:filter>
                </map:match>
        </map:pipeline>

This pipeline uses the filter from above to analyze Apache's access_log for 
certain requests:

        <map:pipeline output-format="/text">
                <map:match pattern="statistics/forms/*">
                        <!-- like unix cat (list file contents) -->
                        <map:produce ref="file:///var/log/httpd/access_log"/>
                        <!-- unix grep (regular expression filter) -->
                        <map:filter
                                ref="cocoon:/filter/count/lines/forms/login.html"/>
                        <!-- Result is the number of requests to the file
                             /forms/login.html in the Apache access log -->
                </map:match>
        </map:pipeline>
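
For those who prefer to see what the two filters do, here is a rough Python 
equivalent of "grep pattern | wc -l" (not Cocoon code). The function names and 
the sample log lines are made up for this illustration.

```python
import re
from typing import Iterable, Iterator


def grep(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Like unix grep: keep only the lines matching the regular expression."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))


def count_lines(lines: Iterable[str]) -> int:
    """Like unix 'wc -l': count the lines flowing through."""
    return sum(1 for _ in lines)


# Made-up sample lines standing in for Apache's access_log.
log = [
    '10.0.0.1 - - "GET /forms/login.html HTTP/1.0" 200',
    '10.0.0.2 - - "GET /index.html HTTP/1.0" 200',
    '10.0.0.3 - - "POST /forms/login.html HTTP/1.0" 302',
]

count = count_lines(grep(log, "forms/login.html"))
```

For the three sample lines the count is 2, since two of them request 
/forms/login.html.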


12.4 Image Processing
---------------------

This pipeline takes several image formats and converts them to the abstract 
image format, which can be used by format-independent image filters:
        
        <!-- Since we don't know the concrete image format for the input we
             have to use 'any' -->
        <map:pipeline input-format="any" output-format="/abstract/image">
                <map:match pattern="convert/to-image/*.*">
                        <map:branch test="{2}">
                                <map:case match="jpg|jpeg|JPG|JPEG">
                                        <map:convert type="jpg2image"
                                                input-format="/binary/jpeg"/>
                                </map:case>
                                <map:case match="gif|GIF">
                                        <map:convert type="gif2image"
                                                input-format="/binary/gif"/>
                                </map:case>
                                <map:default>
                                        <map:throw type="input-format"
                                            message="{2} is not a supported input image type."/>
                                </map:default>
                        </map:branch>
                </map:match>
        </map:pipeline>

This pipeline takes the abstract image format and converts it to certain 
specific image formats:
        
        <!-- Since we don't know the concrete image format for the output we
             have to use 'any' -->
        <map:pipeline input-format="/abstract/image" output-format="any">
                <map:match pattern="convert/from-image/*.*">
                        <map:branch test="{2}">
                                <map:case match="jpg|jpeg|JPG|JPEG">
                                        <map:convert type="image2jpg"
                                                output-format="/binary/jpeg"/>
                                </map:case>
                                <map:case match="gif|GIF">
                                        <map:convert type="image2gif"
                                                output-format="/binary/gif"/>
                                </map:case>
                                <map:default>
                                        <map:throw type="output-format"
                                            message="{2} is not a supported output image type."/>
                                </map:default>
                        </map:branch>
                </map:match>
        </map:pipeline>

This is an example for an abstract image filter pipeline, which is independent 
from the specific image data format. It prepares an image for character 
recognition:
        
        <map:pipeline input-format="/abstract/image" output-format="/abstract/image">
                <map:match pattern="filter/image/prepare-ocr">
                        <map:filter type="image/histogram">
                                <map:parameter name="equalize" value="full"/>
                        </map:filter>
                        <map:filter type="image/2greyscale" />
                        <map:filter type="image/2bw">
                                <map:parameter name="method" value="threshold"/>
                                <map:parameter name="level" value="0.5"/>
                        </map:filter>
                </map:match>
        </map:pipeline>

This pipeline invokes the pipelines from above and shows how these pipelines 
can be reused as pipeline components themselves:
        
        <!-- Since we don't know the image format we have to use 'any' as
             input and output format -->
        <map:pipeline input-format="any" output-format="any">
                <map:match pattern="filter/any-image/prepare-ocr/*">
                        <map:convert ref="cocoon:/convert/to-image/{1}"/>
                        <map:filter ref="cocoon:/filter/image/prepare-ocr"/>
                        <map:convert ref="cocoon:/convert/from-image/{1}"/>
                        <!--
                                Since the output format of the converter above is a
                                certain image data format, it overrides the default
                                for this pipeline (any).
                        -->
                </map:match>
        </map:pipeline>


12.5 PDF decompiling
--------------------

This pipeline decompiles a PDF document into an intermediate XML format (see 
[4]), transforms it to a custom XML format (extracting data) and stores it in 
an XML database. Depending on whether this succeeds, the client gets 
redirected to different response pages.

        <map:pipeline input-format="/binary/pdf">
                <map:match pattern="import/*.pdf">
                        <map:convert type="pdf2xml" output-format="/text/xml/pdf-xml"/>
                        <!-- Here we have an intermediate XML stream -->
                        <map:filter type="xml/xslt"
                                    ref="stylesheets/pdfxml2docxml.xsl"/>
                        <!-- Here we have an XML stream with the extracted
                             information -->
                        <map:consume type="uri"
                                     dest="xmldb:xindice://localhost:4080/db/news/{1}.xml"/>
                        <map:branch type="consume/status">
                                <map:when test="success">
                                        <map:redirect-to uri="success-page"/>
                                </map:when>
                                <map:default>
                                        <map:redirect-to uri="error-page"/>
                                </map:default>
                        </map:branch>
                </map:match>
        </map:pipeline>


12.6 Music Processing
---------------------

This pipeline generates a printable music score from a MIDI file (without 
XML):
        
        <map:pipeline input-format="/binary/midi" output-format="/binary/pdf">
                <map:match pattern="convert/midi2pdf/*">
                        <map:convert type="midi2musixtex"
                                     output-format="/text/tex/musixtex"/>
                        <map:convert type="tex2dvi" input-format="/text/tex"
                                     output-format="/binary/dvi"/>
                        <map:convert type="dvi2pdf" output-format="/binary/pdf"/>
                </map:match>
        </map:pipeline>

The next pipeline uses MidiXML, an XML format which is part of MusicXML and 
can be used to represent music data (see [5] and [6]). It converts the binary 
MIDI format to MidiXML, selects the keyboard channel, transposes it up by 5 
pitches and converts it back to the MIDI format.
        
        <map:pipeline input-format="/binary/midi" output-format="/binary/midi">
                <map:match pattern="filter/custom/*">
                        <map:convert type="midi2xml"
                                     output-format="/text/xml/midixml"/>
                        <map:filter type="midixml/select-channel">
                                <map:parameter name="name" value="keyboard"/>
                        </map:filter>
                        <map:filter type="midixml/transpose">
                                <map:parameter name="value" value="+5"/>
                        </map:filter>
                        <map:convert type="xml2midi" output-format="/binary/midi"/>
                </map:match>
        </map:pipeline>


13 Conclusion
=============

You might ask: why should we change so much of Cocoon?

First, I think the new components are much more flexible and at least as easy 
to understand as the old ones: if you want to produce a data stream you use a 
producer, if you want to consume it you use a consumer, if you want to 
convert it you use a converter, and if you want to filter it you use a filter. 
To control the data flow you can use the <map:branch/> component.

A possible migration path could be to support both sitemap versions, since the 
pipeline components either have different names or provide the same 
functionality. So a new sitemap implementation could be backward compatible 
with older sitemap versions. This would make the transition as easy as 
possible for the user.

Additionally, it might be possible to provide a migration script (e.g. via 
XSL) which reads an old sitemap and converts it to the new format. Since 
everything from the old sitemap can be expressed in the new sitemap and can 
be formally translated (see section "11 Converting old sitemaps to new 
sitemaps"), this should not be a big issue.
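
To give a rough idea of how small such a migration script could be, here is a 
sketch in Python rather than XSL. It only renames elements in the sitemap 
namespace. The renaming table is an assumption for illustration: the pairs 
transform/filter, select/branch and otherwise/default follow the examples in 
this proposal, while generate/produce is a guessed element name. The real 
rules, including attribute changes, would have to come from section 11.

```python
import xml.etree.ElementTree as ET

MAP_NS = "http://apache.org/cocoon/sitemap/1.0"

# Illustrative old-name -> new-name table (an assumption, see lead-in).
RENAMES = {
    "generate": "produce",
    "transform": "filter",
    "select": "branch",
    "otherwise": "default",
}

def migrate_sitemap(old_xml):
    """Return the sitemap with old element names replaced by new ones.

    Attributes and elements outside the sitemap namespace are left untouched.
    """
    ET.register_namespace("map", MAP_NS)
    root = ET.fromstring(old_xml)
    prefix = "{" + MAP_NS + "}"
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith(prefix):
            local = elem.tag[len(prefix):]
            elem.tag = prefix + RENAMES.get(local, local)
    return ET.tostring(root, encoding="unicode")
```

Calling migrate_sitemap() on an old sitemap string returns the renamed 
document as a string. Elements that are not in the table, such as map:match, 
pass through unchanged.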


14 TODO
=======

1. Which concrete role do the data handlers play?
   Do we need an input and output data handler or just one?
   Do we need data handlers at all?
2. Define and manage a list of data formats (central internet repository?).
   Perhaps it's possible to coordinate the work on MIME types and data 
   formats.
3. The number of components may explode very quickly.
   Therefore we should take care to design good package structures and 
   namespaces to overcome this problem.
4. The protocol handlers have to be worked out more precisely.
5. The parameters of a data format actually reflect its metadata.
   Support for RDF/OWL (see [7] and [8]) would definitely make sense to get 
   one step closer to the semantic web.


15 References
=============

[1] [RT] Input Pipelines (long)
    Thread on cocoon-dev initiated by Daniel Fagerstrom on Dec 17th 2002.
    http://www.mail-archive.com/cocoon-dev@xml.apache.org/msg25503.html
[2] MIME Media Types
    http://www.iana.org/assignments/media-types/
[3] XML Schema Datatypes
    http://www.w3.org/TR/xmlschema-2/
[4] JPedal
    Open Source library written in Java which can extract data from PDF 
    documents.
    http://www.jpedal.org/
[5] MusicXML
    http://www.recordare.com/xml.html
[6] XEMO
    Project XEMO is an open source, modular software environment for the 
    development and delivery of interactive music, audio and sound 
    applications. It is written in Java and supports MusicXML.
    http://www.xemo.org/
[7] RDF - Resource Description Framework
    http://www.w3.org/RDF/
[8] OWL - Web Ontology Language based on RDF
    http://www.w3.org/TR/2002/WD-owl-ref-20021112/


16 Appendix
===========

This new architecture brings to data processing in general the flexibility 
and integration that so far has only been possible for XML processing. Here 
I'd like to give you an idea of some further data formats and components, and 
I'm sure you can think of even more. Remember: Only your mind is the limit ;-)


16.1 Data Formats
-----------------

In this section you can find a proposed list of data formats, which gives an 
overview of how they could be structured.

No data format:
 - none (used, if nothing is produced/consumed)

Super data format:
 - any (base data format for abstract, binary and text)

Abstract data formats (used by components which are independent of the 
concrete file format):
 - /abstract/image
 - /abstract/music
 - /abstract/sound
 - /abstract/vector
 - /abstract/vector/3d
 - /abstract/video

Binary data formats:
 - /binary
 - /binary/au
 - /binary/avi
 - /binary/avi/indeo
 - /binary/avi/indeo[4.1]
 - /binary/avi/indeo[5.0]
 - /binary/avi/divx
 - /binary/bmp
 - /binary/bmp/os2
 - /binary/bmp/windows
 - /binary/elf
 - /binary/elf/executable
 - /binary/elf/shared
 - /binary/gif
 - /binary/gif[87a]
 - /binary/gif[89a]
 - /binary/mp3
 - /binary/mpeg
 - /binary/ogg
 - /binary/ole
 - /binary/ole/msexcel
 - /binary/ole/mspowerpoint
 - /binary/ole/msword
 - /binary/tiff
 - /binary/tiff/jpeg
 - /binary/tiff/lzw
 - /binary/tiff/packbits
 - /binary/tiff/zip
 - /binary/wav
 - /binary/...

Text data formats:
 - /text
 - /text/http
 - /text/http/request
 - /text/http/request[0.9]
 - /text/http/request[1.0]
 - /text/http/request[1.1]
 - /text/http/response
 - /text/http/response[0.9]
 - /text/http/response[1.0]
 - /text/http/response[1.1]
 - /text/sgml
 - /text/sgml/docbook
 - /text/sgml/docbook/simple
 - /text/sgml/html
 - /text/sgml/html[3.0]
 - /text/sgml/html[4.0]
 - /text/sgml/html[4.1]
 - /text/sgml/html/frameset
 - /text/sgml/html/strict
 - /text/sgml/html/transitional
 - /text/tex
 - /text/tex/latex
 - /text/tex/musixtex
 - /text/xml
 - /text/xml/docbook
 - /text/xml/docbook/simple
 - /text/xml/rdf
 - /text/xml/rdf/rss
 - /text/xml/svg
 - /text/xml/xhtml
 - /text/xml/xhtml[1.0]
 - /text/xml/xhtml[1.1]
 - /text/...
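
The naming scheme of the list above can be made concrete with a small parser. 
This is a sketch under two assumptions that the list suggests but does not 
state: a longer path is a specialization of each of its prefixes, and a 
bracketed suffix such as [89a] is a version qualifier of the last path 
segment. The names DataFormat, parse_format and is_specialization are 
invented for this example. Relations declared explicitly between formats 
(see section "5.2 Inheritance") are not covered.

```python
import re
from typing import NamedTuple, Optional, Tuple

class DataFormat(NamedTuple):
    segments: Tuple[str, ...]
    version: Optional[str]

# "/seg/seg..." optionally followed by "[version]".
_FORMAT_RE = re.compile(
    r"^(?P<path>(?:/[^/\[\]]+)+)(?:\[(?P<version>[^\[\]]+)\])?$")

def parse_format(name):
    """Split a data format name into path segments and optional version."""
    if name in ("any", "none"):
        return DataFormat((name,), None)
    match = _FORMAT_RE.match(name)
    if match is None:
        raise ValueError("not a valid data format name: %r" % name)
    segments = tuple(match.group("path").split("/")[1:])
    return DataFormat(segments, match.group("version"))

def is_specialization(child, parent):
    """True if `child` could be passed where `parent` is expected."""
    if parent == "any":
        return child != "none"  # 'any' is the base of everything but 'none'
    c, p = parse_format(child), parse_format(parent)
    if c.segments[:len(p.segments)] != p.segments:
        return False
    if p.version is None:
        return True  # an unversioned parent accepts every version
    return c.segments == p.segments and c.version == p.version
```

With these rules, /binary/gif[89a] is accepted where /binary/gif or /binary is 
expected, but not where /binary/gif[87a] is expected.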


16.2 Pipeline Components
------------------------

Image Processing (bringing Photoshop to Cocoon ;-):
 - BlurFilter
 - AquarellFilter
 - NoiseFilter
 - SharpenFilter
 - ExtrudeFilter
 - ReliefFilter
 - HistogramFilter
 - ...

Sound Processing (bringing Arts/SOX/Cubase to Cocoon ;-):
 - EqualizerFilter
 - DistortionFilter
 - ChorusFilter
 - DelayFilter
 - FlangerFilter
 - VolumeFilter
 - PitchShifterFilter
 - MixerAggregator
 - SequenceAggregator
 - MP32SoundConverter
 - Sound2MP3Converter
 - Ogg2SoundConverter
 - Sound2OggConverter
 - ...

Video Processing (bringing Premiere to Cocoon ;-):
 - BlendingAggregator
 - MixerAggregator
 - EffectsFilter
 - AVI2VideoConverter
 - Video2AVIConverter
 - MPG2VideoConverter
 - Video2MPGConverter
 - ...

For video processing it would be nice to be able to process the audio part of 
the video with the sound processing components and the image part of the video 
with the image processing components (maximum component reuse!). This requires 
that the abstract video data format is composed of the abstract sound format 
and a sequence of abstract image formats. That can be achieved by extending 
both the /abstract/image and /abstract/sound formats in the declaration of 
/abstract/video (see section "5 Data Formats").
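
A format that extends two others amounts to multiple inheritance in the format 
registry. The Python sketch below shows one way a lookup could work. The 
dictionary layout and the function name extends are assumptions for this 
example, not the declaration syntax from section 5.

```python
# Declared parents per format; /abstract/video extends both image and sound.
DECLARED_PARENTS = {
    "/abstract/image": [],
    "/abstract/sound": [],
    "/abstract/video": ["/abstract/image", "/abstract/sound"],
}

def extends(fmt, ancestor, parents=DECLARED_PARENTS):
    """True if `fmt` is `ancestor` or reaches it via declared parents."""
    seen = set()
    stack = [fmt]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue  # guards against accidental cycles in the declarations
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False
```

A sound filter that accepts /abstract/sound would then also accept 
/abstract/video, since extends("/abstract/video", "/abstract/sound") holds, 
while the reverse direction does not.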


Vector Graphics Processing (bringing Corel Draw/Illustrator to Cocoon ;-):
 - BooleanFilter (Union, intersection, ...)
 - TranslationFilter (Move, rotate, resize, ...)
 - VectorAggregator (Aggregate different vector graphics)
 - SVG2VectorConverter
 - Vector2SVGConverter
 - WMF2VectorConverter
 - Vector2WMFConverter
 - CDR2VectorConverter
 - Vector2CDRConverter
 - ...

Music Processing (bringing Arts/Cubase/Capella/Sibelius/Finale to Cocoon ;-):
 - PitchShifterFilter
 - Midi2MusicConverter
 - Music2MidiConverter
 - Music2ImageConverter (render music score for printing)
 - Image2MusicConverter (you know Capella Scan?)
 - Music2SoundConverter (render music to synthesized sound)
 - Sound2MusicConverter (extract music data from sound data)
 - ...

3D Graphics Processing (bringing 3D Studio/POV-Ray to Cocoon ;-):
 - TranslationFilter (Move, rotate, resize, ...)
 - 3DAggregator (Aggregate different 3D graphics)
 - ParticleFilter
 - ExplosionFilter
 - 3DS23DConverter
 - 3D23DSConverter
 - DXF23DConverter
 - 3D2DXFConverter
 - POV23DConverter
 - 3D2POVConverter
 - 3D2ImageConverter (render an image)
 - 3D2VideoConverter (render an animated scene)
 - ...


