Hi Cocooners! Sorry for this (very) long proposal below, but I think it's definitely worth a read. If not, at least you can give me some feedback about your opinion ;-)
Bye,
  Andreas Hochsteger

1 Contents
==========

1 Contents
2 Prologue
3 Introduction
4 Pipeline Types
5 Data Formats
  5.1 Data Format Definition
  5.2 Inheritance
  5.3 A word about MIME Types
  5.4 Data Handlers
  5.5 Data Format Determination
6 Pipeline Components
  6.1 Producers
  6.2 Consumers
  6.3 Converters
  6.4 Filters
  6.5 Aggregators
  6.6 Actions
  6.7 Redirectors
  6.8 Matchers
  6.9 Branches
  6.10 Exceptions
7 Protocol Independence
  7.1 Web Services
  7.2 Mail Server
  7.3 Mailing List Manager
  7.4 What else?
8 Protocol Handler
  8.1 Component Definition
  8.2 Protocol Binding
  8.3 The Handler's Task
  8.4 Mapping to Pipelines
9 Pipelines as Pipeline Components
  9.1 Producer Pipelines
  9.2 Consumer Pipelines
  9.3 Converter Pipelines
  9.4 Filter Pipelines
  9.5 Action Pipelines
10 Configuration Files
  10.1 cocoon.xconf
  10.2 components.xconf
  10.3 protocols.xconf
  10.4 bindings.xconf
  10.5 protocol-mappings.xconf
  10.6 data-formats.xconf
  10.7 sitemap.xmap
  10.8 Config File Hierarchy
11 Converting old sitemaps to new sitemaps
  11.1 Generators
  11.2 Transformers
  11.3 Readers
  11.4 Serializers
  11.5 Selectors
12 Use Cases
  12.1 File Upload
  12.2 Combining several pipelines
  12.3 Unix Pipes
  12.4 Image Processing
  12.5 PDF decompiling
  12.6 Music Processing
13 Conclusion
14 TODO
15 References
16 Appendix
  16.1 Data Formats
  16.2 Pipeline Components

2 Prologue
==========

I wrote most of this proposal, and another unfinished one, while I had to stay in hospital for two weeks at the end of November 2002. Luckily I was armed with my notebook loaded with a CVS snapshot of Cocoon and the great Cocoon book by Matthew Langham and Carsten Ziegeler. So I could finally do something productive ;-)

After returning home I had no time to finish it and submit it to the public. In the meantime some discussion on similar topics came up on the cocoon-dev mailing list (see [1]) and I forced myself to find some time again to work on this proposal and finally publish it on the cocoon-dev mailing list.
Perhaps I'll find some time to convert it to an XML format (e.g. Docbook) and write a converter to publish it on the Cocoon Documentation Wiki, but first let's discuss a bit on the mailing list.

WARNING: I have to say that this proposal is intended for open-minded people only, who aren't afraid to take a look beyond the limits. Anything I'm writing here might be total crap for you, so feel free to ignore it, or send your flames to /dev/null ;-)

If you are still interested, please join this journey to a world where no man has gone before ...

3 Introduction
==============

I like the Cocoon pipeline processing concept very much. I like it so much that I think it is a pity to limit it to XML processing only (although I agree that this is the most interesting application).

I'm sure some of you wanted to be able to build applications the same way Unix shell pipes work. Cocoon was a big step in this direction, but it was only applicable to processing XML data.

There are so many cases where pipeline processing of data (no matter if it is XML, plain text or binary data) is done today, but we are lacking a generic and declarative way to unify these processing steps. Cocoon is best suited for this task through its clean and easy to understand yet powerful pipeline concept.

4 Pipeline Types
================

I tried to design several pipeline variants, but after thinking a while they all were still too limited for the way I wanted them to work. So here's another try, giving some hypotheses first:

1. A pipeline can produce data
2. A pipeline can consume data
3. A pipeline can convert data
4. A pipeline can filter data
5. A pipeline can accept a certain data format as input
6. A pipeline can produce a certain data format as output
7. Pipeline components follow the same hypotheses (1-6)
8. Only pipeline components with compatible data formats can be arranged next to each other

Based on these hypotheses you can construct pipelines which just consume data, just produce data, both consume and produce data, or even neither consume nor produce data (even this can make sense, as you'll see in section "9.5 Action Pipelines").

I think these hypotheses are simple enough to understand and flexible enough to base the rest of this proposal on. So let's try ...

To define a pipeline we need to be able to specify the input and output format. We can do this with the help of these two attributes:

- input-format="..."
- output-format="..."

They additionally specify the default input format for the first processing component and the default output format for the last processing component.

Example:

  <map:pipeline input-format="format1" output-format="format2">
    ...
  </map:pipeline>

This pipeline consumes the data format "format1" and produces the data format "format2". Which data formats are possible and how they are specified is shown in the next section.

5 Data Formats
==============

With "data format" I mean something like XML, plain text, png, mp3, ...

I'm not yet really sure how we should specify data formats, so I'll try to start with some requirements:

1. They should be easy to remember and to specify ;-)
2. It should be possible to create derived data formats (-> inheritance)
3. It should be possible to specify additional information (e.g. MIME type, DTD/Schema for XML, ...)
4. Pipelines which accept a certain data format as input can be fed with derived data formats
5. We should not reinvent standards which are already suited for this task (but I fear that nothing suitable exists yet)

To make it easier for us to begin with the task of defining data formats, let's assume we have three basic data formats called "abstract", "binary" and "text". The format "abstract" will be explained later, but "binary" and "text" should be clear to everyone.

5.1 Data Format Definition
--------------------------

Here's a try to specify a hierarchy of data formats:

  <data:formats>

    <!-- #### Super data format #### -->
    <!--
      The following format is the base for all other formats
      (-> compare to java.lang.Object)
      Although it is called 'any' data format this name is not
      prepended to the derived data formats like this is the case for all
    -->
    <data:format name="any"
                 impl="org.apache.cocoon.data.handler.text.DefaultHandler">
      <data:param-def name="mime-type" default="application/octet-stream"/>
      <data:param-def name="spec" default=""/>
      <!-- URL to the specification of this data format -->
    </data:format>

    <!-- #### Abstract data formats #### -->
    <data:format name="abstract"
                 impl="org.apache.cocoon.data.handler.abstract.DefaultHandler"/>
    <data:format name="image" extends="/abstract"
                 impl="org.apache.cocoon.data.handler.abstract.ImageHandler">
      <data:param-def name="depth" default=""/>
      <data:param-def name="width" default=""/>
      <data:param-def name="height" default=""/>
    </data:format>
    <data:format name="music" extends="/abstract"
                 impl="org.apache.cocoon.data.handler.abstract.MusicHandler">
      <data:param-def name="channels" default=""/>
    </data:format>
    <data:format name="sound" extends="/abstract"
                 impl="org.apache.cocoon.data.handler.abstract.SoundHandler">
      <data:param-def name="samplesize" default=""/>
      <data:param-def name="samplerate" default=""/>
      <data:param-def name="channels" default=""/>
    </data:format>
    <!--
      Multiple inheritance is used for video, which extends image and sound.
      Is there a better way to specify multiple base formats?
    -->
    <data:format name="video" extends="/abstract/image /abstract/sound"
                 impl="org.apache.cocoon.data.handler.abstract.VideoHandler">
      <data:param-def name="framerate" default=""/>
    </data:format>
    <data:format name="vector" extends="/abstract"
                 impl="org.apache.cocoon.data.handler.abstract.VectorHandler">
      <data:param-def name="unit" default=""/>
      <data:param-def name="width" default=""/>
      <data:param-def name="height" default=""/>
    </data:format>
    <data:format name="3d" extends="/abstract/vector"
                 impl="org.apache.cocoon.data.handler.abstract.3DHandler">
      <data:param-def name="depth" default=""/>
    </data:format>

    <!-- #### Binary based data formats #### -->
    <data:format name="binary"
                 impl="org.apache.cocoon.data.handler.binary.DefaultHandler">
      <data:param-def name="endian" default="little"/>
    </data:format>

    <!-- MS OLE based data formats -->
    <data:format name="ole" extends="/binary"
                 impl="org.apache.cocoon.data.handler.binary.ole.DefaultHandler"/>
    <data:format name="msword" extends="/binary/ole"
                 impl="org.apache.cocoon.data.handler.binary.ole.MSWordHandler"/>
    <data:format name="msexcel" extends="/binary/ole"
                 impl="org.apache.cocoon.data.handler.binary.ole.MSExcelHandler"/>

    <!-- Linux ELF based data formats -->
    <data:format name="elf" extends="/binary"
                 impl="org.apache.cocoon.data.handler.binary.elf.DefaultHandler">
      <data:param-def name="architecture" default="x86"/>
    </data:format>
    <data:format name="executable" extends="/binary/elf"
                 impl="org.apache.cocoon.data.handler.binary.elf.ExecutableHandler"/>
    <data:format name="shared" extends="/binary/elf"
                 impl="org.apache.cocoon.data.handler.binary.elf.SharedLibraryHandler"/>

    <!-- #### Text based data formats #### -->
    <data:format name="text"
                 impl="org.apache.cocoon.data.handler.text.DefaultHandler">
      <data:param-def name="encoding" default="UTF-8"/>
      <data:parameter name="mime-type" value="text/plain"/>
    </data:format>
    <data:format name="xml" extends="/text"
                 impl="org.apache.cocoon.data.handler.xml.DefaultHandler">
      <!-- this handler deals with SAX events inside the pipeline -->
      <data:param-def name="schema-type" default="xsd"/>
      <!-- other possible values: dtd, rng, schematron, ... -->
      <data:param-def name="schema" default=""/>
      <data:parameter name="mime-type" value="text/xml"/>
    </data:format>
    <data:format name="xhtml" extends="/text/xml"
                 impl="org.apache.cocoon.data.handler.xml.XHTMLHandler">
      <data:parameter name="mime-type" value="text/html"/>
      <data:parameter name="schema"
                      value="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
    </data:format>

  </data:formats>

It's just a first sketch, but I think you get the idea. Above you can see the super data format 'any' and some abstract, text and binary data formats, which show you how to specify inherited data formats. If no extends="..." attribute is given, the format is automatically derived from the data format 'any'.

References to data formats are made by using a path which specifies the respective data format. This path is built by appending the data format name to the path of the parent data format, separated by a slash. The super data format is an exception to this rule and is just called 'any'. It is not part of the path of derived data formats, to keep them shorter.

It is possible to use relative data format paths too. E.g. if a pipeline consumes /text/xml and a converter generates XHTML from it, the converter can use output-format="xhtml" instead of output-format="/text/xml/xhtml".

The name 'any' is reserved for the super data format and it is not allowed to name derived data formats after it. 'none' is another reserved name, which is used if a pipeline does not consume data (input-format="none") or produce data (output-format="none"). It is the default for all pipelines if it is not overridden by pipelines or their components.
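
The path rules above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the function names format_path and resolve, the DEFINITIONS list and the idea of resolving a relative path against the format currently flowing through the pipeline are assumptions made for the sketch.

```python
# Illustrative sketch only; names and the resolution rule are assumptions.

RESERVED = ("any", "none")

# (format name, path of the format it extends); None means "derived from 'any'".
DEFINITIONS = [
    ("abstract", None),
    ("image", "/abstract"),
    ("binary", None),
    ("ole", "/binary"),
    ("msword", "/binary/ole"),
    ("text", None),
    ("xml", "/text"),
    ("xhtml", "/text/xml"),
]


def format_path(name, extends=None):
    """Append the format name to the parent's path, separated by a slash.

    The super format 'any' never shows up in the path, so a format
    without a parent simply becomes '/<name>'.
    """
    if name in RESERVED:
        raise ValueError(f"'{name}' is a reserved data format name")
    return f"{extends or ''}/{name}"


def resolve(reference, context):
    """Resolve a possibly relative format reference.

    Assumption: a relative reference is interpreted relative to the
    format currently flowing through the pipeline (the context).
    """
    if reference in RESERVED or reference.startswith("/"):
        return reference
    return f"{context}/{reference}"


paths = [format_path(name, parent) for name, parent in DEFINITIONS]

if __name__ == "__main__":
    for path in paths:
        print(path)
    # A converter inside a /text/xml pipeline may just say output-format="xhtml".
    print(resolve("xhtml", "/text/xml"))
```

Keeping 'any' out of the path means the root check is a plain string comparison rather than a special path segment.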

The examples from above can be referenced with the following data format strings:

- any
- /abstract/image
- /abstract/music
- /abstract/sound
- /abstract/video
- /abstract/vector
- /abstract/vector/3d
- /binary
- /binary/ole
- /binary/ole/msword
- /binary/ole/msexcel
- /binary/elf
- /binary/elf/executable
- /binary/elf/shared
- /text
- /text/xml
- /text/xml/xhtml

See section "16.1 Data Formats" for more examples.

One enhancement of this scheme might be useful: specification of version numbers or format variants. One way might be to append the version number to the end, separated by a slash, but I think this would mix different concerns. My suggestion would be to append the version information in brackets, as the following shows:

- /text/xml/xhtml[1.0]
- /text/xml/xhtml[1.1]

Instead of:

- /text/xml/xhtml/1.0
- /text/xml/xhtml/1.1

5.2 Inheritance
---------------

A pipeline which consumes a certain data format can be fed with derived data formats too. Take the following pipeline as an example:

  <map:pipeline input-format="/text/xml">
    ...
  </map:pipeline>

This pipeline would consume the data format "/text/xml/xhtml" without problems, but feeding it with the data format "/text" leads to an exception.

5.3 A word about MIME Types
---------------------------

If you ask me why I don't use the standardized MIME types (see [2]) to specify data formats, I can give you the following reasons:

MIME types fulfill the requirements from above only partly. They support just two levels of classification and they are purpose-oriented. The data formats I suggest are, in contrast, content-oriented (/text/xml/svg vs. image/svg+xml). So both serve different purposes.

I know the importance of supporting the MIME type standard, and so the parameter 'mime-type' is part of the super data format 'any' and thus is available for every other data format too.
By specifying a certain data format you always have a MIME type associated; in the worst case the MIME type from the super data format 'any' (application/octet-stream) is used.

5.4 Data Handlers
-----------------

I'm not very sure what the data handlers actually do, but I can think of two options: either they define an interface which must be implemented by the pipeline components which operate on a certain data format (do we need two handlers here: input-handler and output-handler?), or they are concrete components which can be used by the pipeline components to consume or produce this data format.

I think some discussion on this topic might not be bad.

5.5 Data Format Determination
-----------------------------

In many cases I've written the input- and output-format along with the pipeline components, but it is also possible to specify them in the <map:components/> section, or implicitly by implementing a certain component interface, and therefore to omit them in the pipeline.

Here's a suggested order of data format determination:

1. Input-/output-format specified directly with a pipeline component

   <map:produce type="uri" ref="docs/file.xml" output-format="/text/xml"/>

2. Input-/output-format specified by the component declaration

   <map:filters>
     <map:filter name="prettyxml" input-format="/text/xml"
                 output-format="/text/xml" ... />
   </map:filters>

3. Output-/input-format specified by the previous or following pipeline component

   <map:produce type="uri" ref="docs/file.xhtml"
                output-format="/text/xml/xhtml"/>
   <!-- input- and output-format="/text/xml/xhtml"
        from previous pipeline component -->
   <map:filter type="prettyxml"/>

4. Input-/output-format specified directly with a pipeline

   <map:pipeline input-format="/text/xml" output-format="/text/xml">
     <map:filter type="prettyxml"/>
     ...
   </map:pipeline>

5. If nothing from above matches then assume "none".
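
As a rough illustration of the order suggested above, the lookup could be written as a simple "first value that is set wins" function. The function name determine_format and its keyword arguments are made up for this sketch; how the four candidate values would actually be collected from the sitemap is left open here, just as it is in the proposal.

```python
# Illustrative sketch only; the function and parameter names are assumptions.

def determine_format(on_component=None, on_declaration=None,
                     from_neighbour=None, on_pipeline=None):
    """Pick a data format following the suggested order of determination.

    1. attribute written directly on the pipeline component
    2. attribute from the component declaration in <map:components/>
    3. format handed over by the previous or following component
    4. attribute written on the enclosing <map:pipeline>
    5. otherwise "none"
    """
    for candidate in (on_component, on_declaration, from_neighbour, on_pipeline):
        if candidate is not None:
            return candidate
    return "none"


if __name__ == "__main__":
    # <map:filter type="prettyxml"/> placed after a producer of /text/xml/xhtml,
    # with no declaration of its own, inside a /text/xml pipeline:
    print(determine_format(from_neighbour="/text/xml/xhtml",
                           on_pipeline="/text/xml"))
```

A declared format on the component always beats what the neighbours or the pipeline say, which matches rules 1 and 2 coming first in the list.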

6 Pipeline Components
=====================

Now that we have a big picture of the pipelines and a flexible way to specify the data formats which flow through them, we can move on to specify the pipeline components.

To allow a fresh and clean design, abandon all known pipeline components like generators, transformers, serializers, ... and what you know about their functionality. I'll use the same names where this makes sense, but keep in mind that we are not only talking about processing XML data, so their functionality may be different.

Currently Cocoon pipeline components all work with XML data. In this proposal the components are meant to process any data format available, and I'm sure you'll agree that great care has to be taken to manage the huge amount of possible pipeline components.

One problem here is the flat specification of component names. As a solution I'd suggest using hierarchical path names to specify component names and grouping related components under the same path.

6.1 Producers
-------------

They simply produce a data stream, possibly by reading data from a data repository. Producers are used if no data is consumed from the pipeline and are usually placed at the beginning of a pipeline.

Component definition:

  <map:producers default="uri">
    ...
    <!--
      The following producer is similar to the old file generator but
      can produce any data format. I renamed 'file' to 'uri' since it
      does not only read files, but any resource which can be expressed
      by a URI whose protocol is known.
    -->
    <map:producer name="uri"
                  impl="org.apache.cocoon.pipeline.producer.URIProducer"
                  output-format="any"/>
    <!-- The next producer might be identical to the old file generator. -->
    <map:producer name="xml/uri"
                  impl="org.apache.cocoon.pipeline.producer.xml.URIProducer"
                  output-format="/text/xml"/>
    ...
  </map:producers>

Usage examples:

  <map:produce type="uri" output-format="/binary/ole/msword"
               ref="docs/{1}.doc"/>
  <map:produce type="xml/uri"
               ref="xmldb:xindice://localhost:4080/db/{1}"/>

6.2 Consumers
-------------

They consume a data stream, possibly by writing it to a data repository. Consumers are used if no data should be produced by the pipeline and are usually placed at the end of a pipeline.

For a typical use of consumers in a web environment, some result has to be sent back to the client. Here I'd suggest using <map:redirect-to/> to redirect to another pipeline (perhaps depending on the result of the consumer -> success/error).

Component definition:

  <map:consumers default="uri">
    ...
    <map:consumer name="uri"
                  impl="org.apache.cocoon.pipeline.consumer.URIConsumer"
                  input-format="any"/>
    <map:consumer name="xml/uri"
                  impl="org.apache.cocoon.pipeline.consumer.xml.URIConsumer"
                  input-format="/text/xml"/>
    <map:consumer name="http/response"
                  impl="org.apache.cocoon.pipeline.consumer.http.ResponseConsumer"
                  input-format="/text/xml"/>
    ...
  </map:consumers>

Usage example (with redirection):

  <map:consume type="xml/uri"
               ref="xmldb:xindice://localhost:4080/db/{1}"/>
  <!-- map:branch is explained below under "Branches" -->
  <map:branch type="status">
    <map:case match="success">
      <map:redirect-to ref="success-page"/>
    </map:case>
    <map:default>
      <map:redirect-to ref="error-page"/>
    </map:default>
  </map:branch>

6.3 Converters
--------------

They convert a data stream from one data format into another one.

Component definition:

  <map:converters default="http/response">
    ...
    <map:converter name="http/response"
                   impl="org.apache.cocoon.pipeline.converter.http.ResponseConverter"
                   input-format="any" output-format="/text/http/response"/>
    <map:converter name="xhtml2html"
                   impl="org.apache.cocoon.pipeline.converter.xml.XHTML2HTMLConverter"
                   input-format="/text/xml/xhtml"
                   output-format="/text/sgml/html"/>
    ...
  </map:converters>

This example converts XHTML to HTML:

  <map:convert type="xhtml2html"/>

This example converts any data format to an HTTP response (without delivering it; this is the task of the consumer "http/response"!):

  <map:convert type="http/response"/>

6.4 Filters
-----------

They modify a data stream while keeping the data format.

Component definition:

  <map:filters default="xml/xslt">
    ...
    <map:filter name="xml/xslt"
                impl="org.apache.cocoon.pipeline.filter.XSLTFilter"
                input-format="/text/xml" output-format="/text/xml"/>
    <!-- unix grep (regular expression filter) -->
    <map:filter name="text/grep"
                impl="org.apache.cocoon.pipeline.filter.text.GrepFilter"
                input-format="/text" output-format="/text"/>
    <!-- unix wc (word count) -->
    <map:filter name="text/wc"
                impl="org.apache.cocoon.pipeline.filter.text.WordCount"
                input-format="/text" output-format="/text"/>
    ...
  </map:filters>

Usage examples:

  <map:filter type="xml/xslt" ref="stylesheets/news2page.xsl"/>
  <map:filter type="xml/xslt" ref="stylesheets/page2xhtml.xsl"
              output-format="/text/xml/xhtml"/>
  <map:filter type="text/grep">
    <map:parameter name="pattern" value="my grep pattern"/>
  </map:filter>
  <map:filter type="text/wc">
    <map:parameter name="mode" value="linecount"/>
  </map:filter>

The second filter might look like a converter to you, but its output format is still compatible with "/text/xml" ("/text/xml/xhtml" is derived from "/text/xml") and thus it can be treated as a filter.

Theoretically you can do the work of a filter by using a converter, but that is often not what people intend to do. Why should they use a converter when they want to filter the data?

Practically, a filter is a special case of a converter where the input- and output-format are equivalent. So it might be possible that a filter with the data format "/text/xml" is just an alias for <map:convert input-format="/text/xml" output-format="/text/xml" .../>, while keeping the sitemap simpler to understand.
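
To make the "filter is a special converter" idea and hypothesis 8 from section 4 a bit more concrete, here is a small Python sketch. It is not how Cocoon works and not a proposed implementation: the Converter class, make_filter, accepts and check_chain are names invented for this illustration, and the prefix-based compatibility test is just one simple reading of the inheritance rule from section 5.2.

```python
# Illustrative sketch only; all names here are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Converter:
    name: str
    input_format: str
    output_format: str


def make_filter(name, data_format):
    """Model a filter as a converter with identical input and output format."""
    return Converter(name, data_format, data_format)


def accepts(accepted, offered):
    """True if a component declared for `accepted` can be fed `offered`.

    A format is compatible with itself and with every format derived
    from it; 'any' accepts everything.
    """
    if accepted == "any":
        return True
    return offered == accepted or offered.startswith(accepted + "/")


def check_chain(components):
    """Return the first incompatible pair of neighbours, or None if all fit."""
    for left, right in zip(components, components[1:]):
        if not accepts(right.input_format, left.output_format):
            return (left.name, right.name)
    return None


xslt = make_filter("xml/xslt", "/text/xml")
grep = make_filter("text/grep", "/text")
xhtml2html = Converter("xhtml2html", "/text/xml/xhtml", "/text/sgml/html")
# Same XSLT component, but with output-format narrowed at the usage site.
xslt_to_xhtml = Converter("xml/xslt", "/text/xml", "/text/xml/xhtml")

if __name__ == "__main__":
    print(check_chain([xslt, grep]))                 # /text/xml fits into /text
    print(check_chain([xslt, xhtml2html]))           # plain /text/xml is too general
    print(check_chain([xslt_to_xhtml, xhtml2html]))  # narrowed output fits
```

The last two chains show why the second usage example above sets output-format="/text/xml/xhtml": without it, a following xhtml2html converter would have no guarantee that it really receives XHTML.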

6.5 Aggregators
---------------

They aggregate multiple data streams of the same format into one data stream. There can be multiple implementations of aggregators, just as is the case for producers.

Component definition:

  <map:aggregators default="append">
    ...
    <map:aggregator name="append"
                    impl="org.apache.cocoon.pipeline.aggregator.AppendAggregator"
                    input-format="any" output-format="any"/>
    <map:aggregator name="sound/mixer"
                    impl="org.apache.cocoon.pipeline.aggregator.sound.MixerAggregator"
                    input-format="/abstract/sound"
                    output-format="/abstract/sound"/>
    ...
  </map:aggregators>

Here's an example of how to aggregate different sound tracks into one:

  <map:aggregate type="sound/mixer">
    <!-- All parts have the same output-format ("/abstract/sound") -->
    <map:part ref="song/drums">
      <map:parameter name="volume" value="0.8"/>
    </map:part>
    <map:part ref="song/keyboard">
      <map:parameter name="volume" value="0.7"/>
    </map:part>
    <map:part ref="song/guitar">
      <map:parameter name="volume" value="0.8"/>
    </map:part>
    <map:part ref="song/bass">
      <map:parameter name="volume" value="0.7"/>
    </map:part>
    <map:part ref="song/voice">
      <map:parameter name="volume" value="1.0"/>
    </map:part>
  </map:aggregate>

6.6 Actions
-----------

They are somewhat similar to the actions already existing in Cocoon. They neither produce data nor consume data and therefore don't directly affect the data stream. They only affect the way the pipeline components work.

6.7 Redirectors
---------------

They are the same as those already existing in Cocoon, with the exception that the attribute 'uri' is renamed to 'ref' for consistency.

Example:

  <map:redirect-to ref="redirected-page"/>

6.8 Matchers
------------

They have practically the same functionality as the existing matchers. I'd suggest one extension though: a kind of polymorphism for URLs. This way it's possible to write pipelines for different input data formats while using identical URLs.

Component definition:

  <map:matchers default="wildcard">
    <map:matcher name="wildcard"
                 impl="org.apache.cocoon.pipeline.matcher.WildcardURIMatcher"/>
    ...
  </map:matchers>

Example with polymorphic URI matching:

  <map:pipeline input-format="/text/xml">
    <map:match pattern="upload/*">
      <map:consume ref="xmldb:xindice://localhost:4080/db/{1}"/>
    </map:match>
  </map:pipeline>

  <map:pipeline input-format="/binary">
    <map:match pattern="upload/*">
      <map:consume ref="files/binaries/{1}"/>
    </map:match>
  </map:pipeline>

6.9 Branches
------------

They affect the way the data stream takes through the pipeline. Branches are somewhat similar to selectors, but they are more like the control structures in Java (if, switch, ...). Matching works similarly to <map:match/> constructs.

The expression you want to test is represented by the attribute 'test'. The type of test is specified by the attribute 'type', where 'xpath' may be the most useful type and therefore the default. You can use other types like 'browser' for browser dependent branching.

The following example tests one value and compares it to different cases to determine the right choice. Every matching case will be tested and executed (depending on the attribute 'continue'). If no case matches, then the <map:default/> path is taken, if available. The case matcher can be compared to the <map:match/> component, thus different pattern types are possible (pattern, regexp, ...).

The <map:branch> element uses several attributes which are explained below:

- type: Type of branch to use
- test: Information about what should be used for branching
- data-type: XML Schema based data type (see [3]) for correct comparison (esp. for dates)
- continue: Specifies if matching should be continued after a successful match

Component definition:

  <map:branches default="value">
    <!-- This selector uses the value of the attribute 'test' for branching -->
    <map:selector name="value"
                  impl="org.apache.cocoon.pipeline.branch.ValueBranch"/>
    <!-- This selector uses the user agent string for branching -->
    <map:selector name="browser"
                  impl="org.apache.cocoon.pipeline.branch.BrowserBranch"/>
    <!-- This selector uses an XPath expression for branching -->
    <map:selector name="xpath"
                  impl="org.apache.cocoon.pipeline.branch.XPathBranch"/>
    <!-- This selector uses the error status of the last called component
         for branching -->
    <map:selector name="status"
                  impl="org.apache.cocoon.pipeline.branch.StatusBranch"/>
    ...
  </map:branches>

Example:

  <map:branch type="xpath" test="/document/metadata/status"
              data-type="xsd:string" continue="false">
    <map:case match="archive">
      <map:consume type="uri"
                   ref="xmldb:xindice://localhost:4080/db/archive/{1}"/>
    </map:case>
    <map:case match="live">
      <map:consume type="uri"
                   ref="xmldb:xindice://localhost:4080/db/live/{1}"/>
    </map:case>
    <map:default>
      <map:consume type="uri"
                   ref="xmldb:xindice://localhost:4080/db/draft/{1}"/>
    </map:default>
  </map:branch>

The next example allows more flexible tests by specifying different conditions in the attribute 'test' for every test case.

Theoretically it's possible that multiple case statements match. You can control the behavior with the attribute 'continue', which by default is 'false' and means that the first matching case gets executed and the <map:branch>...</map:branch> block is left. If you set it to 'true', executing this case does not leave the <map:branch/> block, and the following case statements are evaluated too.

The level of granularity is left up to you: You can set 'continue' directly in the <map:branch> element, thus setting the default behavior for all <map:case> elements.
Additionally you can set it for certain <map:case> statements which should be treated specially.

  <map:produce ... output-format="/text/xml"/>
  <map:branch>
    <map:case type="xpath"
              test="/document/metadata/online-date &lt; date()"
              continue="true" data-type="xsd:date">
      <map:consume type="uri"
                   ref="xmldb:xindice://localhost:4080/db/live/{1}"/>
    </map:case>
    <map:case ...>
      ...
    </map:case>
  </map:branch>

6.10 Exceptions
---------------

If some error occurs in the pipeline, you can throw and catch exceptions. This is necessary since the introduction of data formats can cause problems when feeding a pipeline with the wrong data format. But there are many other cases where exception handling in the sitemap can be useful.

To make it easier to understand, I'll base them on the Java exceptions. To throw an exception you can use <map:throw type="some type" message="some message"/>, where type stands for an exception type and message for an optional description of the exception. If you have to pass values to the exception you want to throw, you can use <map:parameter name="..." value="..."/> inside the <map:throw>...</map:throw> block.

The exception can then be caught with <map:catch type="some type">...</map:catch>, which can be located in different scopes as you can see below.

Component definition:

  <map:exceptions>
    <map:exception name="data-format"
                   impl="org.apache.cocoon.pipeline.exception.DataFormatException"/>
    ...
  </map:exceptions>

The order in which the scopes of the exception handlers are searched can be seen from the following examples:

1. Local exception handlers

   <map:pipeline>
     <map:match pattern="exception-test">
       ...
       <map:throw type="sometype"
                  message="This is a message explaining the error."/>
       ...
       <map:catch type="sometype">
         ...
       </map:catch>
     </map:match>
   </map:pipeline>

2. Pipeline exception handlers

   <map:pipeline>
     <map:match pattern="exception-test">
       ...
       <map:throw type="sometype"
                  message="This is a message explaining the error."/>
       ...
     </map:match>
     ...
     <map:exception-handlers>
       <map:catch type="sometype">
         ...
       </map:catch>
     </map:exception-handlers>
   </map:pipeline>

3. Global exception handlers

   <map:pipeline>
     <map:match pattern="exception-test">
       ...
       <map:throw type="sometype"
                  message="This is a message explaining the error."/>
       ...
     </map:match>
   </map:pipeline>
   <map:exception-handlers>
     <map:catch type="sometype">
       ...
     </map:catch>
   </map:exception-handlers>

7 Protocol Independence
=======================

Currently Cocoon is tightly bound to certain protocols by running an instance of it in a certain environment (servlet, CLI), and it's not (easily) possible to handle different invocation protocols from the same instance.

To abstract the transport protocols (through the use of certain consumers or producers) we already have a good working base. What is missing is binding a protocol to a certain port, but we should not duplicate work here which is better left to other software like Apache or Tomcat. We just need to find a way (which I'm sure already exists somewhere) to serve different ports with different protocols. I think the Servlet specification is general enough to support more than just HTTP/HTTPS and can help us here.

Assuming we have solved the port binding issue, we need some abstraction of the transport protocol. What I mean here is that I'd like to use pipelines independently of the way the request has been sent to Cocoon and how the response has to be sent back to the client.

To solve this we need something like a protocol handler, which maps requests from certain protocols to certain pipelines. The mapping itself is a very abstract thing and heavily depends on the protocol used.

Assuming we have even solved the protocol handler issue, I'd like to sketch some possible use cases below before we continue.

7.1 Web Services
----------------

As many of you know, there are two popular styles of Web Services: SOAP and REST.
Both have their own advantages and disadvantages, but I'd like to concentrate on SOAP and on its transport protocol independence, because REST-style Web Services are already possible with Cocoon.

SOAP allows us to use any transport protocol to deliver SOAP messages. Mostly HTTP(S) is used for this, but there are many cases where you have to use other protocols (like SMTP, FTP, ...).

Whatever protocol you choose to invoke your Web Services, the result should always be the same and the response should be delivered back through (mostly) the same protocol. This is one of the greatest advantages of protocol independence. What you want to do now is to implement the Web Service as a bunch of pipelines and let the protocol handler be responsible for invoking the same pipeline no matter which protocol has been used.

7.2 Mail Server
---------------

Nothing hinders you from implementing a mail server which can integrate various data sources and expose its functionality via the traditional protocols (SMTP, POP, IMAP) but also via HTTP, WAP, as a Web Service, and whatever else you want.

7.3 Mailing List Manager
------------------------

Mailing list managers typically provide several functions (subscribe, unsubscribe, deliver mail, suspend, archive, search, ...) and manage a list of subscribed users. Once again, you can write such a service once and expose its functionality through traditional protocols (HTTP, SMTP, ...) but also as a Web Service.

7.4 What else?
--------------

Perhaps you realize that this way you are free to implement every application you want by using the easy declarative pipeline processing concept. How to connect your application to the outside world is a separate issue which you can decide later and specify independently of the application.

8 Protocol Handler
==================

This component has been mentioned several times now, so it is time to try to explain it in more detail.
Currently Cocoon pipelines are primarily written for HTTP communication. A request is sent from a client to the server and enters a certain pipeline via the <map:match/> statements. The end of a pipeline always generates the response which is sent back to the client.

As you can see, even if you can theoretically run Cocoon in several environments, the servlet environment with the HTTP(S) protocol is the one which is used in most cases. So most pipelines depend on the HTTP protocol.

I'd suggest introducing an abstraction layer between the request a client sends through a certain protocol and the direct invocation of a pipeline. I'll try my best to make this as clear as possible ...

8.1 Component Definition
------------------------

Let's begin by defining the protocol handlers in the <map:components/> section:

  <map:protocols default="http">
    <map:protocol name="http" impl="org.apache.cocoon.protocol.HTTPProtocol"/>
    <map:protocol name="https" impl="org.apache.cocoon.protocol.HTTPSProtocol"/>
    <map:protocol name="ftp" impl="org.apache.cocoon.protocol.FTPProtocol"/>
    <map:protocol name="smtp" impl="org.apache.cocoon.protocol.SMTPProtocol"/>
    <map:protocol name="pop3" impl="org.apache.cocoon.protocol.POP3Protocol"/>
    <map:protocol name="imap" impl="org.apache.cocoon.protocol.IMAPProtocol"/>
    ...
  </map:protocols>

8.2 Protocol Binding
--------------------

After we have defined all possible protocols, we have to bind them to certain ports. Here I'd suggest the following:

  <map:bindings>
    <map:bind protocol="http" port="80"/>
    <map:bind protocol="http" port="8080"/>
    <map:bind protocol="ftp" port="21"/>
    <map:bind protocol="https" port="443"/>
    <map:bind protocol="smtp" port="25"/>
    <map:bind protocol="pop3" port="110"/>
    <map:bind protocol="pop3s" port="995"/>
    <map:bind protocol="imap" port="143"/>
    <map:bind protocol="imaps" port="993"/>
  </map:bindings>

Tomcat, for example, already does this kind of binding in the config file server.xml.
Perhaps we don't really need this protocol binding in Cocoon, but we should first check whether we can get all the information we need from the servlet container in a portable way (without depending on Tomcat!).

8.3 The Handler's Task
----------------------

Well, what does a protocol handler actually do?

First, it knows how to communicate using a certain protocol. That's obviously the most important thing, but it's not enough for us.

The second task is to determine which pipeline has to be invoked. It decides this on the basis of the information it gets from the request, using certain mapping rules.

The third task is to automatically provide a producer or consumer, depending on the request or response and on the pipeline which has to be invoked.

8.4 Mapping to Pipelines
------------------------

Mapping a request from a certain protocol to a certain pipeline can be a difficult task and depends heavily on the protocol itself. So I can only give you an example of a possible solution.
  <map:mappings>
    <map:protocol name="http">
      <!-- maps the URI of all http requests directly to all pipelines -->
      <map:map type="request-uri" from="**" to="**"/>
      <map:pipeline type="request">
        <!-- The components of this pipeline are executed before the sitemap pipeline components -->
        <map:produce type="http/request" output-format="/text/http/request"/>
        <map:convert type="http/request2any" input-format="/text/http/request" output-format="any"/>
      </map:pipeline>
      <map:pipeline type="response">
        <!-- The components of this pipeline are executed after the sitemap pipeline components -->
        <map:convert type="http/any2response" input-format="any" output-format="/text/http/response"/>
        <map:consume type="http/response" input-format="/text/http/response"/>
      </map:pipeline>
    </map:protocol>
    <map:protocol name="smtp">
      <!-- maps content of the mail header "Cocoon-Pipeline" directly to all pipelines -->
      <map:map type="header" from="Cocoon-Pipeline: **" to="post/**"/>
      <map:pipeline type="post">
        <!-- The components of this pipeline are executed after the sitemap pipeline components -->
        <map:convert type="smtp/any2post" input-format="any" output-format="/text/smtp"/>
        <map:consume type="smtp" input-format="/text/smtp"/>
      </map:pipeline>
    </map:protocol>
    <map:protocol name="pop3">
      <!-- maps content of the mail header "Cocoon-Pipeline" directly to all pipelines -->
      <map:map type="header" from="Cocoon-Pipeline: **" to="**"/>
      <map:pipeline type="deliver">
        <!-- The components of this pipeline are executed before the sitemap pipeline components -->
        <map:produce type="pop3" output-format="/text/pop[3]"/>
        <map:convert type="pop3/pop2any" input-format="/text/pop[3]" output-format="any"/>
      </map:pipeline>
    </map:protocol>
    <map:protocol name="ftp">
      <!-- maps the upload of a file under /home/ftp-user/upload/ to the pipelines starting with "upload/" -->
      <map:map type="put" from="/home/ftp-user/upload/**" to="upload/**"/>
      <!-- maps the download of a file under /home/ftp-user/ directly to all pipelines -->
      <map:map type="get" from="/home/ftp-user/**" to="**"/>
      <map:pipeline type="put">
        <!-- The components of this pipeline are executed before the sitemap pipeline components -->
        <map:produce type="ftp-put" output-format="/text/ftp/put"/>
      </map:pipeline>
      <map:pipeline type="get">
        <!-- The components of this pipeline are executed after the sitemap pipeline components -->
        <map:consume type="ftp-get" input-format="/text/ftp/get"/>
      </map:pipeline>
    </map:protocol>
  </map:mappings>

The only thing I don't like here is the use of <map:map/>, because I'm sure that this will cause misunderstandings. I'd suggest using another namespace prefix.


9 Pipelines as Pipeline Components
==================================

Based on the assumptions made so far, we can define rules for pipelines which implicitly make them pipeline components themselves:

9.1 Producer Pipelines
----------------------

Pipelines which produce data and don't consume anything are called producer pipelines. The following example produces data in the format "/text/xml", but does not consume any data, so it must have a producer component at the beginning of the pipeline but no consumer at the end.

Example:

  <map:pipeline output-format="/text/xml">
    <map:match pattern="producer-pipeline">
      <map:produce ... />
      ...
    </map:match>
  </map:pipeline>

You can use this pipeline as a producer in other pipelines by writing:

  <map:produce ref="cocoon:/producer-pipeline"/>

9.2 Consumer Pipelines
----------------------

Pipelines which consume data and don't produce data are called consumer pipelines. The following example consumes data in the format "/text/xml", but does not produce any data, so it must have a consumer component at the end of the pipeline but no producer at the beginning.

Example:

  <map:pipeline input-format="/text/xml">
    <map:match pattern="consumer-pipeline">
      ...
      <map:consume ...
      />
    </map:match>
  </map:pipeline>

You can use this pipeline as a consumer in other pipelines by writing:

  <map:consume ref="cocoon:/consumer-pipeline"/>

9.3 Converter Pipelines
-----------------------

Pipelines which consume a certain data format and produce a certain (different) data format are called converter pipelines. The following example converts data from the format "/text/xml/xhtml" to "/text/sgml/html", so it has neither a producer at the beginning of the pipeline nor a consumer at the end.

Example:

  <map:pipeline input-format="/text/xml/xhtml" output-format="/text/sgml/html">
    <map:match pattern="converter-pipeline">
      ...
    </map:match>
  </map:pipeline>

You can use this pipeline as a converter in other pipelines by writing:

  <map:convert ref="cocoon:/converter-pipeline"/>

9.4 Filter Pipelines
--------------------

Pipelines which consume a certain data format and produce the same (or a compatible) data format are called filter pipelines. The following example filters data with the format "/text/xml", so it has neither a producer at the beginning of the pipeline nor a consumer at the end.

Example:

  <map:pipeline input-format="/text/xml" output-format="/text/xml">
    <map:match pattern="filter-pipeline">
      ...
    </map:match>
  </map:pipeline>

You can use this pipeline as a filter in other pipelines by writing:

  <map:filter ref="cocoon:/filter-pipeline"/>

9.5 Action Pipelines
--------------------

Pipelines which neither consume nor produce data are called action pipelines. They can produce data internally through a producer and consume it again with a consumer, but no data flows in from or out to the outside of the pipeline.

Example:

  <map:pipeline>
    <map:match pattern="action-pipeline">
      <map:produce ... />
      ...
      <map:consume ...
      />
    </map:match>
  </map:pipeline>

You can use this pipeline as an action in other pipelines by writing:

  <map:act ref="cocoon:/action-pipeline"/>


10 Configuration Files
======================

With so many new sitemap declarations it is hard to keep the sitemap manageable. To solve this problem I'd suggest splitting it up into different files, each of which deals with a separate concern.

10.1 cocoon.xconf
-----------------

This configuration file has the same functionality as in current Cocoon versions. Its main purpose is to register and configure Avalon components.

10.2 components.xconf
---------------------

In this file all the pipeline components are defined (see section "6 Pipeline Components"). It uses its own namespace (e.g. http://apache.org/cocoon/component/1.0).

10.3 protocols.xconf
--------------------

In this file all the protocols are defined (see section "8 Protocol Handler"). It uses its own namespace (e.g. http://apache.org/cocoon/protocol/1.0).

10.4 bindings.xconf
-------------------

In this file all the protocol port bindings are defined (see section "8 Protocol Handler"). It uses its own namespace (e.g. http://apache.org/cocoon/binding/1.0).

10.5 protocol-mappings.xconf
----------------------------

In this file the mappings to sitemap pipelines are defined (see section "8 Protocol Handler"). It uses its own namespace (e.g. http://apache.org/cocoon/mapping/1.0).

10.6 data-formats.xconf
-----------------------

In this file all the data formats are defined (see section "5 Data Formats"). It uses its own namespace (e.g. http://apache.org/cocoon/format/1.0).

10.7 sitemap.xmap
-----------------

This file holds all the pipelines (see section "6 Pipeline Components"). It uses its own namespace (e.g. http://apache.org/cocoon/sitemap/3.0).

To be more flexible, the content of the configuration files can also be placed inside the sitemap. This will make it easier for small sitemaps.
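For a small sitemap, the inline form could look roughly like the following. It's only a sketch: it reuses the <map:protocols> and <map:bindings> elements from section "8 Protocol Handler" and assumes that they are accepted with their full content directly below <map:components>. The pattern "hello" and the file doc/hello.xml are made up for illustration.

```xml
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/3.0">
  <map:components>
    <!-- assumption: written inline instead of ref="protocols.xconf" -->
    <map:protocols default="http">
      <map:protocol name="http" impl="org.apache.cocoon.protocol.HTTPProtocol"/>
    </map:protocols>
    <!-- assumption: written inline instead of ref="bindings.xconf" -->
    <map:bindings>
      <map:bind protocol="http" port="8080"/>
    </map:bindings>
  </map:components>
  <!-- hypothetical producer pipeline, only here to make the sitemap complete -->
  <map:pipeline output-format="/text/xml">
    <map:match pattern="hello">
      <map:produce type="uri" ref="doc/hello.xml"/>
    </map:match>
  </map:pipeline>
</map:sitemap>
```

Everything lives in one file here, which should be fine as long as the lists of protocols and bindings stay short.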
For large sitemaps I'd suggest using references to those files instead, to keep the configuration manageable. This way you can even share the same files between different sitemaps just by referencing the same config file.

Here's a rough sketch of the structure of sitemap.xmap:

  <map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/3.0">
    ...
    <map:components> <!-- optional: ref="components.xconf" -->
      <map:protocols ref="protocols.xconf"/>
      <map:bindings ref="bindings.xconf" />
      <map:formats ref="formats.xconf" />
      <map:mappings ref="mappings.xconf" />
      <map:producers ... />
      <map:consumers ... />
      <map:converters ... />
      <map:filters ... />
      <map:exceptions ... />
    </map:components>
    ...
  </map:sitemap>

All sub elements of <map:components> can either hold their configuration directly as sub elements inside the sitemap, or the configuration can be moved out to external files which are referenced by the ref="..." attribute.

I'm still unsure whether we should really place everything below <map:components>, since some of the configuration involved doesn't specify new components (e.g. bindings and mappings). Perhaps we can find a more meaningful element name or split it up into different sections. Let's see what some discussion on this topic will bring us ...

10.8 Config File Hierarchy
--------------------------

Here's an overview of the hierarchy of the config files as it looks for now:

  cocoon.xconf (references the main sitemap.xmap with the treeprocessor declaration)
   |
   +-sitemap.xmap
      |
      +-components.xconf
         |
         +-protocols.xconf
         |
         +-bindings.xconf
         |
         +-mappings.xconf
         |
         +-formats.xconf
         |
         +-producers.xconf
         |
         +-consumers.xconf
         |
         +-converters.xconf
         |
         +-filters.xconf
         |
         +-exceptions.xconf


11 Converting old sitemaps to new sitemaps
==========================================

Some of you might be interested in whether this new concept is flexible enough to provide at least the same functionality as Cocoon does today.
I'll give you some examples of how old pipeline components can be translated to the new pipeline components.

The most important thing to remember is that all of the old pipeline components (except the reader) work with the data format "/text/xml" or derived formats. So theoretically the implementation of the old components does not differ very much from that of the new ones.

11.1 Generators
---------------

A generator is simply a producer which takes no input data and produces the output-format "/text/xml".

Here's an example:

  <map:generate type="file" src="doc/{1}.xml"/>

Maps to:

  <map:produce type="uri" ref="doc/{1}.xml" output-format="/text/xml"/>

You can also think of an XMLProducer, where the output-format is implicitly set to "/text/xml", so you don't have to provide it every time you use the producer. Of course this applies to all other components too.

11.2 Transformers
-----------------

They simply consume XML and produce XML, so they are actually filters.

Here's an example:

  <map:transform type="xslt" src="stylesheets/news2xhtml.xsl"/>

Maps to:

  <map:filter type="xml/xslt" ref="stylesheets/news2xhtml.xsl"/>

Since filters don't change the data format, you don't need to specify the input- and output-format: they are either specified implicitly in the component definition, or default to the input/output-format of the surrounding pipeline components.

11.3 Readers
------------

They simply read a file and deliver it, so they are actually producers.

Here's an example:

  <map:read src="welcome/cocoon.gif" mime-type="image/gif"/>

Maps to:

  <map:produce ref="welcome/cocoon.gif" output-format="/binary/gif"/>

NOTE 1: The MIME type is implicitly contained in every data format. So the output-format "/binary/gif" results in the MIME type "image/gif".

NOTE 2: There's one difference between the reader and the producer concerning the delivery of resources. The reader actually delivered them after reading, which is not the case with the producer.
The delivery is done automatically by the protocol handler, which appends certain (configurable) pipeline components to consumer pipelines (see section "8 Protocol Handler").

11.4 Serializers
----------------

They definitely convert XML to another format and therefore behave like converters.

Here's an example:

  <map:serialize type="svg2png" mime-type="image/png"/>

Maps to:

  <map:convert type="svg2png" input-format="/text/xml/svg" output-format="/binary/png"/>

The other tasks of a serializer, like preparing the response of the pipeline (HTTP headers, mime-type, ...), are done by the respective protocol handlers, which for example append the following components to the end of the consumer pipeline (see section "8 Protocol Handler"):

  <map:convert type="http/any2response" input-format="any" output-format="/text/http/response"/>
  <map:consume type="http/response" input-format="/text/http/response"/>

11.5 Selectors
--------------

The functionality of <map:select>...</map:select> is fully supported by the more flexible <map:branch>...</map:branch> concept and can be easily converted.

Here's an example:

  <map:select type="browser">
    <map:when test="wap">
      ...
    </map:when>
    <map:when test="netscape">
      ...
    </map:when>
    <map:otherwise>
      ...
    </map:otherwise>
  </map:select>

Maps to:

  <map:branch type="browser">
    <map:case match="wap">
      ...
    </map:case>
    <map:case match="netscape">
      ...
    </map:case>
    <map:default>
      ...
    </map:default>
  </map:branch>


12 Use Cases
============

This section gives you some examples which show the possibilities of the proposed architecture.

NOTE: I've added the input/output-format attributes to some of the pipeline components to make the examples easier to understand. Keep in mind that you don't need to specify them every time. Usually you'll only define them once per component in the components section, or they are implicitly set by surrounding components or the pipeline itself.
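To illustrate the note above, here is how one and the same pipeline might look with all format attributes spelled out and in the short form. This is only a sketch: it assumes that the "xml/xslt" filter and the "xhtml2html" converter get their formats from the components section or from their neighbours (those declarations are not shown, and how exactly they would be written is an assumption). The patterns "news-long/*.html" and "news-short/*.html" are made up for illustration.

```xml
<!-- long form: every component states its formats explicitly -->
<map:pipeline output-format="/text/sgml/html">
  <map:match pattern="news-long/*.html">
    <map:produce type="uri" ref="documents/news/{1}.xml" output-format="/text/xml/newsml"/>
    <map:filter type="xml/xslt" ref="stylesheets/news2xhtml.xsl"
                input-format="/text/xml/newsml" output-format="/text/xml/xhtml"/>
    <map:convert type="xhtml2html"
                 input-format="/text/xml/xhtml" output-format="/text/sgml/html"/>
  </map:match>
</map:pipeline>

<!-- short form: only the generic "uri" producer still names its format;
     the filter is assumed to take its formats from its neighbours, and the
     converter from its component definition and the surrounding pipeline -->
<map:pipeline output-format="/text/sgml/html">
  <map:match pattern="news-short/*.html">
    <map:produce type="uri" ref="documents/news/{1}.xml" output-format="/text/xml/newsml"/>
    <map:filter type="xml/xslt" ref="stylesheets/news2xhtml.xsl"/>
    <map:convert type="xhtml2html"/>
  </map:match>
</map:pipeline>
```

Both forms are meant to describe the same data flow; the short one simply relies on the defaults.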
12.1 File Upload
----------------

This example uploads an HTML news file, extracts the XML content and stores it in an XML database.

  <map:pipeline input-format="/text/sgml/html">
    <map:match pattern="upload/news/*.html">
      <map:convert type="html2xhtml" output-format="/text/xml/xhtml"/>
      <map:filter type="xml/xslt" ref="xhtml2news.xsl" output-format="/text/xml/newsml"/>
      <map:consume type="uri" ref="xmldb:xindice://localhost:4080/db/news/{1}.xml"/>
    </map:match>
  </map:pipeline>

12.2 Combining several pipelines
--------------------------------

In this example we are combining 3 pipelines:

1. This one generates data in a certain format:

  <map:pipeline output-format="/text/sgml/html">
    <map:match pattern="news/*.html">
      <map:produce type="uri" ref="documents/news/{1}.xml" output-format="/text/xml/newsml"/>
      <map:filter type="xml/xslt" ref="stylesheets/news2xhtml.xsl" output-format="/text/xml/xhtml"/>
      <map:convert type="xhtml2html" output-format="/text/sgml/html"/>
    </map:match>
  </map:pipeline>

2. This one consumes data in a certain format (it is the upload pipeline from 12.1, so it takes HTML as input):

  <map:pipeline input-format="/text/sgml/html">
    <map:match pattern="upload/news/*.html">
      <map:convert type="html2xhtml" output-format="/text/xml/xhtml"/>
      <map:filter type="xml/xslt" ref="xhtml2news.xsl" output-format="/text/xml/newsml"/>
      <map:consume type="uri" ref="xmldb:xindice://localhost:4080/db/news/{1}.xml"/>
    </map:match>
  </map:pipeline>

3. This one references both pipelines and combines them into a new one:

  <map:pipeline>
    <map:match pattern="replicate/news/*.html">
      <map:produce type="uri" ref="cocoon:/news/{1}.html"/>
      <map:consume type="uri" ref="cocoon:/upload/news/{1}.html"/>
    </map:match>
  </map:pipeline>

12.3 Unix Pipes
---------------

This is a universal filter pipeline which counts the number of lines of text data flowing through the pipeline. The optional argument can be used to grep each line.
  <map:pipeline input-format="/text" output-format="/text">
    <map:match pattern="filter/count/lines/**">
      <map:filter type="text/grep"> <!-- unix grep (regular expression filter) -->
        <map:parameter name="pattern" value="{1}"/>
      </map:filter>
      <map:filter type="text/wc"> <!-- unix wc (word count) -->
        <map:parameter name="mode" value="linecount"/>
      </map:filter>
    </map:match>
  </map:pipeline>

This pipeline uses the filter from above to analyze Apache's access_log for certain requests:

  <map:pipeline output-format="/text">
    <map:match pattern="statistics/forms/*">
      <map:produce ref="file:///var/log/httpd/access_log"/> <!-- like unix cat (list file contents) -->
      <map:filter ref="cocoon:/filter/count/lines/forms/login.html"/> <!-- unix grep (regular expression filter) -->
      <!-- Result is the number of requests to the file /forms/login.html in the Apache access log -->
    </map:match>
  </map:pipeline>

12.4 Image Processing
---------------------

This pipeline takes several image formats and converts them to the abstract image format, which can be used by format-independent image filters:

  <!-- Since we don't know the concrete image format for the input we have to use 'any' -->
  <map:pipeline input-format="any" output-format="/abstract/image">
    <map:match pattern="convert/to-image/*.*">
      <map:branch test="{2}">
        <map:case match="jpg|jpeg|JPG|JPEG">
          <map:convert type="jpg2image" input-format="/binary/jpeg"/>
        </map:case>
        <map:case match="gif|GIF">
          <map:convert type="gif2image" input-format="/binary/gif"/>
        </map:case>
        <map:default>
          <map:throw type="input-format" message="{2} is not a supported input image type."/>
        </map:default>
      </map:branch>
    </map:match>
  </map:pipeline>

This pipeline takes the abstract image format and converts it to certain specific image formats:

  <!-- Since we don't know the concrete image format for the output we have to use 'any' -->
  <map:pipeline input-format="/abstract/image" output-format="any">
    <map:match pattern="convert/from-image/*.*">
      <map:branch test="{2}">
        <map:case
            match="jpg|jpeg|JPG|JPEG">
          <map:convert type="image2jpg" output-format="/binary/jpeg"/>
        </map:case>
        <map:case match="gif|GIF">
          <map:convert type="image2gif" output-format="/binary/gif"/>
        </map:case>
        <map:default>
          <map:throw type="output-format" message="{2} is not a supported output image type."/>
        </map:default>
      </map:branch>
    </map:match>
  </map:pipeline>

This is an example of an abstract image filter pipeline, which is independent of the specific image data format. It prepares an image for character recognition:

  <map:pipeline input-format="/abstract/image" output-format="/abstract/image">
    <map:match pattern="filter/image/prepare-ocr">
      <map:filter type="image/histogram">
        <map:parameter name="equalize" value="full"/>
      </map:filter>
      <map:filter type="image/2greyscale" />
      <map:filter type="image/2bw">
        <map:parameter name="method" value="threshold"/>
        <map:parameter name="level" value="0.5"/>
      </map:filter>
    </map:match>
  </map:pipeline>

This pipeline invokes the pipelines from above and shows how these pipelines can be reused as pipeline components themselves:

  <!-- Since we don't know the image format we have to use 'any' as input and output format -->
  <map:pipeline input-format="any" output-format="any">
    <map:match pattern="filter/any-image/prepare-ocr/*">
      <map:convert ref="cocoon:/convert/to-image/{1}"/>
      <map:filter ref="cocoon:/filter/image/prepare-ocr"/>
      <map:convert ref="cocoon:/convert/from-image/{1}"/>
      <!-- Since the output format of the converter above is a certain image data format, it overrides the default for this pipeline (any). -->
    </map:match>
  </map:pipeline>

12.5 PDF decompiling
--------------------

This pipeline decompiles a PDF document into an intermediate XML format (see [4]), transforms it to a custom XML format (extract data) and stores it in an XML database. Depending on the success state, the client gets redirected to different response pages.
  <map:pipeline input-format="/binary/pdf">
    <map:match pattern="import/*.pdf">
      <map:convert type="pdf2xml" output-format="/text/xml/pdf-xml"/>
      <!-- Here we have an intermediate XML stream -->
      <map:filter type="xml/xslt" ref="stylesheets/pdfxml2docxml.xsl"/>
      <!-- Here we have an XML stream with the extracted information -->
      <map:consume type="uri" dest="xmldb:xindice://localhost:4080/db/news/{1}.xml"/>
      <map:branch type="consume/status">
        <map:case match="success">
          <map:redirect-to uri="success-page"/>
        </map:case>
        <map:default>
          <map:redirect-to uri="error-page"/>
        </map:default>
      </map:branch>
    </map:match>
  </map:pipeline>

12.6 Music Processing
---------------------

This pipeline generates a printable music score from a MIDI file (without XML):

  <map:pipeline input-format="/binary/midi" output-format="/binary/pdf">
    <map:match pattern="convert/midi2pdf/*">
      <map:convert type="midi2musitex" output-format="/text/tex/musixtex"/>
      <map:convert type="tex2dvi" input-format="/text/tex" output-format="/binary/dvi"/>
      <map:convert type="dvi2pdf" output-format="/binary/pdf"/>
    </map:match>
  </map:pipeline>

The next pipeline uses MidiXML, an XML format which is part of MusicXML and is available for representing music data (see [5] and [6]). It converts the binary MIDI format to MidiXML, selects the keyboard channel, transposes it 5 pitches up and converts it back to the MIDI format.

  <map:pipeline input-format="/binary/midi" output-format="/binary/midi">
    <map:match pattern="filter/custom/*">
      <map:convert type="midi2xml" output-format="/text/xml/midixml"/>
      <map:filter type="midixml/select-channel">
        <map:parameter name="name" value="keyboard"/>
      </map:filter>
      <map:filter type="midixml/transpose">
        <map:parameter name="value" value="+5"/>
      </map:filter>
      <map:convert type="xml2midi" output-format="/binary/midi"/>
    </map:match>
  </map:pipeline>


13 Conclusion
=============

You might ask, why should we change so much from Cocoon?
First, I think the new components are much more flexible and at least as easy to understand as the old ones: If you want to produce a data stream you use a producer, if you want to consume it you use a consumer, if you want to convert it you use a converter and if you want to filter it you use a filter. To control the data flow you can use the <map:branch/> component.

A possible migration path could be to support both sitemap versions, since the pipeline components either have different names or provide the same functionality. So a new sitemap implementation could be backward compatible with older sitemap versions. This could make the transition as easy as possible for the user.

Additionally it might be possible to provide a migration script (e.g. via XSL) which reads an old sitemap and converts it to the new format. Since everything from the old sitemap can be expressed in the new sitemap and can be formally translated (see section "11 Converting old sitemaps to new sitemaps"), this should not be a big issue.


14 TODO
=======

1. Which concrete role do the data handlers play? Do we need an input and an output data handler or just one? Do we need data handlers at all?

2. Define and manage a list of data formats (central internet repository?). Perhaps it's possible to coordinate the work on MIME types and data formats.

3. The number of components could explode very quickly. Therefore we should take care to design good package structures and namespaces to overcome this problem.

4. The protocol handlers have to be worked out more precisely.

5. The parameters of a data format actually reflect its metadata. Support for RDF/OWL (see [7] and [8]) would definitely make sense to get one step closer to the semantic web.

15 References
=============

[1] [RT] Input Pipelines (long) (thread on cocoon-dev initiated by Daniel Fangerstrom on Dec 17th 2002)
    http://www.mail-archive.com/cocoon-dev@xml.apache.org/msg25503.html

[2] MIME Media Types
    http://www.iana.org/assignments/media-types/

[3] XML Schema Datatypes
    http://www.w3.org/TR/xmlschema-2/

[4] JPedal
    Open Source library written in Java which can extract data from PDF documents.
    http://www.jpedal.org/

[5] MusicXML
    http://www.recordare.com/xml.html

[6] XEMO
    http://www.xemo.org/
    Project XEMO is an open source, modular software environment for the development and delivery of interactive music, audio and sound applications. It is written in Java and supports MusicXML.

[7] RDF - Resource Description Framework
    http://www.w3.org/RDF/

[8] OWL - Web Ontology Language based on RDF
    http://www.w3.org/TR/2002/WD-owl-ref-20021112/


16 Appendix
===========

This new architecture opens up a whole new level of flexibility and integration for data processing, of the kind that has already been possible for XML processing. Here I'd like to give you an idea of some further data formats and components, and I'm sure you can think of even more. Remember: Only your mind is the limit ;-)

16.1 Data Formats
-----------------

In this section you can find a proposed list of data formats, which gives an overview of how they could be structured.

No data format:
- none (used, if nothing is produced/consumed)

Super data format:
- any (base data format for abstract, binary and text)

Abstract data formats (used by components which are independent from concrete file format):
- /abstract/image
- /abstract/music
- /abstract/sound
- /abstract/vector
- /abstract/vector/3d
- /abstract/video

Binary data formats:
- /binary
- /binary/au
- /binary/avi
- /binary/avi/indeo
- /binary/avi/indeo[4.1]
- /binary/avi/indeo[5.0]
- /binary/avi/divx
- /binary/bmp
- /binary/bmp/os2
- /binary/bmp/windows
- /binary/elf
- /binary/elf/executable
- /binary/elf/shared
- /binary/gif
- /binary/gif[87a]
- /binary/gif[89a]
- /binary/mp3
- /binary/mpeg
- /binary/ogg
- /binary/ole
- /binary/ole/msexcel
- /binary/ole/mspowerpoint
- /binary/ole/msword
- /binary/tiff
- /binary/tiff/jpeg
- /binary/tiff/lzw
- /binary/tiff/packbits
- /binary/tiff/zip
- /binary/wav
- /binary/...

Text data formats:
- /text
- /text/http
- /text/http/request
- /text/http/request[0.9]
- /text/http/request[1.0]
- /text/http/request[1.1]
- /text/http/response
- /text/http/response[0.9]
- /text/http/response[1.0]
- /text/http/response[1.1]
- /text/sgml
- /text/sgml/docbook
- /text/sgml/docbook/simple
- /text/sgml/html
- /text/sgml/html[3.0]
- /text/sgml/html[4.0]
- /text/sgml/html[4.1]
- /text/sgml/html/frameset
- /text/sgml/html/strict
- /text/sgml/html/transitional
- /text/tex
- /text/tex/latex
- /text/tex/musixtex
- /text/xml
- /text/xml/docbook
- /text/xml/docbook/simple
- /text/xml/rdf
- /text/xml/rdf/rss
- /text/xml/svg
- /text/xml/xhtml
- /text/xml/xhtml[1.0]
- /text/xml/xhtml[1.1]
- /text/...

16.2 Pipeline Components
------------------------

Image Processing (bringing Photoshop to Cocoon ;-):
- BlurFilter
- AquarellFilter
- NoiseFilter
- SharpenFilter
- ExtrudeFilter
- ReliefFilter
- HistogramFilter
- ...

Sound Processing (bringing Arts/SOX/Cubase to Cocoon ;-):
- EqualizerFilter
- DistortionFilter
- ChorusFilter
- DelayFilter
- FlangerFilter
- VolumeFilter
- PitchShifterFilter
- MixerAggregator
- SequenceAggregator
- MP32SoundConverter
- Sound2MP3Converter
- Ogg2SoundConverter
- Sound2OggConverter
- ...

Video Processing (bringing Premiere to Cocoon ;-):
- BlendingAggregator
- MixerAggregator
- EffectsFilter
- AVI2VideoConverter
- Video2AVIConverter
- MPG2VideoConverter
- Video2MPGConverter
- ...

For video processing it would be nice to be able to process the audio part of the video with sound processing components and the image part of the video with the image processing components (maximum component reuse!). This demands that the abstract video data format is composed of the abstract sound format and a sequence of abstract image formats which is done by extending both /abstract/image and /abstract/sound formats in the declaration of /abstract/video (see section "5 Data Formats").

Vector Graphics Processing (bringing Corel Draw/Illustrator to Cocoon ;-):
- BooleanFilter (Union, intersection, ...)
- TranslationFilter (Move, rotate, resize, ...)
- VectorAggregator (Aggregate different vector graphics)
- SVG2VectorConverter
- Vector2SVGConverter
- WMF2VectorConverter
- Vector2WMFConverter
- CDR2VectorConverter
- Vector2CDRConverter
- ...

Music Processing (bringing Arts/Cubase/Capella/Sibelius/Finale to Cocoon ;-):
- PitchShifterFilter
- Midi2MusicConverter
- Music2MidiConverter
- Music2ImageConverter (render music score for printing)
- Image2MusicConverter (you know Capella Scan?)
- Music2SoundConverter (render music to synthesized sound)
- Sound2MusicConverter (extract music data from sound data)
- ...

3D Graphics Processing (bringing 3D Studio/POV-Ray to Cocoon ;-):
- TranslationFilter (Move, rotate, resize, ...)
- 3DAggregator (Aggregate different 3D graphics)
- ParticleFilter
- ExplosionFilter
- 3DS23DConverter
- 3D23DSConverter
- DXF23DConverter
- 3D2DXFConverter
- POV23DConverter
- 3D2POVConverter
- 3D2ImageConverter (render an image)
- 3D2VideoConverter (render an animated scene)
- ...