Input Pipelines
===============

There is, IMO, a need for better support for input handling in Cocoon. I believe that the introduction of "input pipelines" can be an important step in this direction. In the rest of this (long) RT I will discuss use cases for them, propose a possible definition of input pipelines, compare them with the existing pipeline concept in Cocoon (henceforth called output pipelines), discuss what kinds of components would be useful in them and how they can be used in the sitemap and from flowscripts, and also relate them to the current discussion about how to reuse functionality ("Cocoon services") between blocks.
Use cases
---------

There is an ongoing trend of packaging all kinds of applications as web applications, or decomposing them into sets of web services. At the same time web browsers are more and more becoming a universal GUI for all kinds of applications (e.g. XUL). This leads to an increasing need for handling structured input data in web applications. SOAP might be the most important example, but we also have XML-RPC and most certainly numerous home-brewed formats, some of which might even be binary non-xml legacy formats. WebDAV is another example of xml input, and the next generation of form handling, XForms, uses xml as its transport format.

As people are building more and more advanced Cocoon systems there is also a growing need for reusing functionality in a structured way; there have been discussions about how to package and reuse "Cocoon services" in the context of blocks [1] and [2]. Here there is also a need for handling xml input.

The company I work for builds data warehouses. Some of our customers are starting to get interested in using the functionality of the data warehouses not only from the web interfaces that we usually build, but also as parts of their own webapps. This means that we want, besides Cocoon's flexibility in presenting data in different forms, also flexibility in asking for the data through different input formats.

There is thus a world of input beyond the request parameters, and a world of rapidly growing importance.

Does Cocoon support the abovementioned use cases? Yes and no: there are numerous components that implement SOAP, WebDAV, parts of XForms etc. But while the components designed for publishing are highly reusable in various contexts, this is not the case for input components. IMO the reason for this is that Cocoon as a framework does not have much support for input handling.

Cocoon could be as good at handling input as it currently is at creating output, by reusing exactly the same concept: pipelines. We cannot, however, use the existing "output pipelines" as is; there are some asymmetries in their design that make them unsuitable for input. The term "input pipeline" has sometimes been used on the list, and it is time to try to define what it could be.

What is an Input Pipeline
-------------------------

An input pipeline typically starts by reading octet data from the input stream of the request object. The input data could be xml, tab-separated data, text that is structured according to a certain grammar, binary legacy formats like Excel or Word, or anything else that could be translated to xml. The first step in the input pipeline is an adapter from octet data to sax events. This sounds quite similar to a generator; we will return to this in the next section.

The structure of the xml from the first step in the pipeline might not be in a form that is suitable for the data model that we would like to use internally in the system. Reasons for this can be that the xml input is supposed to follow some standard or some customer-defined format. Input adapters for legacy formats will probably produce xml that is similar to the input format and repeat all kinds of idiosyncrasies from that format. There is thus a need to transform the input xml to an xml format better suited to our application-specific needs. One or several xslt transformer steps would therefore be useful in the input pipeline.

As a last step in the input pipeline the sax events should be adapted to some binary format so that e.g. the business logic in the system can be applied to it. The xml input could e.g. be serialized to an octet stream for storage in a file (as text, xml, pdf, images, ...), transformed to java objects for storage in the session object, be put into an xml db or into a relational db.

Isn't this exactly what an output pipeline does?
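To make the shape of such a pipeline concrete, here is a minimal sketch in plain JAXP/SAX, outside of any Cocoon machinery. The class name, file names and the single xslt step are my own placeholders, purely for illustration:

  import java.io.FileOutputStream;
  import java.io.InputStream;
  import javax.xml.parsers.SAXParserFactory;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.sax.SAXTransformerFactory;
  import javax.xml.transform.sax.TransformerHandler;
  import javax.xml.transform.stream.StreamResult;
  import javax.xml.transform.stream.StreamSource;
  import org.xml.sax.InputSource;
  import org.xml.sax.XMLReader;

  /** Illustration only: octet input -> sax adapter -> xslt step -> binary sink. */
  public class InputPipelineSketch {

      public void process(InputStream requestInput) throws Exception {
          // 1. Adapter from octet data to sax events (the "inverse generator").
          XMLReader reader =
              SAXParserFactory.newInstance().newSAXParser().getXMLReader();

          // 2. One xslt step mapping the external format to the internal one.
          SAXTransformerFactory stf =
              (SAXTransformerFactory) TransformerFactory.newInstance();
          TransformerHandler toInternal = stf.newTransformerHandler(
              new StreamSource("myInputFormat2MyStorageFormat.xsl"));

          // 3. Adapter from sax events to a binary format, here a plain file.
          toInternal.setResult(new StreamResult(new FileOutputStream("stored.xml")));

          reader.setContentHandler(toInternal);
          reader.parse(new InputSource(requestInput));
      }
  }

In Cocoon terms, steps 1 and 3 correspond to the components discussed below, and step 2 is an ordinary xslt transformer.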
Comparison to Output Pipelines
------------------------------

Both an input and an output pipeline consist of an adapter from a binary format to sax events, followed by a (possibly empty) sequence of transformers that take sax events as input as well as output. The last step is an adapter from sax events back to a binary format. The main difference (and the one I will focus on) is how the binary input and output is connected to the pipeline. Let us look at an example of an output pipeline:

  <match pattern="*.html">
    <generate type="xml" src="{1}.xml"/>
    <transform type="xsl" src="foo.xsl"/>
    <serialize type="html"/>
  </match>

The input to the pipeline is controlled from the sitemap by the src attribute in the generator, while the output from the serializer can't be controlled from the sitemap; the context in which the sitemap is used is responsible for directing the output to an appropriate place. If the pipeline is used from a servlet, the output will be directed to the output stream of the response object in the servlet. If it is used from the command line, the output will be redirected to a file. If it is used in the cocoon: protocol, the output will be redirected to be used as input for the src attribute of e.g. a generator or a transformer (cf. Carsten's and my writings in [1] about the semantics of the cocoon: protocol). Here is another example:

  <match pattern="bar.pdf">
    <generate type="xsp" src="bar.xsp"/>
    <transform type="xsl" src="foo.xsl"/>
    <serialize type="pdf"/>
  </match>

In this case the binary input is taken from the object model and the component manager in Cocoon, and the input file to the generator, "bar.xsp", describes how to extract the input and how to structure it as an xml document.

If we compare a Cocoon output pipeline with a unix pipeline, it always ignores standard input and always writes to standard output. An input pipeline would be the opposite: it would always read from standard input and ignore standard output. In Cocoon this would mean that the input source is set by the context. In a servlet, input would be taken from the input stream of the request object. We could also have a writable cocoon: protocol where the input stream is set by the user of the protocol, more about that later (see also my post in the thread [1]). An example:

  <match pattern="**.xls">
    <generate type="xls"/>
    <transform type="xsl" src="foo.xsl"/>
    <serialize type="xml" dest="context://repository/{1}.xml"/>
  </match>

Here the generator reads an Excel document from the input stream that is submitted by the context, and translates it to some xml format. The serializer writes its xml input to the file system.

I reused the names generator and serializer partly because I didn't find any good names (a deserializer is the inverse of a serializer, but what is the inverse of a generator?), and partly because it would IMO be the best solution if the generators and serializers from output pipelines could be extended to be usable in input pipelines as well. Several of the existing generators would be highly usable in input pipelines if they were modified in such a way that they read from "standard input" when no src attribute is given. There are also some serializers that would be useful in input pipelines; in this case the output stream given in the dest attribute should be used instead of the one supplied by the context. It can of course be problematic to extend the definition of generators and serializers, as it might lead to back compatibility problems.
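As a sketch of the "no src attribute means read from standard input" rule, a stream-aware xml generator could behave roughly as below. The interface is hypothetical and simplified, not the actual Cocoon Generator contract; only the fallback logic is the point:

  import java.io.InputStream;
  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.ContentHandler;
  import org.xml.sax.InputSource;
  import org.xml.sax.XMLReader;

  /** Hypothetical sketch: parse from src if given, else from the context input. */
  public class StreamAwareXmlGenerator {

      private ContentHandler consumer;   // next stage in the pipeline
      private String src;                // value of the src attribute, may be null
      private InputStream contextInput;  // "standard input", e.g. the request body

      public void setConsumer(ContentHandler consumer) { this.consumer = consumer; }

      public void setup(String src, InputStream contextInput) {
          this.src = src;
          this.contextInput = contextInput;
      }

      public void generate() throws Exception {
          XMLReader reader =
              SAXParserFactory.newInstance().newSAXParser().getXMLReader();
          reader.setContentHandler(consumer);
          if (src != null) {
              reader.parse(new InputSource(src));          // behaves as today
          } else {
              reader.parse(new InputSource(contextInput)); // input pipeline mode
          }
      }
  }

A serializer could get the mirror-image rule: write to the dest attribute when one is given, otherwise to the output stream supplied by the context.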
Another example of an input pipeline:

  <match pattern="in">
    <generate type="textparser">
      <parameter name="grammar" value="example.txt"/>
    </generate>
    <transform type="xsl" src="foo.xsl"/>
    <serialize type="xsp" src="toSql.xsp"/>
  </match>

In this example the serializer modifies the content of components that can be found through the object model and the component manager. We use a hypothetical "output xsp" language to describe how to modify the environment. Such a language could be a little bit like xslt in the sense that it recursively applies templates (rules) with matching xpath patterns, but the templates would contain custom tags that have side effects instead of just emitting xml. Could such a language be implemented in Jelly? It would be useful to have custom tags that modify the session object, write to sql databases, connect to business logic and so on.

Error Handling
--------------

Error handling in input pipelines is even more important than in output pipelines: we must protect the system against non-well-formed input, and the users must be given detailed enough information about what is wrong, while in many cases they have no access to log files or to the internals of the system. Examples of things that can go wrong are that the input is not parsable, or that it is not valid with respect to some grammar or schema.

If we want input pipelines to work in streaming mode, without unnecessary buffering, it is impossible to know that the input data is correct until all of it has been processed. This means that the serializer might already have stored some parts of the pipeline data when an error is detected. I think that serializers for which faulty input data would be unacceptable should use some kind of transactions, and that they should be notified when something goes wrong earlier in the pipeline so that they are able to roll back the transaction. I have not studied the error handling system in Cocoon; maybe there already are mechanisms that could be used in input pipelines as well?
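To illustrate the transactional idea (the class and the abort() hook are hypothetical, and the jdbc url is a placeholder), a sql-storing serializer could look roughly like this:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.SQLException;
  import org.xml.sax.SAXException;
  import org.xml.sax.helpers.DefaultHandler;

  /**
   * Hypothetical sketch of a transactional serializer: it stores data as sax
   * events arrive, but commits only when the whole input has proven well
   * formed and valid. abort() is the notification from the pipeline that an
   * earlier stage failed.
   */
  public class TransactionalSqlSerializer extends DefaultHandler {

      private Connection connection;

      public void startDocument() throws SAXException {
          try {
              connection = DriverManager.getConnection("jdbc:...");  // placeholder url
              connection.setAutoCommit(false);                       // begin transaction
          } catch (SQLException e) {
              throw new SAXException(e);
          }
      }

      // startElement()/characters() would translate sax events into inserts here.

      public void endDocument() throws SAXException {
          try {
              connection.commit();   // all input processed without errors
              connection.close();
          } catch (SQLException e) {
              throw new SAXException(e);
          }
      }

      /** Called when an earlier pipeline stage reports an error. */
      public void abort() {
          try {
              connection.rollback(); // undo the partially stored input
              connection.close();
          } catch (SQLException e) {
              // nothing more we can do here
          }
      }
  }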
In Sitemaps
-----------

In a sitemap an input pipeline could be used e.g. for implementing a web service:

  <match pattern="myservice">
    <generate type="xml">
      <parameter name="scheme" value="myInputFormat.scm"/>
    </generate>
    <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
    <serialize type="dom-session" non-terminating="true">
      <parameter name="dom-name" value="input"/>
    </serialize>
    <select type="pipeline-state">
      <when test="success">
        <act type="my-business-logic"/>
        <generate type="xsp" src="collectTheResult.xsp"/>
        <serialize type="xml"/>
      </when>
      <when test="non-valid">
        <!-- produce an error document -->
      </when>
    </select>
  </match>

Here we first have an input pipeline that reads and validates xml input, transforms it to some appropriate format and stores the result as a dom-tree in a session attribute. A serializer normally means that the pipeline should be executed and thereafter an exit from the sitemap; I used the attribute non-terminating="true" to mark that the input pipeline should be executed but that there is more to do in the sitemap afterwards.

After the input pipeline there is a selector that selects the output pipeline depending on whether the input pipeline succeeded or not. This use of selection has some relation to the discussion about pipe-aware selection (see [3] and the references therein). It would solve at least my main use cases for pipe-aware selection, without having its drawbacks: Stefano considered pipe-aware selection a mix of concerns, as selection should be based on meta data (pipeline state) rather than on data (pipeline content), and there were also some people who didn't like my use of buffering of all input to the pipe-aware selector. IMO the use of selectors above solves both of these issues.

The output pipeline starts with an action that takes care of the business logic for the application. This is IMHO a more legitimate use for actions than the current mix of input handling and business logic.

In Flowscripts
--------------

IIRC the discussion and examples of input for flowscripts have so far mainly dealt with request-parameter-based input. If we want to use flowscripts for describing e.g. web service flow, more advanced input handling is needed. IMO it would be an excellent separation of concerns to use output pipelines for the presentation of the data used in the system, input pipelines for going from input to system data, java objects (or some other programming language) for describing the business logic working on the data within the system, and flowscripts for connecting all this in an appropriate temporal order.

For Reusability Between Blocks
------------------------------

There have been some discussions about how to reuse functionality between blocks in Cocoon (see the threads [1] and [2] for background). IMO (cf. my post in the thread [1]), a natural way of exporting pipeline functionality is by extending the cocoon pseudo protocol so that it accepts input as well as produces output. The protocol should also be extended so that input as well as output can be any octet stream, not just xml. If we extend generators so that their input can be set by the environment (as proposed above), we have what is needed for creating a writable cocoon protocol. The web service example in the section "In Sitemaps" could then also be used as an internal service, exported from a block.

Both input and output for the extended cocoon protocol can be either xml or non-xml, which gives us four cases:

  xml input, xml output: could be used from a "pipeline" transformer; the input to the transformer is redirected to the protocol and the output from the protocol is redirected to the output of the transformer.

  non-xml input, xml output: could be used from a generator.

  xml input, non-xml output: could be used from a serializer.

  non-xml input, non-xml output: could be used from a reader if the input is ignored, from a "writer" if the output is ignored, and from a "reader-writer" if both are used.

Generators that accept xml should of course also accept sax events for efficiency reasons, and serializers that produce xml should for the same reason also be able to produce sax events.
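A sketch of what the caller's side of such a writable cocoon protocol could look like (none of these interfaces exist today; the names and methods are purely hypothetical):

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;

  /**
   * Hypothetical contract for a writable cocoon: source. The caller writes
   * the "standard input" of the service and reads its "standard output";
   * whether those streams carry xml or not distinguishes the four cases above.
   */
  public interface WritableCocoonSource {

      /** Stream that feeds the input pipeline behind e.g. "cocoon:/myservice". */
      OutputStream getInputToService() throws IOException;

      /** Stream that delivers whatever the output pipeline produced. */
      InputStream getOutputFromService() throws IOException;
  }

A "pipeline" transformer would then push its incoming sax events, serialized as xml, into getInputToService() and parse getOutputFromService() back into sax events for the next stage, while a block-level client could use the same two streams for non-xml data.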
Conclusion
----------

The ability to handle structured input (e.g. xml) in a convenient way will probably be an important requirement on webapp frameworks in the near future. By removing the asymmetry between generators and serializers, i.e. by letting the input of a generator be set by the context and the output of a serializer be set from the sitemap, Cocoon could IMO be as good at handling input as it is today at producing output. This would also make it possible to introduce a writable as well as readable Cocoon pseudo protocol, which would be a good way to export functionality from blocks.

There are of course many open questions, e.g. how to implement these ideas without introducing too much back incompatibility.

What do you think?

/Daniel Fagerstrom

References
----------

[1] [RT] Using pipeline as sitemap components (long)
    http://marc.theaimsgroup.com/?t=103787330400001&r=1&w=2

[2] [RT] reconsidering pipeline semantics
    http://marc.theaimsgroup.com/?t=102562575200001&r=2&w=2

[3] [Contribution] Pipe-aware selection
    http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=101735848009654&w=2