Input Pipelines
===============

There is, IMO, a need for better support for input handling in Cocoon. I believe that the introduction of "input pipelines" can be an important step in this direction. In the rest of this (long) RT I will discuss use cases for them, propose a possible definition of input pipelines, compare them with the existing pipeline concept in Cocoon (henceforth called output pipelines), discuss what kinds of components would be useful in them and how they can be used in the sitemap and from flowscripts, and also relate them to the current discussion about how to reuse functionality ("Cocoon services") between blocks.
Use cases
---------

There is an ongoing trend of packaging all kinds of applications as web applications, or decomposing them into sets of web services. At the same time web browsers are more and more becoming a universal GUI for all kinds of applications (e.g. XUL). This leads to an increasing need for handling structured input data in web applications. SOAP might be the most important example, but we also have XML-RPC and most certainly numerous home-brewed formats, some of which might even be binary non-xml legacy formats. WebDAV is another example of xml input, and the next generation of form handling, XForms, uses xml as its transport format.

As people are building more and more advanced Cocoon systems there is also a growing need for reusing functionality in a structured way; there have been discussions about how to package and reuse "Cocoon services" in the context of blocks [1] and [2]. Here there is also a need for handling xml input.

The company I work for builds data warehouses. Some of our customers are starting to get interested in using the functionality of the data warehouses not only from the web interfaces that we usually build, but also as parts of their own webapps. This means that we want, besides Cocoon's flexibility in presenting data in different forms, also flexibility in asking for the data through different input formats.

There is thus a world of input beyond the request parameters, and a world of rapidly growing importance.

Does Cocoon support the abovementioned use cases? Yes and no: there are numerous components that implement SOAP, WebDAV, parts of XForms etc. But while the components designed for publishing are highly reusable in various contexts, this is not the case for input components. IMO the reason for this is that Cocoon as a framework does not have much support for input handling.

Cocoon could be as good at handling input as it currently is at creating output, by reusing exactly the same concept: pipelines. We cannot, however, use the existing "output pipelines" as is; there are some asymmetries in their design that make them unsuitable for input. The term "input pipeline" has sometimes been used on the list, and it is time to try to define what it could be.

What is an Input Pipeline
-------------------------

An input pipeline typically starts by reading octet data from the input stream of the request object. The input data could be xml, tab-separated data, text that is structured according to a certain grammar, binary legacy formats like Excel or Word, or anything else that could be translated to xml. The first step in the input pipeline is an adapter from octet data to sax events. This sounds quite similar to a generator; we will return to this in the next section.

The structure of the xml from the first step in the pipeline might not be in a form that is suitable for the data model that we would like to use internally in the system. Reasons for this can be that the xml input is supposed to follow some standard or some customer-defined format. Input adapters for legacy formats will probably produce xml that is similar to the input format and repeat all kinds of idiosyncrasies from that format. There is thus a need to transform the input xml to an xml format better suited to our application-specific needs. One or several xslt transformer steps would therefore be useful in the input pipeline.

As a last step in the input pipeline the sax events should be adapted to some binary format so that e.g. the business logic in the system can be applied to it. The xml input could e.g. be serialized to an octet stream for storage in a file (as text, xml, pdf, images, ...), transformed to java objects for storage in the session object, be put into an xml db or into a relational db.

Isn't this exactly what an output pipeline does?
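To make the shape of such a pipeline concrete, here is a minimal sketch in plain JAXP/SAX, outside of any Cocoon machinery. The class name, file names and the single xslt step are my own placeholders, purely for illustration:

  import java.io.FileOutputStream;
  import java.io.InputStream;
  import javax.xml.parsers.SAXParserFactory;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.sax.SAXTransformerFactory;
  import javax.xml.transform.sax.TransformerHandler;
  import javax.xml.transform.stream.StreamResult;
  import javax.xml.transform.stream.StreamSource;
  import org.xml.sax.InputSource;
  import org.xml.sax.XMLReader;

  /** Illustration only: octet input -> sax adapter -> xslt step -> binary sink. */
  public class InputPipelineSketch {

      public void process(InputStream requestInput) throws Exception {
          // 1. Adapter from octet data to sax events (the "inverse generator").
          XMLReader reader =
              SAXParserFactory.newInstance().newSAXParser().getXMLReader();

          // 2. One xslt step mapping the external format to the internal one.
          SAXTransformerFactory stf =
              (SAXTransformerFactory) TransformerFactory.newInstance();
          TransformerHandler toInternal = stf.newTransformerHandler(
              new StreamSource("myInputFormat2MyStorageFormat.xsl"));

          // 3. Adapter from sax events to a binary format, here a plain file.
          toInternal.setResult(new StreamResult(new FileOutputStream("stored.xml")));

          reader.setContentHandler(toInternal);
          reader.parse(new InputSource(requestInput));
      }
  }

In Cocoon terms, steps 1 and 3 correspond to the components discussed below, and step 2 is an ordinary xslt transformer.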
Comparison to Output Pipelines
------------------------------

Both an input and an output pipeline consist of an adapter from a binary format to sax events, followed by a (possibly empty) sequence of transformers that take sax events as input as well as output. The last step is an adapter from sax events back to a binary format. The main difference (and the one I will focus on) is how the binary input and output is connected to the pipeline. Let us look at an example of an output pipeline:

  <match pattern="*.html">
    <generate type="xml" src="{1}.xml"/>
    <transform type="xsl" src="foo.xsl"/>
    <serialize type="html"/>
  </match>

The input to the pipeline is controlled from the sitemap by the src attribute in the generator, while the output from the serializer can't be controlled from the sitemap; the context in which the sitemap is used is responsible for directing the output to an appropriate place. If the pipeline is used from a servlet, the output will be directed to the output stream of the response object in the servlet. If it is used from the command line, the output will be redirected to a file. If it is used in the cocoon: protocol, the output will be redirected to be used as input for the src attribute of e.g. a generator or a transformer (cf. Carsten's and my writings in [1] about the semantics of the cocoon: protocol). Here is another example:

  <match pattern="bar.pdf">
    <generate type="xsp" src="bar.xsp"/>
    <transform type="xsl" src="foo.xsl"/>
    <serialize type="pdf"/>
  </match>

In this case the binary input is taken from the object model and the component manager in Cocoon, and the input file to the generator, "bar.xsp", describes how to extract the input and how to structure it as an xml document.

If we compare a Cocoon output pipeline with a unix pipeline, it always ignores standard input and always writes to standard output. An input pipeline would be the opposite: it would always read from standard input and ignore standard output. In Cocoon this would mean that the input source is set by the context. In a servlet, input would be taken from the input stream of the request object. We could also have a writable cocoon: protocol where the input stream is set by the user of the protocol, more about that later (see also my post in the thread [1]). An example:

  <match pattern="**.xls">
    <generate type="xls"/>
    <transform type="xsl" src="foo.xsl"/>
    <serialize type="xml" dest="context://repository/{1}.xml"/>
  </match>

Here the generator reads an Excel document from the input stream that is submitted by the context, and translates it to some xml format. The serializer writes its xml input to the file system.

I reused the names generator and serializer partly because I didn't find any good names (a deserializer is the inverse of a serializer, but what is the inverse of a generator?), and partly because it would IMO be the best solution if the generators and serializers from output pipelines could be extended to be usable in input pipelines as well. Several of the existing generators would be highly usable in input pipelines if they were modified in such a way that they read from "standard input" when no src attribute is given. There are also some serializers that would be useful in input pipelines; in this case the output stream given in the dest attribute should be used instead of the one supplied by the context. It can of course be problematic to extend the definition of generators and serializers, as it might lead to back compatibility problems.
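As a sketch of the "no src attribute means read from standard input" rule, a stream-aware xml generator could behave roughly as below. The interface is hypothetical and simplified, not the actual Cocoon Generator contract; only the fallback logic is the point:

  import java.io.InputStream;
  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.ContentHandler;
  import org.xml.sax.InputSource;
  import org.xml.sax.XMLReader;

  /** Hypothetical sketch: parse from src if given, else from the context input. */
  public class StreamAwareXmlGenerator {

      private ContentHandler consumer;   // next stage in the pipeline
      private String src;                // value of the src attribute, may be null
      private InputStream contextInput;  // "standard input", e.g. the request body

      public void setConsumer(ContentHandler consumer) { this.consumer = consumer; }

      public void setup(String src, InputStream contextInput) {
          this.src = src;
          this.contextInput = contextInput;
      }

      public void generate() throws Exception {
          XMLReader reader =
              SAXParserFactory.newInstance().newSAXParser().getXMLReader();
          reader.setContentHandler(consumer);
          if (src != null) {
              reader.parse(new InputSource(src));          // behaves as today
          } else {
              reader.parse(new InputSource(contextInput)); // input pipeline mode
          }
      }
  }

A serializer could get the mirror-image rule: write to the dest attribute when one is given, otherwise to the output stream supplied by the context.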
Another example of an input pipeline:

  <match pattern="in">
    <generate type="textparser">
      <parameter name="grammar" value="example.txt"/>
    </generate>
    <transform type="xsl" src="foo.xsl"/>
    <serialize type="xsp" src="toSql.xsp"/>
  </match>

In this example the serializer modifies the content of components that can be found through the object model and the component manager. We use a hypothetical "output xsp" language to describe how to modify the environment. Such a language could be a little bit like xslt in the sense that it recursively applies templates (rules) with matching xpath patterns, but the templates would contain custom tags that have side effects instead of just emitting xml. Could such a language be implemented in Jelly? It would be useful to have custom tags that modify the session object, write to sql databases, connect to business logic and so on.

Error Handling
--------------

Error handling in input pipelines is even more important than in output pipelines: we must protect the system against non-well-formed input, and the users must be given detailed enough information about what is wrong, while in many cases they have no access to log files or to the internals of the system. Examples of things that can go wrong are that the input is not parsable, or that it is not valid with respect to some grammar or schema.

If we want input pipelines to work in streaming mode, without unnecessary buffering, it is impossible to know that the input data is correct until all of it has been processed. This means that the serializer might already have stored some parts of the pipeline data when an error is detected. I think that serializers for which faulty input data would be unacceptable should use some kind of transactions, and that they should be notified when something goes wrong earlier in the pipeline so that they are able to roll back the transaction. I have not studied the error handling system in Cocoon; maybe there already are mechanisms that could be used in input pipelines as well?
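To illustrate the transactional idea (the class and the abort() hook are hypothetical, and the jdbc url is a placeholder), a sql-storing serializer could look roughly like this:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.SQLException;
  import org.xml.sax.SAXException;
  import org.xml.sax.helpers.DefaultHandler;

  /**
   * Hypothetical sketch of a transactional serializer: it stores data as sax
   * events arrive, but commits only when the whole input has proven well
   * formed and valid. abort() is the notification from the pipeline that an
   * earlier stage failed.
   */
  public class TransactionalSqlSerializer extends DefaultHandler {

      private Connection connection;

      public void startDocument() throws SAXException {
          try {
              connection = DriverManager.getConnection("jdbc:...");  // placeholder url
              connection.setAutoCommit(false);                       // begin transaction
          } catch (SQLException e) {
              throw new SAXException(e);
          }
      }

      // startElement()/characters() would translate sax events into inserts here.

      public void endDocument() throws SAXException {
          try {
              connection.commit();   // all input processed without errors
              connection.close();
          } catch (SQLException e) {
              throw new SAXException(e);
          }
      }

      /** Called when an earlier pipeline stage reports an error. */
      public void abort() {
          try {
              connection.rollback(); // undo the partially stored input
              connection.close();
          } catch (SQLException e) {
              // nothing more we can do here
          }
      }
  }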
In Sitemaps
-----------

In a sitemap an input pipeline could be used e.g. for implementing a web service:

  <match pattern="myservice">
    <generate type="xml">
      <parameter name="scheme" value="myInputFormat.scm"/>
    </generate>
    <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
    <serialize type="dom-session" non-terminating="true">
      <parameter name="dom-name" value="input"/>
    </serialize>
    <select type="pipeline-state">
      <when test="success">
        <act type="my-business-logic"/>
        <generate type="xsp" src="collectTheResult.xsp"/>
        <serialize type="xml"/>
      </when>
      <when test="non-valid">
        <!-- produce an error document -->
      </when>
    </select>
  </match>

Here we first have an input pipeline that reads and validates xml input, transforms it to some appropriate format and stores the result as a dom-tree in a session attribute. A serializer normally means that the pipeline should be executed and thereafter an exit from the sitemap; I used the attribute non-terminating="true" to mark that the input pipeline should be executed but that there is more to do in the sitemap afterwards.

After the input pipeline there is a selector that selects the output pipeline depending on whether the input pipeline succeeded or not. This use of selection has some relation to the discussion about pipe-aware selection (see [3] and the references therein). It would solve at least my main use cases for pipe-aware selection, without having its drawbacks: Stefano considered pipe-aware selection a mix of concerns, as selection should be based on meta data (pipeline state) rather than on data (pipeline content), and there were also some people who didn't like my use of buffering of all input to the pipe-aware selector. IMO the use of selectors above solves both of these issues.

The output pipeline starts with an action that takes care of the business logic for the application. This is IMHO a more legitimate use for actions than the current mix of input handling and business logic.

In Flowscripts
--------------

IIRC the discussion and examples of input for flowscripts have so far mainly dealt with request-parameter-based input. If we want to use flowscripts for describing e.g. web service flow, more advanced input handling is needed. IMO it would be an excellent separation of concerns to use output pipelines for the presentation of the data used in the system, input pipelines for going from input to system data, java objects (or some other programming language) for describing the business logic working on the data within the system, and flowscripts for connecting all this in an appropriate temporal order.

For Reusability Between Blocks
------------------------------

There have been some discussions about how to reuse functionality between blocks in Cocoon (see the threads [1] and [2] for background). IMO (cf. my post in the thread [1]), a natural way of exporting pipeline functionality is by extending the cocoon pseudo protocol so that it accepts input as well as produces output. The protocol should also be extended so that input as well as output can be any octet stream, not just xml. If we extend generators so that their input can be set by the environment (as proposed above), we have what is needed for creating a writable cocoon protocol. The web service example in the section "In Sitemaps" could then also be used as an internal service, exported from a block.

Both input and output for the extended cocoon protocol can be either xml or non-xml, which gives us four cases:

  xml input, xml output: could be used from a "pipeline" transformer; the input to the transformer is redirected to the protocol and the output from the protocol is redirected to the output of the transformer.

  non-xml input, xml output: could be used from a generator.

  xml input, non-xml output: could be used from a serializer.

  non-xml input, non-xml output: could be used from a reader if the input is ignored, from a "writer" if the output is ignored, and from a "reader-writer" if both are used.

Generators that accept xml should of course also accept sax events for efficiency reasons, and serializers that produce xml should for the same reason also be able to produce sax events.
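A sketch of what the caller's side of such a writable cocoon protocol could look like (none of these interfaces exist today; the names and methods are purely hypothetical):

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;

  /**
   * Hypothetical contract for a writable cocoon: source. The caller writes
   * the "standard input" of the service and reads its "standard output";
   * whether those streams carry xml or not distinguishes the four cases above.
   */
  public interface WritableCocoonSource {

      /** Stream that feeds the input pipeline behind e.g. "cocoon:/myservice". */
      OutputStream getInputToService() throws IOException;

      /** Stream that delivers whatever the output pipeline produced. */
      InputStream getOutputFromService() throws IOException;
  }

A "pipeline" transformer would then push its incoming sax events, serialized as xml, into getInputToService() and parse getOutputFromService() back into sax events for the next stage, while a block-level client could use the same two streams for non-xml data.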
Conclusion
----------

The ability to handle structured input (e.g. xml) in a convenient way will probably be an important requirement on webapp frameworks in the near future. By removing the asymmetry between generators and serializers, i.e. by letting the input of a generator be set by the context and the output of a serializer be set from the sitemap, Cocoon could IMO be as good at handling input as it is today at producing output. This would also make it possible to introduce a writable as well as readable Cocoon pseudo protocol, which would be a good way to export functionality from blocks.

There are of course many open questions, e.g. how to implement these ideas without introducing too much back incompatibility.

What do you think?

/Daniel Fagerstrom

References
----------

[1] [RT] Using pipeline as sitemap components (long)
    http://marc.theaimsgroup.com/?t=103787330400001&r=1&w=2

[2] [RT] reconsidering pipeline semantics
    http://marc.theaimsgroup.com/?t=102562575200001&r=2&w=2

[3] [Contribution] Pipe-aware selection
    http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=101735848009654&w=2