Stefano Mazzocchi wrote:
> Hmmm, maybe deep architectural discussions are good during holiday
> seasons... we'll see :)
Not for me, I've been away from computers for a while. But you and Nicola Ken seem to have had an interesting discussion :)

The discussion about input pipelines can be divided in two parts:
1. Improving the handling of the input stream in Cocoon. This is needed for web services; it is also needed to make it possible to implement a writable cocoon: protocol, something that IMO would be very useful for reusing functionality in Cocoon, especially from blocks.

2. Using two pipelines, executed in sequence, to respond to input in Cocoon. The first pipeline (called the input pipeline) is responsible for reading the input, from request parameters or from the input stream, transforming it to an appropriate format, and storing it in e.g. a session parameter, a file or a db. After the input pipeline there is an ordinary (output) pipeline that is responsible for generating the response. The output pipeline is executed after the execution of the input pipeline has completed; as a consequence, actions and selections in the output pipeline can depend e.g. on whether the handling of input succeeded or not and on the data that was stored by the input pipeline.

Here I will focus on your comments on the second part of the proposal.

> Daniel Fagerstrom wrote:
<snip/>
>> In Sitemaps
>> -----------
>>
>> In a sitemap an input pipeline could be used e.g. for implementing a
>> web service:
>>
>> <match pattern="myservice">
>> <generate type="xml">
>> <parameter name="scheme" value="myInputFormat.scm"/>
>> </generate>
>> <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>> <serialize type="dom-session" non-terminating="true">
>> <parameter name="dom-name" value="input"/>
>> </serialize>
>> <select type="pipeline-state">
>> <when test="success">
>> <act type="my-business-logic"/>
>> <generate type="xsp" src="collectTheResult.xsp"/>
>> <serialize type="xml"/>
>> </when>
>> <when test="non-valid">
>> <!-- produce an error document -->
>> </when>
>> </select>
>> </match>
>>
>> Here we first have an input pipeline that reads and validates xml
>> input, transforms it to some appropriate format and stores the result
>> as a dom tree in a session attribute. A serializer normally means that
>> the pipeline should be executed and the sitemap exited afterwards. I
>> used the attribute non-terminating="true" to mark that the input
>> pipeline should be executed but that there is more to do in the
>> sitemap afterwards.
>>
>> After the input pipeline there is a selector that selects the output
>> pipeline depending on whether the input pipeline succeeded or not.
>> This use of selection has some relation to the discussion about
>> pipe-aware selection (see [3] and the references therein). It would
>> solve at least my main use cases for pipe-aware selection, without
>> having its drawbacks: Stefano considered pipe-aware selection a mix
>> of concerns; selection should be based on meta data (pipeline state)
>> rather than on data (pipeline content). There were also some people
>> who didn't like my buffering of all input to the pipe-aware selector.
>> IMO the use of selectors above solves both of these issues.
>>
>> The output pipeline starts with an action that takes care of the
>> business logic for the application. This is IMHO a more legitimate
>> use for actions than the current mix of input handling and business
>> logic.
>
>
> Wouldn't the following pipeline achieve the same functionality you want
> without requiring changes to the architecture?
>
> <match pattern="myservice">
> <generate type="payload"/>
> <transform type="validator">
> <parameter name="scheme" value="myInputFormat.scm"/>
> </transform>
> <select type="pipeline-state">
> <when test="valid">
> <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
> <transform type="my-business-logic"/>
> <serialize type="xml"/>
> </when>
> <otherwise>
> <!-- produce an error document -->
> </otherwise>
> </select>
> </match>

Yes, it would achieve about the same functionality as I want, and it could easily be implemented with the help of the small extensions to the sitemap interpreter that I implemented for pipe-aware selection [3].

I think it could be interesting to do a detailed comparison of our proposals: how the input stream and validation are handled, how the selection based on pipeline state is performed, whether storage of the input is done in a serializer or in a transformer, and how the new output is created.

Input Stream
------------

For input stream handling you used

<generate type="payload"/>

Is the payload generator equivalent to the StreamGenerator? Or does it do something more, like switching parsers depending on the mime type of the input stream?

I used

<generate type="xml"/>

The idea is that if no src attribute is given, the sitemap interpreter automatically connects the generator to the input stream of the environment (the input stream from the http request in the servlet case; in other cases it is less clear). This behavior was inspired by the handling of standard input in unix pipelines.

Nicola Ken proposed:

<generate type="xml" src="inputstream://"/>

I prefer this solution to mine as it doesn't require any change to the sitemap interpreter, and I also believe that it is easier to understand as it is more explicit. It also (as Nicola Ken has explained) gives a good SoC: the uri in the src attribute describes where to read the resource from, e.g. input stream, file, cvs, http, ftp, etc., and the generator is responsible for how to parse the resource. If we develop an input stream protocol, all the work invested in the existing generators can immediately be reused in web services.
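To make the SoC concrete, here is a minimal sketch in Java of how such a protocol resolution could look. All names (Environment, InputStreamProtocol) are hypothetical illustrations, not actual Cocoon code; the point is only that the protocol answers "where do the bytes come from" while the generator keeps answering "how are they parsed":

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

/** Minimal stand-in for the part of the environment that carries the
    request's input stream (the servlet request body, for instance). */
interface Environment {
    InputStream getInputStream();
}

/** Hypothetical protocol handler: "inputstream://" resolves to the
    environment's input stream. The generator that parses the stream
    never sees where the bytes came from, only how to parse them. */
class InputStreamProtocol {
    static InputStream resolve(String uri, Environment env) {
        if (uri.startsWith("inputstream://")) {
            return env.getInputStream();
        }
        throw new IllegalArgumentException("unsupported uri: " + uri);
    }
}
```

Any existing generator handed such a resolved stream would then work unchanged for web service input.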

Validation
----------

Should validation be part of the parsing of input as in:

<generate type="xml">
<parameter name="scheme" value="myInputFormat.scm"/>
</generate>

or should it be a separate transformation step:

<transform type="validator">
<parameter name="scheme" value="myInputFormat.scm"/>
</transform>

or maybe the responsibility of the protocol as Nicola Ken proposed in one of his posts:

<generate type="xml" src="inputstream:myInputFormat.scm"/>

This is not a question about architecture but rather one about finding "best practices".

I don't think validation should be part of the protocol. It would mean that the protocol has to take care of the parsing, and that would muddle the SoC that Nicola Ken has argued for in his other posts, where the protocol is responsible for locating and delivering the stream and the generator is responsible for parsing it.

Should validation be part of the generator or a transform step? I don't know. If the input is not xml, as for the ParserGenerator, I guess that the validation must take place in the generator. If the xml parser validates the input as part of the parsing, it is more practical to let the generator be responsible for validation (IIRC Xerces2 has an internal pipeline structure and performs validation in a transformer-like way, so for Xerces2 it would probably be as efficient to do validation in a transformer as in a generator). Otherwise it seems to give better SoC to separate the parsing and validation steps, so that we can have one validation transformer for each scheme language.

In some cases it might be practical to augment the xml document with error information, to be able to give more exact user feedback on where the errors are located. For such applications it seems more natural to me to have validation in a transformer.

A question that might have architectural consequences is how the validation step should report validation errors. If the input is not parseable at all, there is not much more to do than throw an exception and let the ordinary internal error handler report the situation. If some of the elements or attributes in the input have the wrong type, we probably want to return more detailed feedback than just the internal error page. Some possible validation error report mechanisms are: storing an error report object in the environment, e.g. in the object model; augmenting the xml document with error reporting attributes or elements; throwing an exception that contains a detailed error description object; or a combination of these mechanisms.

Mixing data and state information was considered bad practice in the discussion about pipe-aware selection (see references in [3]); that rules out using only augmentation of the xml document as the error reporting mechanism. Throwing an exception would AFAIU lead to difficulties in giving customized error reports. So I believe it would be best to put some kind of state-describing object in the environment and possibly combine this with augmentation of the xml document.
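As an illustration, such a state-describing object could be as simple as the following Java sketch (class and method names are hypothetical, not existing Cocoon API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical validation state descriptor that a validation step
    could place in the object model for later selectors to inspect. */
class ValidationResult {

    /** One validation error: where in the document, and what was wrong. */
    static class Error {
        final String location;
        final String message;
        Error(String location, String message) {
            this.location = location;
            this.message = message;
        }
    }

    private final List<Error> errors = new ArrayList<Error>();

    void addError(String location, String message) {
        errors.add(new Error(location, message));
    }

    /** The document is valid iff no errors were recorded. */
    boolean isValid() {
        return errors.isEmpty();
    }

    /** Detailed error list, usable for customized error pages. */
    List<Error> getErrors() {
        return Collections.unmodifiableList(errors);
    }
}
```

A state-aware selector could then fetch this object from the object model under an agreed-upon key and test isValid(), while an error page pipeline could render getErrors() into detailed user feedback.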

Pipe State Dependent Selection
------------------------------

For selecting the response based on whether the input document is valid or not, you suggest the following:

...
<transform type="validator">
<parameter name="scheme" value="myInputFormat.scm"/>
</transform>
<select type="pipeline-state">
<when test="valid">
<transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
...

As I mentioned earlier, this could easily be implemented with the "pipe-aware selection" code I submitted in [3]. Let us see how it would work:

The PipelineStateSelector cannot be executed at pipeline construction time as ordinary selectors are. The pipeline before the selector, including the ValidatorTransformer, must have been executed before the selection is performed. This can be implemented by letting the PipelineStateSelector implement a special marker interface, say PipelineStateAware, so that it can get special treatment in the selection part of the sitemap interpreter.

When the sitemap interpreter gets a PipelineStateAware selector, it first ends the currently constructed pipeline with a serializer that stores its SAX input in e.g. a dom tree; the pipeline is processed and the dom tree with the cached result is stored in e.g. the object model. In the next step the selector is executed, and it can base its decision on the result from the first part of the pipeline. If the ValidatorTransformer puts a validation result descriptor in the object model, the PipelineStateSelector can perform tests on this descriptor. In the last step a new pipeline is constructed where the generator reads from the stored dom tree, and in the example above the first transformer will be an XSLTTransformer.
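The marker-interface mechanism itself is tiny. A sketch in Java (the interpreter and selector code is illustrative, not actual Cocoon code; only the names PipelineStateAware and PipelineStateSelector come from the proposal above):

```java
import java.util.Map;

/** Marker interface: selectors implementing it must be evaluated only
    after the pipeline built so far has been executed and its result
    (and state) stored. */
interface PipelineStateAware {}

/** Simplified selector contract (Cocoon's real Selector differs). */
interface Selector {
    boolean select(String expression, Map<String, Object> objectModel);
}

/** A selector that tests a state object left in the object model by an
    earlier pipeline component, e.g. a validation transformer. */
class PipelineStateSelector implements Selector, PipelineStateAware {
    public boolean select(String expression, Map<String, Object> objectModel) {
        // Here the "state" is just a string for illustration; a real
        // implementation would inspect a richer state descriptor.
        Object state = objectModel.get("pipeline-state");
        return expression.equals(state);
    }
}

/** The one check the sitemap interpreter needs. */
class SitemapInterpreter {
    /** True if the current pipeline must be executed (with its SAX
        output buffered) before this selector can be evaluated. */
    static boolean needsPipelineExecution(Selector selector) {
        return selector instanceof PipelineStateAware;
    }
}
```

Ordinary selectors fail the instanceof test and keep their current construction-time behavior, which is why the change is backward compatible.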

An alternative, more explicit way to describe the pipeline state dependent selection above is:

...
<transform type="validator">
<parameter name="scheme" value="myInputFormat.scm"/>
</transform>
<serialize type="object-model-dom" non-terminating="true">
<parameter name="name" value="validated-input"/>
</serialize>
<select type="pipeline-state">
<when test="valid">
<generate type="object-model-dom">
<parameter name="name" value="validated-input"/>
</generate>
<transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
...

Here the extension to the current Cocoon semantics is put in the serializer instead of the selector. The sitemap interpreter treats a non-terminating serializer as an ordinary serializer in the sense that it puts the serializer at the end of the current pipeline and executes it. The difference is that instead of returning to the caller of the sitemap interpreter, it creates a new current pipeline and continues to interpret the components after the serializer, in this case a selector. The sitemap interpreter will also ignore the output stream of the serializer; the serializer is supposed to have side effects. The new current pipeline will then get an ObjectModelDOMGenerator as generator and an XSLTTransformer as its first transformer.
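The control flow change in the interpreter could be sketched like this (illustrative Java pseudologic, not actual Cocoon code; the component model is reduced to the bare minimum needed to show where interpretation stops or continues):

```java
import java.util.List;

/** Simplified model of sitemap interpretation around a serializer. */
class NonTerminatingSerializerSketch {

    interface Component {}
    static class Generator implements Component {}
    static class Transformer implements Component {}
    static class Serializer implements Component {
        final boolean nonTerminating;
        Serializer(boolean nonTerminating) { this.nonTerminating = nonTerminating; }
    }

    /** Walk a component list the way the interpreter would: every
        serializer ends and executes the pipeline built so far. An
        ordinary serializer then leaves the sitemap; a non-terminating
        one discards its output stream and interpretation continues
        with a fresh pipeline. Returns how many pipelines were run. */
    static int countExecutedPipelines(List<Component> components) {
        int executed = 0;
        for (Component c : components) {
            if (c instanceof Serializer) {
                executed++;  // execute the pipeline built so far
                if (!((Serializer) c).nonTerminating) {
                    return executed;  // ordinary serializer: exit the sitemap
                }
                // non-terminating: side effects done, output ignored,
                // start a new current pipeline and keep interpreting
            }
        }
        return executed;
    }
}
```

In the example above this means two pipelines run in sequence: the input pipeline ending in the object-model-dom serializer, then the output pipeline chosen by the selector.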

I prefer this construction to the more implicit one because it is more obvious what it does, and also because it gives more freedom in how to store the user input. Some people seem to prefer to store user input in Java beans, and in some applications session parameters might be a better place than the object model.

Pipelines with Side Effects
---------------------------

A common pattern in pipelines that handle input (at least in the applications that I write) is that the first half of the pipeline takes care of the input and ends with a transformer that stores it. The transformer can be e.g. the SQLTransformer (with insert or update statements), the WriteDOMSessionTransformer or the SourceWritingTransformer. These transformers have side effects, they store something, and they return an xml document that tells whether they succeeded or not. A conclusion from the threads about pipe-aware selection was that sending meta data, like whether the operation succeeded or not, through the pipeline is a bad practice, and especially that we should not allow selection based on such content. Given that these transformers basically translate xml input to a binary format and generate an xml output that we are supposed to ignore, it would IMO be more natural to see them as some kind of serializer.

The second half of the pipeline creates the response; here it is less obvious what transformer to use. I normally use an XSLTTransformer, typically ignoring its input stream and just creating an xml document that is rendered into e.g. html by a subsequent transformer.

I think that it would be more natural to replace the pattern:

...
<transform type="store something, return state info"/>
<transform type="create a response document, ignore input"/>
...

with

...
<serialize type="store something, put state info in the environment"
non-terminating="true"/>
<generate type="create a response document" src="response document"/>
...

If we also give the serializer a destination attribute, all the existing serializers could be used for storing input in files etc.:

...
<serialize type="xml" dest="xmldb://..." non-terminating="true"/>
...

This would give the same SoC that I argued in favour of in the context of input: the serializer is responsible for how to serialize from xml to the binary data format, and the destination is responsible for where to store the data.

Conclusion
----------

I am afraid that I pose more questions than I answer in this RT. Many of them are of a "best practice" character, do not have any architectural consequences, and do not have to be answered right now. There are however some questions that need an answer:

How should pipeline components, like the validation transformer, report state information? Placing some kind of state object in the object model would be one possibility, but I don't know.

We seem to agree that there is a need for selection in pipelines based on the state of the computation in the pipeline that precedes the selection. Here we have two proposals:

1. Introduce pipeline state aware selectors (e.g. by letting the selector implement a marker interface), and give such selectors special treatment in the sitemap interpreter.

2. Extend the semantics of serializers so that the sitemap interpreter can continue to interpret the sitemap after a serializer (e.g. by a new non-terminating attribute for serializers).

I prefer the second proposal.

Both proposals can be implemented with no backward compatibility problems at all, by requiring the selectors or serializers that need the extended semantics to implement a special marker interface, and by adding code that reacts to the marker interface in the sitemap interpreter.

To use serializers more generally for storing things, as I proposed above, the Serializer interface would need to extend the SitemapModelComponent interface.
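In interface terms the change is small. A sketch (the interfaces below are simplified stand-ins; the real Cocoon interfaces carry more methods and different setup signatures):

```java
import java.io.OutputStream;
import java.util.Map;

/** Simplified stand-in for Cocoon's SitemapModelComponent: gives a
    component access to the object model before it is used. */
interface SitemapModelComponent {
    void setup(Map<String, Object> objectModel, String src);
}

/** Today's serializer contract, reduced to its essentials: it only
    knows where to write its output. */
interface Serializer {
    void setOutputStream(OutputStream out);
}

/** The proposed contract: a serializer that can also see the object
    model, so it can store data and report state like a transformer. */
interface StoringSerializer extends Serializer, SitemapModelComponent {}
```

An existing serializer would be unaffected; only serializers that want to store data and report state would implement the extended contract.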

------

What do you think?

Daniel Fagerstrom

<snip/>

[3] [Contribution] Pipe-aware selection
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=101735848009654&w=2




