Hochsteger Andreas /INFO-MA wrote:

While I understand your concept, I strongly disagree: SAX provides a multidimensional structured data space which is suitable for *any* kind of data structure.

That's interesting.
Do you mean namespaces by multidimensional structured data space?
yes

But I doubt that placing binary or other non-XML text inside structured XML tags will solve it all ;-)
of course. but again, I don't think cocoon should try to do everything.

True, maybe not as efficiently as other formats, but removing a fixed contract between pipeline components will require a pluggable and metadata-driven parsing/serialization stage between each component.

I don't see any value in this compared to the current approach of SAX adaptation of external data to the internal model.

Perhaps you misunderstand something here.
I don't want to change the way Cocoon handles SAX events right now.
It's more about how we could handle non-SAX data streams a bit better.
Cocoon does not handle non-SAX data streams (besides readers, but I don't want to see them turned into pipelines).
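As a concrete illustration of that adaptation, here is a rough Python sketch (the CSV source and the element names are invented for the example; Cocoon itself does this in Java with SAX ContentHandlers): a small adapter pushes SAX events describing a non-XML stream into any SAX handler.

```python
from io import StringIO
from xml.sax.saxutils import XMLGenerator

def csv_to_sax(lines, handler):
    """Adapt a non-XML data stream (CSV rows) to SAX events by calling
    the handler's event methods directly -- no XML text ever exists."""
    handler.startDocument()
    handler.startElement("rows", {})
    for line in lines:
        handler.startElement("row", {})
        for field in line.strip().split(","):
            handler.startElement("field", {})
            handler.characters(field)
            handler.endElement("field")
        handler.endElement("row")
    handler.endElement("rows")
    handler.endDocument()

# XMLGenerator is the stdlib serializer: SAX events back to XML text
out = StringIO()
csv_to_sax(["a,b", "c,d"], XMLGenerator(out))
print(out.getvalue())
```

Any SAX-consuming component (a transformer, a serializer) can sit downstream of such an adapter without knowing the data was never XML.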

I'm sure some of you wanted to be able to build applications the same way Unix shell pipes work. Cocoon was a big step in this direction, but it was only applicable to processing XML data.
*only XML* is misleading. *based on SAX* is the right phrase. I've never perceived this as a limitation, but as a paradigm shift.

Agreed.
But the real world is not SAX-based, and a better way to handle non-SAX data streams is needed.
Great, but what does Cocoon have to do with this?

Topologically speaking, the solution space is rotated, but its size is not reduced.


There are so many cases where pipeline processing of data (no matter if it is XML, plain text or binary data) is done today, but we are lacking a generic and declarative way to unify these processing steps. Cocoon is best suited for this task through its clean and easy to understand yet powerful pipeline concept.
If you want to create pipelines for general data, why use Cocoon? Just use UNIX pipes, servlet filters, Apache 2.0 modules or any kind of 'byte-oriented' (thus unstructured-data) pipes-and-filters modules.

This way I lose the great descriptive concept of Cocoon pipelines and the integration with it.
Integration with what? Cocoon has components that are heavily XML oriented. Providing components for other types of data streams will not make them interoperable, since other data streams will have different realms and different needs.

As far as descriptive concepts go, nobody stops you from using the same markup we use in the sitemap to describe your other pipelines in another framework targeted at other types of data.


If you remove the structure from the data that flows through the pipeline, Cocoon will not be Cocoon anymore. This is not evolution, it's extinction.

Same misunderstanding as above.
As I pointed out in "11 Converting old sitemaps to new sitemaps", the components dealing with "/text/xml" are not very different from those available today.
I don't want to remove the structure from the data flowing through the pipeline in any way.
Good.

4 Pipeline Types
================

I tried to design several pipeline variants, but after thinking a while they were all still too limited for the way I wanted them to work.

So here's another try by giving some hypotheses first:
1. A pipeline can produce data
2. A pipeline can consume data
3. A pipeline can convert data
4. A pipeline can filter data
5. A pipeline can accept a certain data format as input
6. A pipeline can produce a certain data format as output
7. Pipeline components follow the same hypotheses (1-6)
8. Only pipeline components with compatible data formats can be arranged next to each other
Ah, here you hint that you don't want to remove data structured-ness in the pipeline; you just want to add *other* data structures besides SAX events.

Yes, that's what I want to do...
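For what it's worth, hypotheses 5, 6 and 8 above amount to a simple compatibility check at pipeline composition time. A minimal Python sketch (the component names and format strings are made up for illustration):

```python
class Component:
    """A pipeline component declaring what it accepts and produces
    (hypotheses 5 and 6)."""
    def __init__(self, name, input_format, output_format):
        self.name = name
        self.input_format = input_format
        self.output_format = output_format

def compatible(upstream, downstream):
    # Hypothesis 8: adjacent components must agree on the data format.
    return downstream.input_format == upstream.output_format

producer = Component("uri-producer", None, "/text/xml")
pretty   = Component("prettyxml", "/text/xml", "/text/xml")
mixer    = Component("audio-mixer", "/abstract/sound", "/abstract/sound")

print(compatible(producer, pretty))  # the formats match
print(compatible(producer, mixer))   # rejected at composition time
```

The point of such a check is that nonsensical arrangements (an audio mixer after an XML producer) fail when the pipeline is assembled, not at runtime.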


Ok, this is worth investigating.

[snip]


5 Data Formats
==============

With "data format" I mean something like XML, plain text,
png, mp3, ...

I'm not yet really sure how we should specify data formats, so I'll try to start with some requirements:
1. They should be easy to remember and to specify ;-)
2. It should be possible to create derived data formats (-> inheritance)
3. It should be possible to specify additional information (e.g. MIME type, DTD/Schema for XML, ...)
4. Pipelines which accept a certain data format as input can be fed with derived data formats
5. We should not reinvent standards which are already suited for this task (but I fear there does not yet exist anything suitable)
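Requirement 4 (derived formats are accepted wherever their base format is) can be sketched as simple path-prefix matching, assuming formats are written as slash-separated paths as in the examples throughout this proposal:

```python
def is_derived(fmt, base):
    """Requirement 4: a format derived from `base` (e.g. /text/xml/svg
    from /text/xml) is acceptable wherever `base` is accepted.
    Compare whole path segments so /text/xmlfoo does not match /text/xml."""
    f = fmt.strip("/").split("/")
    b = base.strip("/").split("/")
    return f[:len(b)] == b

print(is_derived("/text/xml/xhtml", "/text/xml"))  # derived, accepted
print(is_derived("/text/plain", "/text/xml"))      # unrelated, rejected
```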
You are asking for a very abstract parsing grammar. Note, however, that it is pretty easy to point to examples where these grammars will have to be so complex that maintaining them would be a nightmare.

I don't think that this grammar is very complex.
See "5.1 Data Format Definition".
It only consists of <data:format .../> with optional parameters.
That doesn't take into consideration the multidimensionality of the content that Cocoon is going to operate on.


Think of a BNF-like grammar that is able to explain concepts like XML namespacing or HyTime Architectural Forms.


To make it easier for us to begin with the task of defining data formats, let's assume we have three basic data formats called "abstract", "binary" and "text". The format "abstract" will be explained later, but "binary" and "text" should be clear to everyone.
Binary and text are unstructured data streams. You are falling back.

We don't fall back, since the structuredness is kept for XML.
We only gain the additional possibility to process unstructured data streams.
No, in your architecture there is no way to define that a pipeline outputs formatting objects which contain SVG figures. This is a drawback, unless you start providing a new data type for all possible combinations of namespaces (yuck!)

This is the reason why we do not describe pipelines with their input/output properties in Cocoon. This was proposed a while ago and turned down because of exactly these multi-dimensional problems.

5.1 Data Format Definition
--------------------------


[snip]


5.3 A word about MIME Types
---------------------------

If you ask me why I don't use the standardized MIME types (see [2]) to specify data formats, I can give you the following reasons: MIME types fulfill the requirements from above only partly. They support just two levels of classification, and they are purpose-oriented. The data formats I suggest are, in contrast, content-oriented (/text/xml/svg vs. image/svg+xml). So both serve different purposes.

I know the importance of supporting the MIME type standard, and so the parameter 'mime-type' is part of the super data format 'any' and thus is available for every other data format too. By specifying a certain data format, you always have a MIME type associated; in the worst case the MIME type from the super data format 'any' (application/octet-stream) is used.
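The fallback described above could look like a walk up the format hierarchy. In this sketch the table entries are only examples (not a proposed registry), and "/" stands in for the super data format 'any':

```python
# Hypothetical mapping from content-oriented formats to MIME types.
MIME = {
    "/text/xml/svg": "image/svg+xml",
    "/text/xml":     "text/xml",
    "/":             "application/octet-stream",  # the super format 'any'
}

def mime_type(fmt):
    """Walk up the format path until a MIME type is found; the worst
    case is the super format's application/octet-stream."""
    while fmt not in MIME:
        fmt = fmt.rsplit("/", 1)[0] or "/"
    return MIME[fmt]

print(mime_type("/text/xml/xhtml"))   # inherits text/xml
print(mime_type("/binary/whatever"))  # falls back to application/octet-stream
```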

From what I see so far, you are describing nothing different (from an architectural point of view) from what we already have.

That's not what I wanted to do.


5.4 Data Handlers
-----------------

I'm not very sure what the data handlers actually do, but I can think of either defining an interface which must be implemented by the pipeline components that operate with a certain data format (do we need two handlers here: input-handler and output-handler?), or they are concrete components which can be used by the pipeline components to consume or produce this data format. I think some discussion on this topic might not be bad.
Here you hit the nerve.

If you plan on having a different data-handling interface for each data type (or data-type family), the permutation of components will kill you.

Yes, I was aware of this problem.
That's why I'm very interested to hear your comments ;-)

But what I don't mean here is an interface for each data type.
I rather mean providing a reusable component which knows how to deal with a certain data format.
This component can be used by other pipeline components.
This component has a name: parser. Then a parser has to come up with something, and this something is normally an object model. Then you have to adapt your object model to some contract that other components will have to agree upon. Then you'll find out that these object model + parsing + serialization stages are awfully slow and memory consuming.

But I have not thought about it very much yet.
Sorry, but it shows :)


5.5 Data Format Determination
-----------------------------

In many cases I've written the input and output format along with the pipeline components, but it is also possible to specify them in the <map:components/> section, or implicitly by implementing a certain component interface and therefore omitting them in the pipeline.

Here's a suggested order of data format determination:

1. Input/output format specified directly with a pipeline component
   <map:produce type="uri" ref="docs/file.xml" output-format="/text/xml"/>

2. Input/output format specified by the component declaration
   <map:filters>
     <map:filter name="prettyxml" input-format="/text/xml" output-format="/text/xml" ... />
   </map:filters>

3. Output/input format specified by the previous or following pipeline component
   <map:produce type="uri" ref="docs/file.xhtml" output-format="/text/xml/xhtml"/>
   <!-- input- and output-format="/text/xml/xhtml" from previous pipeline component -->
   <map:filter type="prettyxml"/>

4. Input/output format specified directly with a pipeline
   <map:pipeline input-format="/text/xml" output-format="/text/xml">
     <map:filter type="prettyxml"/>
     ...
   </map:pipeline>

5. If nothing from above matches then assume "none".
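The suggested order boils down to first-match precedence; a sketch (the function and parameter names are invented for illustration):

```python
def resolve_input_format(component_attr, declaration, neighbor, pipeline_attr):
    """Suggested determination order: format on the component itself,
    then its declaration, then the adjacent component, then the
    pipeline; if nothing matches, assume 'none'."""
    for candidate in (component_attr, declaration, neighbor, pipeline_attr):
        if candidate is not None:
            return candidate
    return "none"

# e.g. only the component declaration carries a format:
print(resolve_input_format(None, "/text/xml", "/text/xml/xhtml", None))
```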
eheh, I wish it was that easy ;-)

Suppose you have a component that operates only on the svg: namespace of a SAX stream; what is the input type?

If data types are monodimensional, the above is feasible, but Cocoon pipelines are *already* multi-dimensional and the above can't possibly work (this has been discussed extensively before, for pipeline validation).

You got me!
This is something I hadn't thought about until now.
Perhaps using only "/text/xml" for such cases, without dealing with derived XML data formats, solves it?
No, then you are back with no information on the type other than "this is XML", which doesn't mean anything and doesn't contain enough information to understand how to compose pipelines.


6 Pipeline Components
=====================
[snip]

Assuming you have several structured pipelines:

- SAX -> all xml/sgml content
- output/input streams -> unstructured text/binary
- OLE -> all OLE-based files (word, excel, blah blah)
- MPEG -> all MPEG-based framed multimedia (MPEG1/2, mp3)

why would you want to mix them into the same system?

I mean, if you want to apply structured-pipeline architectures to, say, audio editing, you are welcome to do so, but why in hell should Cocoon have to deal with this?

Because ...
* it provides a good framework for these tasks
These tasks? What? Generation of 3D renderings on the server? There are much better frameworks to do 3D rendering, video/audio editing, or for calling Unix pipeline command-line things.

* more and more data processing is done in XML (even publishing, 3D, music, ...)
So why do you need another data pipeline?

* it is necessary to integrate both, for migration from legacy data formats to XML
We are already doing this thru adaptation of non-XML data formats to SAX events and back.

You are very close to win the prize for the FS-award of the year :)

Oh, what a privilege ;-)


It *would* make sense to add these complexities only if processing performed in different realms could be interoperated. But I can't see how.

what does it mean to perform xstl-transformation on a video stream?

what does it mean to perform audio mixing on an email?

The 'misuse' you sketched will be detected through the use of data formats:
* An XSLT-Transformer will only operate on "/text/xml"
* An Audio-Mixer will only operate on "/abstract/sound"
Bingo. So why should they live in the same project?


It would not make any sense to add functionality inside Cocoon that does not belong in the realm of its problem space. It would only dilute the effort in additional complexity just for the sake of flexibility.

Cocoon is already used for data integration in many areas.
Integration means 'adaptation'. You are describing pipelines that *DO NOT* collaborate; they just share the same environment and description markup.

The possibilities of data integration should not stop with the Reader component
Readers are supposed to *read*. Period. They do not do data integration at any stage.

and converting every legacy data format to XML before processing it is not always possible.
Right. So, if it's not possible, Cocoon is not the right tool for you.
Easy enough.

[snip]


7.1 Web Services
----------------

As many of you know, there exist two popular styles of using Web Services: SOAP and REST. Both have their own advantages and disadvantages, but I'd like to concentrate on SOAP and on its transport protocol independence, because REST-style Web Services are already possible with Cocoon.

SOAP allows us to use any transport protocol to deliver SOAP messages. Mostly HTTP(S) is used for this, but there are many cases where you have to use other protocols (like SMTP, FTP, ...). Whatever protocol you choose to invoke your Web Services, the result should always be the same, and the response should be delivered back through (mostly) the same protocol. Herein lies one of the greatest advantages of the protocol independence.
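To make the transport independence concrete: the same envelope can be wrapped for HTTP or for SMTP delivery. This sketch only builds the messages and sends nothing; the envelope is a trivial placeholder (a real one would come from a SOAP toolkit), and the host and addresses are invented:

```python
from email.message import EmailMessage

# A trivial placeholder envelope, not a real SOAP request.
envelope = (
    '<?xml version="1.0"?>'
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Body/></soap:Envelope>"
)

def as_http_post(body, host="example.org", path="/service"):
    """Wrap the envelope as the text of an HTTP POST request."""
    return (f"POST {path} HTTP/1.1\r\nHost: {host}\r\n"
            f"Content-Type: text/xml\r\n"
            f"Content-Length: {len(body)}\r\n\r\n{body}")

def as_smtp_message(body, to="service@example.org"):
    """Wrap the same envelope as a mail message for SMTP delivery."""
    msg = EmailMessage()
    msg["To"] = to
    msg["Subject"] = "SOAP request"
    msg.set_content(body, subtype="xml")  # Content-Type: text/xml
    return msg

print(as_http_post(envelope).splitlines()[0])
print(as_smtp_message(envelope)["To"])
```

The payload is identical in both cases; only the transport wrapping differs, which is exactly the property being claimed.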
No, this is not protocol independence. This is transport independence; you are still dependent on SOAP as a protocol.

What I meant was 'transport protocol independence'.

[snip]


8 Protocol Handler
==================
I don't think Cocoon should implement protocol handlers. Cocoon is a data producer; it should not deal with transport.

I agree that it is not the task of Cocoon to deal with transport.
But Cocoon does this already to a certain degree with the HTTP protocol (headers!) and is therefore bound to the HTTP protocol.
SMTP has headers.

You can't easily serialize an SVG to a JPEG and deliver it via email.
We are already working on extending the Environment to do that.

So if I want to be able to deliver the output of a pipeline via different transport channels, I have to break up this tight binding to HTTP.
How familiar are you with the Cocoon Environment classes?


We already have enough problems trying to come up with an Environment that could work with both email and web (which have orthogonal client/server paradigms); I don't want to further increase the complexity down this road.

I know that this means additional complexity, but currently this complexity is already hidden in other components (Reader, Serializer) and is therefore mixed with different concerns.
Serializers are adapters from SAX to the outside world of data formats.
Readers read.

They have different concerns and they are very well separated. Where is
the mix?

Why should an SVG2JPEG Serializer have to deal with HTTP headers?
The Serializer has to deal with Environment headers. How these headers are translated depends on the Environment implementation, which currently is either web or command line, and in the future will be mail.

I think separation of concerns is not achieved here.
I can't see how your proposed architecture can improve the use of headers, if not thru an adaptation system comparable to what we are using for the Environment.
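That adaptation can be sketched roughly like this; the class names are invented for the example and do not reflect Cocoon's actual Environment API:

```python
class Environment:
    """Components set abstract headers; the concrete Environment decides
    how they appear on the wire (a sketch, not Cocoon's real interface)."""
    def __init__(self):
        self.headers = {}
    def set_header(self, name, value):
        self.headers[name] = value

class HttpEnvironment(Environment):
    def emit(self):
        # HTTP response headers, CRLF-terminated
        return "".join(f"{k}: {v}\r\n" for k, v in self.headers.items())

class MailEnvironment(Environment):
    def emit(self):
        # SMTP has headers too; prepend the MIME preamble for mail
        return "MIME-Version: 1.0\n" + "".join(
            f"{k}: {v}\n" for k, v in self.headers.items())

# A hypothetical SVG2JPEG serializer only states its content type;
# each Environment translates it for the actual transport.
http_env, mail_env = HttpEnvironment(), MailEnvironment()
for env in (http_env, mail_env):
    env.set_header("Content-Type", "image/jpeg")
print(http_env.emit())
print(mail_env.emit())
```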

[snip]

Anyway, I think you are making the most common mistake of software architects: software design by symmetry instead of following real user requirements. This is a *BIG* and dangerous anti-pattern that normally kills software projects and bloats them into gigantic messes.

Cocoon has been evolving thru progressive refinement of real-world
requirements. I can't see any in your outline.

No, web services don't suffice. I still have to see a real use of them, and Microsoft is pushing SOAP exactly because it's bloated and paper-driven (something they know how to politically control, unlike HTTP and SMTP).

Sorry if I sound negative, but the impact of the architectural changes you propose would be terrible for our user base, and I don't want to see it happening.

--
Stefano Mazzocchi                               <[EMAIL PROTECTED]>
   Pluralitas non est ponenda sine necessitate [William of Ockham]
--------------------------------------------------------------------



