Re: [cocoon3] Stax Pipelines

Sylvain Wallez Tue, 02 Dec 2008 08:17:02 -0800

Reinhard Pötz wrote:

I've had Stax pipelines on my radar for a rather long time because I
think that Stax can simplify the writing of transformers a lot.
I proposed this idea to Alexander Schatten, an assistant professor at
the Vienna University of Technology and he then proposed it to his
students.


A group of four students agreed to work on this as part of their
studies. Steven and I are coaching this group from October to January
and the goal is to support Stax pipeline components in Cocoon 3.

So far the students learned more about Cocoon 3, Sax, Stax and did some
performance comparisons. This week we've entered the phase where the
students have to work on the actual Stax pipeline implementation.

I asked the students to introduce themselves and also to present the
current ideas of how to implement Stax pipelines. So Andreas, Killian,
Michael and Jakob, the floor is yours!

I have spent some cycles on this subject and came to the surprising conclusion that writing Stax _pipelines_ is actually rather complex.

A Stax transformer pulls events from the previous component in the pipeline, which removes the need for the complex state machinery often needed for SAX (push) transformers by turning it into a simple function call stack and local variables. This is the main advantage of Stax over SAX.
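To illustrate the point about the call stack, here is a minimal sketch of pull-style processing with the standard javax.xml.stream API. The class name, the method names and the <title> element are made up for the example. The "I am inside a title" state that a SAX handler would keep in fields is simply the stack frame of readText().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class TitleExtractor {

    // Collects the text content of every <title> element
    // (the element name is an assumption made for this example).
    static List<String> titles(String xml) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (in.hasNext()) {
                if (in.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(in.getLocalName())) {
                    result.add(readText(in));
                }
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the example short.
            throw new IllegalStateException(e);
        }
        return result;
    }

    // Called just after a START_ELEMENT; pulls events until the matching
    // END_ELEMENT and returns all character data found in between.
    // The nesting depth is a local variable, not handler state.
    static String readText(XMLStreamReader in) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        int depth = 1;
        while (depth > 0) {
            int event = in.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            } else if (event == XMLStreamConstants.CHARACTERS) {
                text.append(in.getText());
            }
        }
        return text.toString();
    }
}
```

With SAX, the same logic would need a flag and a depth counter stored in the handler and updated across startElement(), characters() and endElement() callbacks.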

But how does a transformer expose its result to the next component in the chain so that this next component can also pull events in the Stax style?

When it produces an event, a Stax transformer should put this event somewhere so that it can be pulled and processed by the next component. But pulling also means the transformer does not suspend its execution since it continues pulling events from the previous component. This is actually reflected in the Stax API which provides a pull-based XMLStreamReader, but only a very SAX-like XMLStreamWriter.


So a Stax transformer is actually a pull input / push output component.
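A minimal sketch of such a component, using only the standard XMLStreamReader and XMLStreamWriter. The class name and the upper-casing logic are invented for the illustration, and attributes and namespaces are ignored to keep it short. Note that the loop never yields: it keeps pulling from the reader and pushing to the writer until the input is exhausted.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class UpperCaseTransformer {

    // Pull input, push output: events are requested from 'in' one by one,
    // but results are handed to 'out' through SAX-like write calls.
    // Attributes and namespaces are deliberately dropped in this sketch.
    static void transform(XMLStreamReader in, XMLStreamWriter out)
            throws XMLStreamException {
        while (in.hasNext()) {
            switch (in.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    out.writeStartElement(in.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    out.writeCharacters(in.getText().toUpperCase(Locale.ROOT));
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    out.writeEndElement();
                    break;
                default:
                    // Other event types are skipped in this sketch.
                    break;
            }
        }
        out.flush();
    }

    // Convenience wrapper that runs the transformer on a string.
    static String run(String xml) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringWriter sink = new StringWriter();
            XMLStreamWriter out = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(sink);
            transform(in, out);
            return sink.toString();
        } catch (XMLStreamException e) {
            // Wrapped only to keep the example short.
            throw new IllegalStateException(e);
        }
    }
}
```

Chaining two of these directly is not possible: the second one needs an XMLStreamReader, while the first one only offers an XMLStreamWriter to push into.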

To allow the next component in the pipeline to also be pull-based, there are 3 solutions (at least this is what I came up with):


Buffering
---------

The XMLStreamWriter that the transformer writes to buffers all events in a data structure similar to our XMLByteStreamCompiler, which can be used as an XMLStreamReader by the next component in the chain. The pipeline object then has to call some execute() method on every component in the pipeline in sequence, after having provided them with the proper buffer-based reader and writer.

Execution is single-threaded, which fits well with all the non-thread-safe classes and threadlocals we usually have in web applications, but it requires buffering, which somewhat defeats the purpose of stream-based processing and can make it simply impossible to process large documents.

Note however that because it is single-threaded, we can work with two buffers (one for input, one for output) that are reused regardless of the number of components in the pipeline.
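The two-buffer scheme could look roughly like the sketch below. This is only an illustration: the Stage interface, the class name and the two sample stages are hypothetical, and plain queues of javax.xml.stream XMLEvent objects stand in for a real buffer-backed XMLStreamReader/XMLStreamWriter pair.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class BufferedPipeline {

    /** Hypothetical component contract: pull from 'in', push to 'out'. */
    interface Stage {
        void execute(Deque<XMLEvent> in, Deque<XMLEvent> out);
    }

    // Runs every stage to completion, one after the other, swapping the
    // same two buffers so that memory use does not grow with pipeline length.
    static Deque<XMLEvent> run(String xml, List<Stage> stages) {
        Deque<XMLEvent> in = new ArrayDeque<>();
        Deque<XMLEvent> out = new ArrayDeque<>();
        try {
            XMLEventReader parser = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (parser.hasNext()) {
                in.add(parser.nextEvent()); // the whole document is buffered
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the example short.
            throw new IllegalStateException(e);
        }
        for (Stage stage : stages) {
            stage.execute(in, out);
            in.clear();
            Deque<XMLEvent> swap = in; // output becomes the next input
            in = out;
            out = swap;
        }
        return in;
    }

    // Sample stage: upper-cases character data.
    static final Stage UPPER = (in, out) -> {
        XMLEventFactory factory = XMLEventFactory.newInstance();
        while (!in.isEmpty()) {
            XMLEvent e = in.poll();
            out.add(e.isCharacters()
                    ? factory.createCharacters(
                            e.asCharacters().getData().toUpperCase(Locale.ROOT))
                    : e);
        }
    };

    // Sample stage: appends "!" to character data.
    static final Stage EXCLAIM = (in, out) -> {
        XMLEventFactory factory = XMLEventFactory.newInstance();
        while (!in.isEmpty()) {
            XMLEvent e = in.poll();
            out.add(e.isCharacters()
                    ? factory.createCharacters(e.asCharacters().getData() + "!")
                    : e);
        }
    };

    // Concatenates the character data left in a buffer.
    static String text(Deque<XMLEvent> events) {
        StringBuilder sb = new StringBuilder();
        for (XMLEvent e : events) {
            if (e.isCharacters()) {
                sb.append(e.asCharacters().getData());
            }
        }
        return sb.toString();
    }
}
```

Each stage is still written in pull style, but the full document sits in memory between stages, which is exactly the limitation described above.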


Multithreading
--------------

Each component of the pipeline runs in a separate thread, and writes its output into an event queue that is consumed asynchronously by the next component in the pipeline. The event queue is presented as an XMLStreamReader to the next component.

This approach requires very little buffering (and we can even have an upper bound on the event queue size). It also makes good use of the parallel processing capabilities of multi-core CPUs, although in web apps the parallelism is also handled by concurrent HTTP requests. This is typically the approach that would be used with Erlang or Scala actors.

Multithreading has some issues though, since the servlet API more or less implies that a single thread processes the request and we may have some concurrency issues. Web app developers also take single threading as a basic assumption and use threadlocals here and there.

This approach also prevents the reuse of char[] buffers as is usually done by XML parsers, since events are processed asynchronously. All char[] contents have to be copied, but this is a minor issue.
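A rough sketch of the bounded event queue idea, with one producer thread and the calling thread as consumer. The class name is hypothetical, XMLEvent objects from the javax.xml.stream event API are used as queue items (they hold their text as Strings rather than as a shared char[]), and error propagation from the producer is left out, so a parse failure would leave the consumer blocked.

```java
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class QueuedStages {

    // Parses 'xml' in a producer thread and consumes the events in the
    // calling thread. At most 'capacity' events are in flight at any time.
    // Assumes well-formed input: producer failures are not reported to the
    // consumer in this sketch.
    static String run(String xml, int capacity) {
        BlockingQueue<XMLEvent> queue = new ArrayBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                XMLEventReader parser = XMLInputFactory.newInstance()
                        .createXMLEventReader(new StringReader(xml));
                while (parser.hasNext()) {
                    queue.put(parser.nextEvent()); // blocks when the queue is full
                }
            } catch (XMLStreamException e) {
                throw new IllegalStateException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        StringBuilder text = new StringBuilder();
        try {
            while (true) {
                XMLEvent event = queue.take(); // blocks when the queue is empty
                if (event.isCharacters()) {
                    text.append(event.asCharacters().getData());
                }
                if (event.isEndDocument()) {
                    break; // END_DOCUMENT doubles as the end-of-stream marker
                }
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return text.toString();
    }
}
```

A longer pipeline would put one such queue between each pair of components, and any ThreadLocal set by the servlet container thread would not be visible from the producer thread, which is the concern raised above.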


Continuations
-------------

When a transformer sends an event to the next component in the chain, its execution is suspended and captured in a continuation. The continuation of the next pipeline component is resumed until it has consumed the event. We then switch back to the current component until it produces an event, and so on.

This approach is single-threaded and so avoids the concurrency issues mentioned above, and also avoids buffering. But there is certainly a high overhead with the large number of continuation captures and resumes. This number can be reduced though if we have some level of buffering to allow processing of several events in one capture/resume cycle.

It also requires all the bytecode of transformers to be instrumented for continuations, which in itself adds quite some memory and processing overhead. Torsten also posted on this subject quite long ago [1].



Conclusion
----------

All things considered, I came to the conclusion that a full Stax pipeline either requires buffering to be reliable (but then we are no longer streaming), or requires very careful inspection of all components for multi-threading issues.

So in the end, Stax probably has to be considered as a helper _inside_ a component to ease processing: buffer all SAX input, then pull the received events to avoid complex state automata.

Looks like I'm in a "long mail" period and I hope I haven't lost anybody here :-)


So, what do you think?

Sylvain

[1] http://vafer.org/blog/20060807003609

--
Sylvain Wallez - http://bluxte.net
