[xml-dev] Is the flat XML format the best approach for XSLT? (earlier: office:document-content vs office:document)

Ian Shields Mon, 10 Dec 2007 16:06:57 -0800

> The reason why the XSLT processor requires office:document and not 
> office:document-content is that it expects to receive one single 
> unzipped XML stream/file, where at least content.xml, meta.xml and 
> styles.xml are being merged to one, often called the "flat XML file".
> Even embedded pictures of the ODF document are being merged into this 
> XML file Base64 encoded.
> 
> 
> Is this the best way to do it?
> 
> It was very tempting for an XSLT developer to have all sources in one 
> input file, but now after gaining some experience I think the drawbacks 
> weigh more.


Perhaps. Perhaps not.

> 
> Most annoying to me, the stream contains resources that are not 
> required by most transformations.
> 
> For instance, I never wrote a stylesheet where the XSLT processor ever 
> used the encoded images in the flat stream.

That's probably because (as noted in your next point) the images are
encoded in Base64 encoding. You need that to make them legitimate XML,
but that also has the effect that it is practically impossible to do
anything useful with them in the context of a stylesheet. It seems many
folks are searching the web for a solution to extracting images from
OpenOffice documents and the task is listed as a contribution area
(http://wiki.services.openoffice.org/wiki/Xml#Access_to_images).

From the point of view of a stylesheet, it's actually better to process
linked images rather than embedded images. At least in my opinion.
Unfortunately, OpenOffice.org has tools to break links and thus convert
linked images to embedded images in its edit menu, but apparently no
way to perform the reverse operation. The code snippet
(http://wiki.services.openoffice.org/wiki/Xml#Access_to_images) only
seems to work with older .sxw format files, so it isn't much use with .odt.

I think this is a bigger problem for an export filter, in that such a
filter is likely to want the image extracted as a separate file. Or at
least to be able to process the native image format in some way.

Having wrestled with this issue myself, I'm leaning toward the idea of
writing a Java extension function that could take the stream as input,
decode the base64 data and figure out the MIME or file type from the
magic number in the datastream. Given something like that, I think the
base64 encoded images might suddenly become much more useful to an XSLT
filter. Maybe adding a few such utility functions to OpenOffice.org
would be a good thing to do. Of course, you could write a template to
figure out the MIME type and other information about the image, but I
think that might be more painful than a Java extension, and you would
still have no XSLT way to decode the base64 and save it as binary.
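
A minimal sketch of such a helper, in plain Java and not tied to any
particular XSLT processor's extension mechanism. The class and method
names are hypothetical, only three common image signatures are checked,
and the "application/octet-stream" fallback is my own assumption. The
MIME decoder is used because it tolerates line breaks inside the
encoded text.

```java
import java.util.Base64;

// Hypothetical helper: decode a Base64 image and guess its type
// from the leading "magic number" bytes.
final class Base64ImageSniffer {

    // Decode Base64 text; the MIME decoder skips line breaks and
    // other characters outside the Base64 alphabet.
    static byte[] decode(String base64Text) {
        return Base64.getMimeDecoder().decode(base64Text);
    }

    // Return a MIME type for a few well-known signatures, or a
    // generic fallback (an assumption) when nothing matches.
    static String detectMimeType(byte[] data) {
        if (startsWith(data, 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "image/png";
        }
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "image/jpeg";
        }
        if (startsWith(data, 'G', 'I', 'F', '8')) {
            return "image/gif";
        }
        return "application/octet-stream";
    }

    // Compare the first bytes of data against the given unsigned values.
    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Writing the decoded bytes out to a file, and wiring the two methods up
as extension functions, would depend on the processor in use.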
> 
> Images are being encoded to base64, size is being extended by 33%, 
> processed by the XML parser, XSLT processor and then being neglected.
> Seems like we are wasting resources here..
> 
We are wasting resources, but it may or may not matter. It certainly
matters in an environment needing high performance, but is probably
less important for things that happen relatively occasionally at human
interface speed.
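
For a sense of scale, the 33% figure follows from Base64 mapping every
3 input bytes to 4 output characters. A small sketch of that arithmetic
(the class name is hypothetical, and any line breaks an encoder might
add are ignored):

```java
// Hypothetical helper: size of the Base64 text for a given binary size.
final class Base64Overhead {

    // Every group of up to 3 raw bytes becomes 4 encoded characters,
    // with the last group padded, so round up to whole groups.
    static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }
}
```

So a 300,000 byte image becomes 400,000 characters of text that the
parser and the XSLT processor have to carry around.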

> An optimized transformation would only choose the streams of the 
> package, which are important to process.
> 
> A much better approach to me would be if the transformation would
> process the manifest via XSLT and choose the desired streams among all 
> possible streams.
> These package streams could be accessed via the XSLT document() function 
> and an Office Handler resolving these calls.

This might work.
> 
> An additional problem that would be solved by this approach is the 
> processing of user XML in the package. Remember, anyone could add streams 
> to the package as long as the stream is listed in the manifest.
> Currently it is neither specified nor implemented how to merge all user 
> streams into the flat XML in a generic way.
> 
> Considering the earlier mentioned waste of resources due to unnecessary 
> encoding&parsing this seems to me the wrong approach anyway.
> 
> Finally if an XSL transformation is based not on a flat xml format, but 
> on the package format, the similar transformation can easily be used 
> outside of the office, for instance as part of a browser extension.
> 
This would require something to resolve the document() calls for other
parts of the package.
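
Outside the office, one way to do that with the standard JAXP API is a
URIResolver that looks up each document() href as an entry in the
zipped package. This is only a sketch under assumptions: the class name
is hypothetical, the href is taken to be the entry name exactly as
written in the stylesheet (for example 'styles.xml'), and nothing here
consults the manifest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

// Hypothetical resolver: serves document() calls from an ODF zip package.
final class PackageUriResolver implements URIResolver {

    private final ZipFile odfPackage;

    PackageUriResolver(ZipFile odfPackage) {
        this.odfPackage = odfPackage;
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        // Assumption: href is the entry name, e.g. "styles.xml".
        ZipEntry entry = odfPackage.getEntry(href);
        if (entry == null) {
            // Returning null lets the processor try its default resolution.
            return null;
        }
        try {
            InputStream in = odfPackage.getInputStream(entry);
            return new StreamSource(in, href);
        } catch (IOException e) {
            throw new TransformerException("Cannot read package stream " + href, e);
        }
    }
}
```

It would be installed with Transformer.setURIResolver() before running
the transformation, with content.xml (or the manifest) as the primary
input. The ZipFile has to stay open until the transformation finishes.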
> Anyone here who would be able and interested to make such an improvement 
> come alive?
> 
> Svante


-- 
Ian Shields <[EMAIL PROTECTED]> or <[EMAIL PROTECTED]>

