Re: URI Resolution

Chris Bowditch Tue, 21 Feb 2012 02:49:08 -0800

On 20/02/2012 23:16, Jeremias Maerki wrote:

Hi Mehdi,


Hi Jeremias,

Thanks very much for taking the time to write such a long and comprehensive e-mail. There is some useful feedback below and I've asked Mehdi to take a closer look at the details. Just one question from me at the moment. What are the policy decisions that you think the project is taking that you don't agree with? Sorry but I hate generic statements like that, if you are going to say something then you ought to give the details.


Thanks,

Chris


I'm on hiatus from working on FOP due to various experiences and the
direction the project has taken policy-wise in the last few months, but
I can't help but chime in here in fear of this taking a direction I feel
is not in the interest of the project (i.e. mainly of its users). In
this post, I'd like to address points from both this thread and the wiki
page: http://wiki.apache.org/xmlgraphics-fop/URIResolution

First, I would like to suggest you start with listing relevant code
portions on the wiki before anything else. Where is what? And what's the
problem with each case? Right now there are only a few vague pointers
and an underlying unhappiness with the various approaches.

javax.xml.transform.URIResolver is the standard interface for URI
resolution in the JAXP world which is the most important Java API for
XML processing. The SAX EntityResolver is basically its predecessor. The
most popular implementation of the two interfaces is probably Apache XML
Commons Resolver, an implementation of the XML Catalog standard. While
URIResolver uses Strings rather than URI instances (URIResolver
predating java.net.URI), it has served me very well to date. In a system
I'm currently building, URIs and URI resolvers (or "resource resolvers"
if you want to generalize) have introduced a huge flexibility in how
resources can be resolved and accessed (be that XML or not).

Granted, URIResolver has been designed for JAXP, an XML API, but we're
not only loading XML but other resources like fonts, images, hyphenation
patterns, configuration files etc., too, where an InputStream is usually
sufficient. Using a URIResolver for non-XML resources introduces some
inconvenience or complexity but OTOH allows re-use of URIResolvers
for resource types other than XML. Two examples:

1. An XML Catalog can be used to resolve URIs to actual URLs so you don't
have to know at stylesheet development time where the actual resource
will come from at runtime. Unfortunately, the Apache XML Commons
Resolver returns SAXSource objects with only the System ID (aka URL) set
which is not quite intuitive but doesn't really present a problem.

2. In my (OSGi-based) system I'm using the composite pattern to
aggregate all URIResolver instances available in the service
registry into one. All kinds of OSGi bundles can contribute URIResolvers
which I can then set as one on FOP's FopFactory to resolve whatever
scenario you can think of.
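
To illustrate the composite idea in example 2, here is a minimal sketch.
The class name CompositeURIResolver and its add/remove methods are
hypothetical (they are not part of FOP or JAXP), and the OSGi service
tracking that would call them is left out:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;

/** Hypothetical composite: asks each delegate in turn, first non-null Source wins. */
class CompositeURIResolver implements URIResolver {

    // Thread-safe list, since resolvers may come and go while documents are processed.
    private final List<URIResolver> delegates = new CopyOnWriteArrayList<>();

    void add(URIResolver resolver) {
        delegates.add(resolver);
    }

    void remove(URIResolver resolver) {
        delegates.remove(resolver);
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        for (URIResolver delegate : delegates) {
            Source source = delegate.resolve(href, base);
            if (source != null) {
                return source;
            }
        }
        // Per the URIResolver contract, null means "not resolved here";
        // the caller is then expected to fall back to its own default.
        return null;
    }
}
```

An instance of such a class could then be handed to
FopFactory.setURIResolver() like any other URIResolver.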

When working with XML (SVG, MathML etc. images) it can be rather useful
if you can resolve a URI to a DOMSource or a SAXSource, as you can
generate an SVG image on the fly without the need to serialize the XML
only to have FOP re-parse it again (Performance!). We've had a number of
inquiries on the user list about this sort of thing. In XGC, I've
introduced the ImageStreamSource which instead of an InputStream
provides an ImageIO ImageInputStream (for random access). Reducing the
system to InputStream only would require the InputStream to always be
wrapped in an ImageInputStream which would buffer the file in memory or
in a temporary directory even though we might have random access already
on a local file. Caching open streams between the preloading and loading
stages can help avoid network latency.

I think it can be useful to think about simplifying the use of
URIResolver to a new interface for resource resolution where only
InputStreams are required. IMO, it should then still be possible to use
a URIResolver to resolve those URIs. An adapter for URIResolvers should
be possible to write.
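
Such an adapter could look roughly like the following sketch. The
interface ResourceResolver and the class URIResolverAdapter are
hypothetical names made up for this example. The sketch ignores
StreamSources that only carry a Reader, and it assumes that a system ID
is a valid URI which can be opened as a URL:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;

/** Hypothetical simplified interface (not an existing FOP API). */
interface ResourceResolver {
    InputStream getInputStream(URI uri) throws IOException;
}

/** Hypothetical adapter that answers InputStream requests through a URIResolver. */
class URIResolverAdapter implements ResourceResolver {

    private final URIResolver delegate;
    private final String base; // may be null

    URIResolverAdapter(URIResolver delegate, String base) {
        this.delegate = delegate;
        this.base = base;
    }

    @Override
    public InputStream getInputStream(URI uri) throws IOException {
        Source source;
        try {
            source = delegate.resolve(uri.toString(), base);
        } catch (TransformerException e) {
            throw new IOException("Could not resolve " + uri, e);
        }
        if (source == null) {
            // Not handled by the resolver: resolve against the base ourselves.
            URI absolute = (base != null) ? URI.create(base).resolve(uri) : uri;
            return absolute.toURL().openStream();
        }
        InputStream in = null;
        if (source instanceof StreamSource) {
            in = ((StreamSource) source).getInputStream();
        } else if (source instanceof SAXSource) {
            InputSource inputSource = ((SAXSource) source).getInputSource();
            if (inputSource != null) {
                in = inputSource.getByteStream();
            }
        }
        if (in != null) {
            return in;
        }
        // Covers Sources that only carry a system ID, as described above
        // for the XML Commons Resolver.
        String systemId = source.getSystemId();
        if (systemId == null) {
            throw new IOException("No stream or system ID available for " + uri);
        }
        return URI.create(systemId).toURL().openStream();
    }
}
```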

As for the topic of base URIs, I disagree that we don't have access to a
base URI when using a SAX ContentHandler. It's perfectly feasible to
track the current base URI while reacting to SAX events, e.g. when
implementing support for XML Base: http://www.w3.org/TR/xmlbase/
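
As a rough sketch of what I mean by tracking the base URI: the handler
below keeps a stack of base URIs and applies xml:base on each element.
The class name BaseURITrackingHandler is hypothetical, and the sketch
assumes the initial document base URI is known and non-null:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Objects;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that tracks the base URI in effect for the current element. */
class BaseURITrackingHandler extends DefaultHandler {

    private static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // One entry per open element, plus the document base at the bottom.
    private final Deque<URI> baseStack = new ArrayDeque<>();

    BaseURITrackingHandler(URI documentBase) {
        baseStack.push(Objects.requireNonNull(documentBase));
    }

    /** Base URI in effect for the element currently being processed. */
    URI currentBase() {
        return baseStack.peek();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        URI base = baseStack.peek();
        // Namespace-aware lookup first, qualified name as a fallback.
        String xmlBase = atts.getValue(XML_NS, "base");
        if (xmlBase == null) {
            xmlBase = atts.getValue("xml:base");
        }
        if (xmlBase != null) {
            // xml:base may itself be relative to the inherited base.
            base = base.resolve(xmlBase);
        }
        baseStack.push(base);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        baseStack.pop();
    }
}
```

A real ContentHandler would subclass or embed something like this and
call currentBase() whenever it encounters a URI to resolve.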

I get the impression that you're suggesting that only a single base URI
(on the FopFactory?) is required. In the past, we've had to add multiple
base URIs precisely because there isn't a single base URI. Some URIs
need to be resolved relative to the input FO document (base URI on
FOUserAgent.base). Or they need to be resolved relative to the XSLT
stylesheet in use (images may or may not be stored next to the XSLT
stylesheets). Fonts (FontManager.fontBase) and hyphenation patterns
(FopFactory.hyphenBase) may be at a different location respectively. Or
they could simply be relative to the main configuration file
(FopFactory.base). Granted, that adds complexity but also flexibility
for those who need it.

Looking at that, the signature "InputStream getInputStream(URI)" may be
insufficient. Like in URIResolver, you may need to extend that to
"InputStream getInputStream(URI resource, URI base)", so you can get the
URI resolved against the applicable base URI of the context you're
working in (fonts, hyph patterns, config files etc.). OTOH, we have some
special resolution interfaces (like FontResolver) which don't have a
base URI because it is implicit and handled by the caller. The various
specialized resolver interfaces help decouple the various packages from
neighbouring ones to reduce dependencies.
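
A sketch of the two-argument variant, under the assumption that the
caller knows which base applies in its context. The names
BaseAwareResourceResolver and DefaultBaseAwareResolver are hypothetical,
as are the example URIs in the usage note:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

/** Hypothetical two-argument variant of the proposed interface. */
interface BaseAwareResourceResolver {
    InputStream getInputStream(URI resource, URI base) throws IOException;
}

/** Hypothetical default implementation that simply opens the resolved URL. */
class DefaultBaseAwareResolver implements BaseAwareResourceResolver {

    /** Resolves a possibly relative URI against the base of the current context. */
    static URI absolutize(URI resource, URI base) {
        if (resource.isAbsolute() || base == null) {
            return resource;
        }
        return base.resolve(resource);
    }

    @Override
    public InputStream getInputStream(URI resource, URI base) throws IOException {
        return absolutize(resource, base).toURL().openStream();
    }
}
```

With a font base of file:/opt/fop/fonts/ the relative URI
DejaVuSans.ttf ends up as file:/opt/fop/fonts/DejaVuSans.ttf, while
logo.png against a document base of http://example.org/docs/invoice.fo
ends up as http://example.org/docs/logo.png. The same resolver serves
both contexts; only the base passed in by the caller differs.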

In this context, I find it suboptimal when there are dependencies on
org.apache.fop.apps from packages like "fonts", "pdf" or "hyph" because
they have the potential to be used independently from FOP. More than
once did I have to adjust changes that caused the PDF library to have
unnecessary dependencies into new packages (e.g. the FOUserAgent which
even from its name doesn't have anything to do with a basic PDF library).
In this spirit, I like how Victor Mote took his FOP fork (FOray) apart
into multiple modules with clearly defined dependencies. We've had
discussions about doing similar things but have not come to a consensus
which is why we still have the huge, scary (for newbies), single source
tree. Having the renderers in separate subprojects could allow people to
scale FOP down to the subsets they need. Only a few really need AFP but
it adds a lot of byte code to fop.jar. Having done a lot of OSGi in the
past years, I have come to appreciate smaller JARs (Bundles in OSGi talk).
This approach forces better package design and management of
dependencies.

In Batik land, the build produces a number of subsystem JARs besides the
"all-jar" which we bundle with FOP. Having worked on Batik, I found it a
challenge to deal with one huge file tree producing multiple JARs while
keeping the dependencies in order. It's a bit easier in FOP but not
everyone pays attention to this.

Personally, I'd still love to see FOP split up into: core, util, hyph,
fonts, pdf, afp, pcl, etc. and getting the XGC support and other stuff
for SVG into Batik. Related to this:
http://wiki.apache.org/xmlgraphics/XmlGraphicsCommonComponents

But I'm getting off course...

As for "OutputStream getOutputStream(URI)", I would put that into a
separate interface since IMO it mixes concerns. The input side should be
easy to integrate with URIResolver, but the output side will produce a
problem here. Usually, you only need one or the other, but rarely both
(I think the font cache is an example of the combined case). When
generating pages as PNG, you have a special case where we have to pass
in a file name from which other file names are derived to produce
multiple files (PNG is strictly single-page). That already warrants a
special interface for that purpose which could use the
"getOutputStream(URI)" interface (standard functionality currently in
MultiFileRenderingUtil).
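
To make the separation concrete, here is a sketch of an output-only
interface plus a small multi-page helper built on top of it. All names
(OutputResolver, MultiPageOutput, pageURI) are hypothetical, and the
"-N" naming scheme is just an assumption for illustration; it is not
necessarily what MultiFileRenderingUtil does:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;

/** Hypothetical output-only counterpart to the input interface. */
interface OutputResolver {
    OutputStream getOutputStream(URI uri) throws IOException;
}

/** Hypothetical helper for single-page formats such as PNG. */
class MultiPageOutput {

    private final OutputResolver resolver;
    private final URI first;

    MultiPageOutput(OutputResolver resolver, URI first) {
        this.resolver = resolver;
        this.first = first;
    }

    /** Derives a per-page URI, e.g. out.png becomes out-2.png for page 2. */
    static URI pageURI(URI first, int pageNumber) {
        String s = first.toString();
        int slash = s.lastIndexOf('/');
        int dot = s.lastIndexOf('.');
        if (dot > slash) {
            // Insert the page number in front of the file extension.
            return URI.create(s.substring(0, dot) + "-" + pageNumber + s.substring(dot));
        }
        // No extension in the last path segment: just append the number.
        return URI.create(s + "-" + pageNumber);
    }

    OutputStream openPage(int pageNumber) throws IOException {
        return resolver.getOutputStream(pageURI(first, pageNumber));
    }
}
```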

Finally, a few more words about Batik: Batik does not support
URIResolvers or EntityResolvers which has hurt me more than once. You
have to do tricks by either registering URL handlers or a
ParsedURLProtocolHandler, both of which are registered in a static and
otherwise inflexible fashion. Refactoring this would be a major task
since, like URIResolvers in FOP, ParsedURL is used all over the Batik
place. AbstractFOPImageElementBridge, for example, intercepts the
ParsedURL (which should actually be considered a URI, not a URL) to load
external images using the XGC image loader rather than Batik's own image
support.

In the end, I'd like to ask you:
- to pay attention to package dependencies (keeping them at a minimum)
- to avoid reducing the chance that FOP may be split up into clean
modules in the future
- to minimize backwards-incompatible API changes (removing
FopFactory.setURIResolver() should not be necessary, for example)
- to keep the ability to do plain URI ->  InputStream resolution using
URIResolver somehow.
- to preserve the ability to use DOMSource and SAXSource as image
sources, i.e. to not change the XGC image loaders.

I think it makes sense to look at the cases first, where java.io.File or
FileInput/OutputStream is used, if one of your goals is to have FOP avoid
using local files directly.

HTH

Other related links:
- http://wiki.apache.org/xmlgraphics-fop/HowTo/XmlCommonsResolver

On 09.02.2012 17:08:51 mehdi houshmand wrote:

Hi,

As I've said previously, I've been looking at unifying URI resolution,
I've looked at a lot of the code regarding this and from what I can
see FOP uses file access for the following types of files:
1) Input/Output files - by that I mean FO and output, both of which
are many-to-many
2) Resources - fonts, images, hyphenation-patterns, colour profiles etc
3) AFP resource file - arguably could be an Output type, but not
handled in the same way
4) Scratch files - used for caching and optimize-resources etc

I think a lot of the URI differentiation can be done within the URI
itself, so we can use just an interface with two methods:

InputStream getInputStream(URI);
OutputStream getOutputStream(URI);

This interface will be bound to the FOUserAgent and a setter on the
user agent will allow clients to define their own implementation.

I think we can avoid having "Source getSource(URI)" in the API by
converting them into a javax.xml.transform.Source when necessary
with "new StreamSource(InputStream)" (same for
OutputStream->StreamResult). The only issue here is that Source
objects also hold their URI, so if the source object is created one
place, and the URI is read in another, that could be problematic.
They're passed around a lot, so it's not so easy to chase them all the
way through the rabbit hole. There are, however, a few more unknowns
most relating to images, because I haven't seen how these are used in
xgc-commons or batik. But no doubt we're going to have to make API
changes to them anyway so we might as well cross that bridge when we
come to it.
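
One way to soften the URI issue is to set the system ID at the point of
conversion, so the URI travels with the Source or Result. A minimal
sketch, where the class name StreamAdapters and its methods are
hypothetical:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helpers that wrap plain streams for JAXP consumers. */
final class StreamAdapters {

    private StreamAdapters() {
    }

    /** Wraps the stream and records the URI it came from as the system ID. */
    static Source toSource(InputStream in, URI uri) {
        return new StreamSource(in, uri.toASCIIString());
    }

    /** Wraps the stream and records the target URI as the system ID. */
    static Result toResult(OutputStream out, URI uri) {
        StreamResult result = new StreamResult(out);
        result.setSystemId(uri.toASCIIString());
        return result;
    }
}
```

Code further down the line can then still call getSystemId() on the
object it receives, which keeps relative references inside the resource
resolvable.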

I think we should do this incrementally though: since it's going to
touch so many areas of code, the best idea is to do this in steps. I
will create a branch at Apache so each change is made publicly
available, so do keep an eye on it, since the changes will have quite
far reaching effects. In terms of the nuts and bolts, I think the
biggest difficulties here are to do with clean up. There is a
disappointing amount of very similar but not quite the same code which
is going to be tricksy.

In terms of worries about regressions or the like, I will do my best
to minimize any impact, however, there are quite a few different URI
resolution methodologies. If say you're using a custom FontResolver
and a custom FOUriResolver, then there will be an impact.

So I think the best course of action is to start with the fonts
packages, since it's probably the area of code I'm most comfortable
with, replacing the various URI resolution methods with a single one.

Thoughts?

Mehdi




Jeremias Maerki
