Hi Mehdi,

I'm on hiatus from working on FOP due to various experiences and the direction the project has taken policy-wise in the last few months, but I can't help chiming in here for fear of this taking a direction that I feel is not in the interest of the project (i.e. mainly of its users). In this post, I'd like to address points from both this thread and the wiki page:
http://wiki.apache.org/xmlgraphics-fop/URIResolution

First, I would like to suggest that you start by listing the relevant code portions on the wiki before anything else. Where is what? And what's the problem in each case? Right now there are only a few vague pointers and an underlying unhappiness with the various approaches.

javax.xml.transform.URIResolver is the standard interface for URI resolution in the JAXP world, which is the most important Java API for XML processing. The SAX EntityResolver is basically its predecessor. The most popular implementation of the two interfaces is probably Apache XML Commons Resolver, an implementation of the XML Catalog standard. While URIResolver uses Strings rather than URI instances (URIResolver predating java.net.URI), it has served me very well to date. In a system I'm currently building, URIs and URI resolvers (or "resource resolvers" if you want to generalize) have introduced a huge flexibility in how resources can be resolved and accessed (be that XML or not).

Granted, URIResolver has been designed for JAXP, an XML API, but we're not only loading XML but also other resources like fonts, images, hyphenation patterns, configuration files etc., where an InputStream is usually sufficient. Using a URIResolver for non-XML resources introduces some inconvenience or complexity but OTOH allows the re-use of URIResolvers for resource types other than XML. Two examples:

1. An XML Catalog can be used to resolve URIs to actual URLs, so you don't have to know at stylesheet development time where the actual resource will come from at runtime. Unfortunately, the Apache XML Commons Resolver returns SAXSource objects with only the System ID (aka URL) set, which is not quite intuitive but doesn't really present a problem.
2. In my (OSGi-based) system I'm using the composite pattern to aggregate all URIResolver instances available in the service registry into one.
All kinds of OSGi bundles can contribute URIResolvers, which I can then set as a single resolver on FOP's FopFactory to cover whatever scenario you can think of.

When working with XML (SVG, MathML etc. images) it can be rather useful if you can resolve a URI to a DOMSource or a SAXSource, as you can generate an SVG image on the fly without the need to serialize the XML only to have FOP re-parse it again (Performance!). We've had a number of inquiries on the user list about this sort of thing.

In XGC, I've introduced the ImageStreamSource which provides an ImageIO ImageInputStream (for random access) instead of an InputStream. Reducing the system to InputStream only would require the InputStream to always be wrapped in an ImageInputStream, which would buffer the file in memory or in a temporary directory even though we might already have random access on a local file. Caching open streams between the preloading and loading stages can help avoid network latency.

I think it can be useful to think about simplifying the use of URIResolver into a new interface for resource resolution where only InputStreams are required. IMO, it should then still be possible to use a URIResolver to resolve those URIs. It should be possible to write an adapter for URIResolvers.

As for the topic of base URIs, I disagree that we don't have access to a base URI when using a SAX ContentHandler. It's perfectly feasible to track the current base URI while reacting to SAX events, e.g. when implementing support for XML Base: http://www.w3.org/TR/xmlbase/

I get the impression that you're suggesting that only a single base URI (on the FopFactory?) is required. In the past, we've had to add multiple base URIs precisely because there isn't a single base URI. Some URIs need to be resolved relative to the input FO document (base URI on FOUserAgent.base). Or they need to be resolved relative to the XSLT stylesheet in use (images may or may not be stored next to the XSLT stylesheets).
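
To illustrate the point about SAX made above, here is a minimal sketch of tracking the in-scope base URI while reacting to SAX events. It is not FOP code: the class BaseUriTrackingHandler, the demo document and its "src" attribute are made up for this example, and it only covers the simple xml:base cases.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not FOP code) that keeps one base URI per open element. */
class BaseUriTrackingHandler extends DefaultHandler {
    private static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    private final Deque<URI> baseStack = new ArrayDeque<>();
    /** URIs found in the (made-up) "src" attribute, resolved against the base in scope. */
    final List<URI> resolvedSources = new ArrayList<>();

    BaseUriTrackingHandler(URI documentBase) {
        baseStack.push(documentBase);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        URI current = baseStack.peek();
        String xmlBase = atts.getValue(XML_NS, "base");
        if (xmlBase != null) {
            // xml:base may itself be relative to the base URI of the parent element
            current = current.resolve(xmlBase);
        }
        baseStack.push(current);
        String src = atts.getValue("", "src");
        if (src != null) {
            resolvedSources.add(current.resolve(src));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Leaving the element restores the parent's base URI
        baseStack.pop();
    }
}

class BaseUriDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<doc><section xml:base='chapter1/'><image src='fig.png'/></section>"
                + "<image src='logo.png'/></doc>";
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required to look up xml:base by namespace
        BaseUriTrackingHandler handler =
                new BaseUriTrackingHandler(URI.create("http://example.org/docs/main.xml"));
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        handler.resolvedSources.forEach(System.out::println);
        // prints http://example.org/docs/chapter1/fig.png
        // prints http://example.org/docs/logo.png
    }
}
```

The stack is pushed for every element, not only for those carrying xml:base, so endElement can always pop without extra bookkeeping.
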
Fonts (FontManager.fontBase) and hyphenation patterns (FopFactory.hyphenBase) may each be at a different location. Or they could simply be relative to the main configuration file (FopFactory.base). Granted, that adds complexity but also flexibility for those who need it.

Looking at that, the signature "InputStream getInputStream(URI)" may be insufficient. As in URIResolver, you may need to extend that to "InputStream getInputStream(URI resource, URI base)", so you can get the URI resolved against the applicable base URI of the context you're working in (fonts, hyphenation patterns, config files etc.). OTOH, we have some special resolution interfaces (like FontResolver) which don't have a base URI because it is implicit and handled by the caller.

The various specialized resolver interfaces help decouple the various packages from neighbouring ones to reduce dependencies. In this context, I find it suboptimal when there are dependencies on org.apache.fop.apps from packages like "fonts", "pdf" or "hyph" because they have the potential to be used independently from FOP. More than once I have had to adjust changes that caused the PDF library to have unnecessary dependencies on additional packages (e.g. the FOUserAgent which, judging by its name alone, has nothing to do with a basic PDF library).

In this spirit, I like how Victor Mote took his FOP fork (FOray) apart into multiple modules with clearly defined dependencies. We've had discussions about doing similar things but have not come to a consensus, which is why we still have the huge, scary (for newbies), single source tree. Having the renderers in separate subprojects could allow people to scale FOP down to the subsets they need. Only a few really need AFP but it adds a lot of byte code to fop.jar. Having done a lot of OSGi in the past years, I have come to appreciate smaller JARs (Bundles in OSGi talk). This approach forces better package design and management of dependencies.
In Batik land, the build produces a number of subsystem JARs besides the "all-jar" which we bundle with FOP. Having worked on Batik, I found it a challenge to deal with one huge file tree producing multiple JARs while keeping the dependencies in order. It's a bit easier in FOP but not everyone pays attention to this. Personally, I'd still love to see FOP split up into core, util, hyph, fonts, pdf, afp, pcl, etc., with the XGC support and other stuff for SVG moving into Batik. Related to this: http://wiki.apache.org/xmlgraphics/XmlGraphicsCommonComponents

But I'm getting off course...

As for "OutputStream getOutputStream(URI)", I would put that into a separate interface since IMO combining it with the input side mixes concerns. The input side should be easy to integrate with URIResolver, but the output side will pose a problem there. Usually, you only need one or the other, but rarely both (I think the font cache is an example of the combined case). When generating pages as PNG, you have a special case where we have to pass in a file name from which other file names are derived to produce multiple files (PNG is strictly single-page). That already warrants a special interface for that purpose which could use the "getOutputStream(URI)" interface (standard functionality currently in MultiFileRenderingUtil).

Finally, a few more words about Batik: Batik does not support URIResolvers or EntityResolvers, which has hurt me more than once. You have to do tricks by registering either URL handlers or a ParsedURLProtocolHandler, both of which are registered in a static and otherwise inflexible fashion. Refactoring this would be a major task since, like URIResolvers in FOP, ParsedURL is used all over the place in Batik. AbstractFOPImageElementBridge, for example, intercepts the ParsedURL (which should actually be considered a URI, not a URL) to load external images using the XGC image loader rather than Batik's own image support.
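
To make the interface discussion above a bit more concrete, here is a rough sketch of how the input and output sides could be kept apart and how an existing URIResolver could be adapted to plain InputStream resolution. Only the two method signatures come from this thread; the names InputResourceResolver, OutputResourceResolver and URIResolverAdapter are invented for the example, and nothing like this exists in FOP as far as this post is concerned.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical input-side interface (name invented for this sketch). */
interface InputResourceResolver {
    InputStream getInputStream(URI resource, URI base) throws IOException;
}

/** Hypothetical output-side interface, kept separate so the concerns do not mix. */
interface OutputResourceResolver {
    OutputStream getOutputStream(URI resource) throws IOException;
}

/** Adapter that lets an existing JAXP URIResolver answer plain InputStream requests. */
final class URIResolverAdapter implements InputResourceResolver {
    private final URIResolver delegate;

    URIResolverAdapter(URIResolver delegate) {
        this.delegate = delegate;
    }

    @Override
    public InputStream getInputStream(URI resource, URI base) throws IOException {
        Source source;
        try {
            source = delegate.resolve(resource.toString(), base != null ? base.toString() : null);
        } catch (TransformerException e) {
            throw new IOException("Could not resolve " + resource, e);
        }
        if (source == null) {
            // URIResolver contract: null means "resolve it yourself"
            URI absolute = base != null ? base.resolve(resource) : resource;
            return absolute.toURL().openStream();
        }
        if (source instanceof StreamSource) {
            InputStream in = ((StreamSource) source).getInputStream();
            if (in != null) {
                return in;
            }
        }
        // Covers Sources that only carry a system ID, e.g. a SAXSource from a catalog resolver
        String systemId = source.getSystemId();
        if (systemId == null) {
            throw new IOException("No stream or system ID available for " + resource);
        }
        return URI.create(systemId).toURL().openStream();
    }
}

class ResolverAdapterDemo {
    public static void main(String[] args) throws IOException {
        // In-memory resolver standing in for a catalog or OSGi-provided URIResolver
        URIResolver inMemory = (href, base) -> new StreamSource(
                new ByteArrayInputStream(("data for " + href).getBytes(StandardCharsets.UTF_8)));
        InputResourceResolver resolver = new URIResolverAdapter(inMemory);
        try (InputStream in = resolver.getInputStream(
                URI.create("fonts/demo.ttf"), URI.create("file:///opt/fop/"))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            // prints "data for fonts/demo.ttf"
        }
    }
}
```

Callers that need a DOMSource or SAXSource would keep talking to the URIResolver directly. The adapter only serves the cases where a byte stream is enough.
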

In the end, I'd like to ask you:
- to pay attention to package dependencies (keeping them at a minimum)
- to avoid reducing the chance that FOP may be split up into clean modules in the future
- to minimize backwards-incompatible API changes (removing FopFactory.setURIResolver() should not be necessary, for example)
- to keep the ability to do plain URI -> InputStream resolution using URIResolver somehow.
- to preserve the ability to use DOMSource and SAXSource as image sources, i.e. to not change the XGC image loaders.

I think it makes sense to look first at the cases where java.io.File or FileInput/OutputStream is used, if one of your goals is to have FOP avoid using local files directly.

HTH

Other related links:
- http://wiki.apache.org/xmlgraphics-fop/HowTo/XmlCommonsResolver

On 09.02.2012 17:08:51 mehdi houshmand wrote:
> Hi,
>
> As I've said previously, I've been looking at unifying URI resolution,
> I've looked at a lot of the code regarding this and from what I can
> see FOP uses file access for the following types of files:
> 1) Input/Output files - by that I mean FO and output, both of which
> are many-to-many
> 2) Resources - fonts, images, hyphenation-patterns, colour profiles etc
> 3) AFP resource file - arguably could be an Output type, but not
> handled in the same way
> 4) Scratch files - used for caching and optimize-resources etc
>
> I think a lot of the URI differentiation can be done within the URI
> itself, so we can use just an interface with two methods:
>
> InputStream getInputStream(URI);
> OutputStream getOutputStream(URI);
>
> This interface will be bound to the FOUserAgent and a setter on the
> user agent will allow clients to define their own implementation.
>
> I think we can avoid having "Source getSource(URI)" in the API by
> using converting them into a javax.xml.transform.Source when necessary
> with "new StreamSource(InputStream)" (same for
> OutputStream->StreamResult). The only issue here is that Source
> objects also hold their URI, so if the source object is created one
> place, and the URI is read in another, that could be problematic.
> They're passed around a lot, so it's not so easy to chase them all the
> way through the rabbit hole. There are, however, a few more unknowns
> most relating to images, because I haven't seen how these are used in
> xgc-commons or batik. But no doubt we're going to have to make API
> changes to them anyway so we might as well cross that bridge when we
> come to it.
>
> I think we do this incrementally though, since it's going to touch so
> many areas of code, I think the best idea is to do this in steps. I
> will create a branch for Apache so each changes are made publicly
> available, so do keep an eye on it, since the changes will have quite
> far reaching effects. In terms of the nuts and bolts, I think the
> biggest difficulties here are to do with clean up. There is a
> disappointing amount of very similar but not quite the same code which
> is going to be tricksy.
>
> In terms of worries about regressions or the like, I will do my best
> to minimize any impact, however, there are quite a few different URI
> resolution methodologies. If say you're using a custom FontResolver
> and a custom FOUriResolver, then there will be an impact.
>
> So I think the best course of action is to start with the fonts
> packages, since it's probably the area of code I'm most comfortable
> with, replacing the various URI resolution methods with a single one.
>
> Thoughts?
>
> Mehdi

Jeremias Maerki
