Re: URI Resolution

mehdi houshmand Tue, 21 Feb 2012 06:40:07 -0800

Hi Jeremias, I'll address your concerns inline:

<snip/>
> First, I would like to suggest you start with listing relevant code
> portions on the wiki before anything else. Where is what? And what's the
> problem with each case? Right now there are only a few vague pointers
> and an underlying unhappiness with the various approaches.


My apologies for any lack of clarity on my part; the issues around URI
resolution are in no way due to any unhappiness on my part. In fact,
rather frustratingly, great effort has been made to provide fallback
mechanisms that tolerate ambiguity from the user (which I'm sure we all
know happens all too often). However, as far as I can tell, URI
resolution is a single problem and, as such, should have a single
solution. Now, I do appreciate there are nuances involved, but
allowing for them (which I'll discuss later) there should be a single
URI resolution mechanism.

The problem I'm trying to tackle here is: how do we sandbox FOP's file
access? With the current implementation, that isn't possible. There's
too much contingency, i.e. if this resolved URI doesn't exist, check
this one. As I've said, in the cloud we have to be very strict: we
cannot allow one user to gain access (intentionally or otherwise) to
another user's data.
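
To make "strict" a bit more concrete, here is a rough sketch of a
resolve-once-then-accept-or-reject check with no fallback chain. The
class name SandboxCheck and the example paths are made up for
illustration; this is not FOP code, and a real sandbox would also have
to think about symlinks and the like.

```java
import java.net.URI;
import java.util.Objects;

/** Hypothetical helper, not part of FOP: keeps resolved URIs inside one root. */
final class SandboxCheck {
    private final URI root;

    SandboxCheck(URI root) {
        String path = root.getPath();
        // The trailing slash matters: without it "/data/user-a" would also
        // match "/data/user-abc" in the prefix test below.
        if (path == null || !path.endsWith("/")) {
            throw new IllegalArgumentException("Root must be a hierarchical URI ending in '/'");
        }
        this.root = root.normalize();
    }

    /** Resolves href against the root; throws if the result leaves the root. */
    URI resolveStrictly(String href) {
        // URI.resolve also collapses "." and ".." segments where it can.
        URI resolved = root.resolve(href).normalize();
        String path = resolved.getPath(); // decoded; null for opaque URIs
        boolean inside = path != null
                && Objects.equals(root.getScheme(), resolved.getScheme())
                && Objects.equals(root.getAuthority(), resolved.getAuthority())
                && path.startsWith(root.getPath())
                && !hasParentSegment(path);
        if (!inside) {
            throw new SecurityException("URI escapes the sandbox: " + href);
        }
        return resolved;
    }

    // Catches ".." segments that only show up after percent-decoding.
    private static boolean hasParentSegment(String path) {
        for (String segment : path.split("/")) {
            if (segment.equals("..")) {
                return true;
            }
        }
        return false;
    }
}
```

With a root of file:///data/user-a/, a reference such as
images/logo.png is accepted, while ../user-b/secret.xml is rejected
outright instead of triggering a second lookup somewhere else.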

<snip/>

> I think it can be useful to think about simplifying the use of
> URIResolver to a new interface for resource resolution where only
> InputStreams are required. IMO, it should then still be possible to use
> a URIResolver to resolve those URIs. An adapter for URIResolvers should
> be possible to write.

I spent quite some time deliberating on which approach would be best:
returning an InputStream or a Source object. The problem is, the only
time FOP actually reads XML is when parsing SVG. Even reading the FO
is done by the JAXP transformer. So I do appreciate the JAXP system is
tried and tested, but using that API isn't the best approach, IMO. The
reason being that every time we want to convert a Source object to an
InputStream, we need to re-write the code that does so, which is
non-trivial since that is where the URI is actually resolved. We could
cast Source to StreamSource, but that returns an InputStream anyway.
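
For reference, the kind of conversion code I mean looks roughly like
the following. It is only a sketch: SourceStreams is a made-up name,
and it ignores StreamSources that carry only a Reader as well as
DOMSource and SAXSource.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper, not FOP code: turns a JAXP Source into an InputStream. */
final class SourceStreams {
    private SourceStreams() {
    }

    static InputStream open(Source source) throws IOException {
        // Easy case: the Source already wraps a stream.
        if (source instanceof StreamSource) {
            InputStream in = ((StreamSource) source).getInputStream();
            if (in != null) {
                return in;
            }
        }
        // Otherwise fall back to the system ID. This is the point where
        // the URI actually gets resolved and opened.
        String systemId = source.getSystemId();
        if (systemId == null) {
            throw new IOException("Source carries neither a stream nor a system ID");
        }
        return URI.create(systemId).toURL().openStream();
    }
}
```

Every caller that needs bytes ends up repeating some variant of this,
which is why handing out an InputStream directly looks simpler to me.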

<snip/>

> I get the impression that you're suggesting that only a single base URI
> (on the FopFactory?) is required. In the past, we've had to add multiple
> base URIs precisely because there isn't a single base URI. Some URIs
> need to be resolved relative to the input FO document (base URI on
> FOUserAgent.base). Or they need to be resolved relative to the XSLT
> stylesheet in use (images may or may not be stored next to the XSLT
> stylesheets). Fonts (FontManager.fontBase) and hyphenation patterns
> (FopFactory.hyphenBase) may be at a different location respectively. Or
> they could simply be relative to the main configuration file
> (FopFactory.base). Granted, that adds complexity but also flexibility
> for those who need it.

So here's the problem: whichever client calls FOP gives it a URI
resolver. All this resolver does is convert a URI to an InputStream;
it shouldn't need to hold any state (i.e. a base URI). Now, having all
these base URIs is going to get pretty confusing, no? I think all
that's needed is font-base and base (defined in the fop.xconf).
Without getting too much into the nuts and bolts, this is where the
wrapper comes in. The wrapper holds the state (defined by the user
when FopFactory is instantiated and/or in the fop.xconf), and it can
resolve against the base, giving the user-defined resolver an absolute
URI to read from.
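
A minimal sketch of that split, assuming the getInputStream(URI)
signature discussed in this thread. ResourceResolver and
BaseUriWrapper are placeholder names, not the final API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

/** Hypothetical: the stateless resolver supplied by the client. */
interface ResourceResolver {
    InputStream getInputStream(URI absoluteUri) throws IOException;
}

/** Hypothetical: the wrapper that holds the base URIs on FOP's side. */
final class BaseUriWrapper {
    private final ResourceResolver delegate;
    private final URI base;      // from the user or <base> in fop.xconf
    private final URI fontBase;  // from the user or <font-base> in fop.xconf

    BaseUriWrapper(ResourceResolver delegate, URI base, URI fontBase) {
        this.delegate = delegate;
        this.base = base;
        this.fontBase = fontBase;
    }

    /** General resources are resolved against the base URI. */
    InputStream getResource(URI uri) throws IOException {
        return delegate.getInputStream(base.resolve(uri));
    }

    /** Fonts are resolved against the font base URI. */
    InputStream getFont(URI uri) throws IOException {
        return delegate.getInputStream(fontBase.resolve(uri));
    }
}
```

The client's resolver only ever sees absolute URIs, so it needs no
knowledge of where the bases came from.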

This would allow users to define their own URIs and a single
resolution mechanism. Not only does this give the security of
sandboxing, it also allows the full flexibility of URIs to be
exploited. The user can define their own schemes, queries, etc., and
the resource being read doesn't even need to be on the file system. It
could be in a database, a remote resource, whatever, as long as it can
be resolved and converted to an InputStream.
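
As an illustration of a resource that never touches the file system,
here is a sketch of a resolver backed by an in-memory map. The
"tenant" scheme and the class name InMemoryResolver are invented for
this example.

```java
import java.io.ByteArrayInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical resolver for a made-up "tenant" scheme, backed by a map. */
final class InMemoryResolver {
    private final Map<URI, byte[]> store = new ConcurrentHashMap<>();

    /** Registers content under an absolute URI. */
    void put(URI uri, String content) {
        store.put(uri, content.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the stored bytes; anything outside the scheme is refused. */
    InputStream getInputStream(URI uri) throws IOException {
        if (!"tenant".equals(uri.getScheme())) {
            throw new IOException("Unsupported scheme: " + uri.getScheme());
        }
        byte[] data = store.get(uri);
        if (data == null) {
            throw new FileNotFoundException(uri.toString());
        }
        return new ByteArrayInputStream(data);
    }
}
```

A database or remote store could sit behind the same method; FOP would
only ever see the InputStream.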

>
> Looking at that, the signature "InputStream getInputStream(URI)" may be
> insufficient. Like in URIResolver, you may need to extend that to
> "InputStream getInputStream(URI resource, URI base)", so you can get the
> URI resolved against the applicable base URI of the context you're
> working in (fonts, hyph patterns, config files etc.). OTOH, we have some
> special resolution interfaces (like FontResolver) which don't have a
> base URI because it is implicit and handled by the caller. The various
> specialized resolver interfaces help decouple the various packages from
> neighbouring ones to reduce dependencies.

I think I've addressed most of these concerns above, but I believe the
user already defines a base with <font-base> or <base> in the
fop.xconf. So these should be used to resolve relative URIs. In terms
of decoupling, I don't think I could agree more. The fonts packages
especially are in dire need of some TLC, and extracting them to their
own module is what I've been pushing for.

However, let's be realistic here: as they stand, they're not a
library. There is far too much coupled to the rest of FOP, and giving
them a URI resolver isn't really adding much to the bindings. It's all
done in a single class. Also, because I plan on removing all the URI
Strings, it would probably actually help in making it more of a
library. The fonts library shouldn't have to care about URI
resolution. You should give it an InputStream and it should do what it
does.

> In this context, I find it suboptimal when there are dependencies on
> org.apache.fop.apps from packages like "fonts", "pdf" or "hyph" because
> they have the potential to be used independently from FOP. More than
> once did I have to adjust changes that caused the PDF library to have
> unnecessary dependencies into new packages (ex. the FOUserAgent which
> even from its name doesn't have anything to do with a basic PDF library).
> In this spirit, I like how Victor Mote took his FOP fork (FOray) apart
> into multiple modules with clearly defined dependencies. We've had
> discussions about doing similar things but have not come to a consensus
> which is why we still have the huge, scary (for newbies), single source
> tree. Having the renderers in separate subprojects could allow people to
> scale FOP down to the subsets they need. Only a few really need AFP but
> it adds a lot of byte code to fop.jar. Having done a lot of OSGi over the
> past years, I have come to appreciate smaller JARs (Bundles in OSGi talk).
> This approach forces better package design and management of
> dependencies.

I think we're in danger of violently agreeing with each other here. My
plan is to move the URI resolver to XGC; as such, there'll be a single
resolver for the whole project. There will be no superfluous
dependencies floating around.

> In Batik land, the build produces a number of subsystem JARs besides the
> "all-jar" which we bundle with FOP. Having worked on Batik, I found it a
> challenge to deal with one huge file tree producing multiple JARs while
> keeping the dependencies in order. It's a bit easier in FOP but not
> everyone pays attention to this.
>
> Personally, I'd still love to see FOP split up into: core, util, hyph,
> fonts, pdf, afp, pcl, etc. and getting the XGC support and other stuff
> for SVG into Batik. Related to this:
> http://wiki.apache.org/xmlgraphics/XmlGraphicsCommonComponents
>
> But I'm getting off course...

Off course maybe, but I like the direction! I absolutely agree.

> As for "OutputStream getOutputStream(URI)", I would put that into a
> separate interface since IMO it mixes concerns. The input side should be
> easy to integrate with URIResolver, but the output side will produce a
> problem here. Usually, you only need one or the other, but rarely both
> (I think the font cache is an example of the combined case). When
> generating pages as PNG, you have a special case where we have to pass
> in a file name from which other file names are derived to produce
> multiple files (PNG is strictly single-page). That already warrants a
> special interface for that purpose which could use the
> "getOutputStream(URI)" interface (standard functionality currently in
> MultiFileRenderingUtil).

I respectfully disagree. URI resolution should address just that,
resolving URIs into an interface so that FOP can read bits/bytes. The
resolution mechanism should be the same regardless of whether you're
reading or writing.
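
To show what I mean by "the same mechanism", here is a sketch in which
reads and writes share one private resolve step. The in-memory store
and the name InMemoryResourceAccess are just for illustration; the
point is only that both directions go through the same resolution.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical: one resolution step shared by the input and output side. */
final class InMemoryResourceAccess {
    private final URI base;
    private final Map<URI, byte[]> store = new HashMap<>();

    InMemoryResourceAccess(URI base) {
        this.base = base;
    }

    // The single place where URIs are resolved, for reads and writes alike.
    private URI resolve(URI uri) {
        return base.resolve(uri).normalize();
    }

    InputStream getInputStream(URI uri) throws IOException {
        byte[] data = store.get(resolve(uri));
        if (data == null) {
            throw new FileNotFoundException(uri.toString());
        }
        return new ByteArrayInputStream(data);
    }

    OutputStream getOutputStream(URI uri) {
        final URI target = resolve(uri);
        // The bytes become visible to readers once the stream is closed.
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                store.put(target, toByteArray());
            }
        };
    }
}
```

Whether the two methods live on one interface or two is a separate
question; what I care about is that there is no second resolution path
for output.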

> Finally, a few more words about Batik: Batik does not support
> URIResolvers or EntityResolvers which has hurt me more than once. You
> have to do tricks by either registering URL handlers or a
> ParsedURLProtocolHandler, both of which are registered in a static and
> otherwise inflexible fashion. Refactoring this would be a major task
> since, like URIResolvers in FOP, ParsedURL is used all over the Batik
> place. AbstractFOPImageElementBridge, for example, intercepts the
> ParsedURL (which should actually be considered a URI, not a URL) to load
> external images using the XGC image loader rather than Batik's own image
> support.

Yeah, I haven't actually looked at Batik yet. We'll cross that bridge
when we get to it. No doubt, it's going to be a barrel o' laughs. I
also haven't looked at XGC, which I do appreciate is something we need
to look at. My intention was to do this incrementally, addressing one
thing at a time, since it's all so sensitive to change.

> In the end, I'd like to ask you:
> - to pay attention to package dependencies (keeping them at a minimum)
> - to avoid reducing the chance that FOP may be split up into clean
> modules in the future
> - to minimize backwards-incompatible API changes (removing
> FopFactory.setURIResolver() should not be necessary, for example)
> - to keep the ability to do plain URI -> InputStream resolution using
> URIResolver somehow.
> - to preserve the ability to use DOMSource and SAXSource as image
> sources, i.e. to not change the XGC image loaders.

I think, aside from some minor API changes, we are doing all the above
apart from the last. I'll review XGC shortly and obviously put any
ideas in a public forum for further discussion.

<snip/>

Thanks a lot for raising your concerns here. I appreciate I may have
been a bit vague on the details on the wiki, but I've only just
started writing the actual code. I'll try to update the wiki with a
bit more information; my only worry is getting lost in minutiae.

Mehdi
