Re: Best Interface for reading OpenType Files

Jeremias Maerki Thu, 24 Sep 2009 12:06:36 -0700

On 24.09.2009 17:53:29 Alexander Kiel wrote:
> Hi,
> 
> I'm currently thinking about the interface to use for reading OpenType
> files.
> 
> There are two possibilities:
> 
>  - reading on top of an InputStream or
>  - reading on top of a RandomAccessFile or FileChannel.
> 
> Currently the implementation in FOP uses the class FontFileReader which
> expects an InputStream. But it immediately calls IOUtils.toByteArray(in)
> and works on that byte array instead. So it needs to hold the file
> completely in memory.


Right, and that accounts for a pretty large portion of FOP's memory
consumption problem nowadays. With the use of OpenType fonts, this gets
worse as they can be quite big. I'm glad you noticed that.

> FontBox which is part of PDFBox uses some abstract class called
> TTFDataStream with template methods which has two implementations, one
> called RAFDataStream which operates on top of a RandomAccessFile and one
> called MemoryTTFDataStream which operates on top of a byte array.

So if you access the font via a URL that is not a file URL, you still
get a memory problem.

> I started using pure InputStreams. That means I implemented the whole
> OpenType file reading using a hierarchy of FilterInputStreams. At the
> lowest level I have a DataInputStream which takes any InputStream and
> provides methods to read the basic data types of OpenType just like
> java.io.DataInputStream does for Java data types. On top of that, I have
> streams that can read some small scale data structures, then streams
> which can read whole tables and finally a stream which can read the
> whole OpenType file.

Yeah, that's the ideal world.

> To read an OpenType file, all you have to write is:
> 
>     InputStream in = ...
>     OpenTypeFileInputStream otfIn = new OpenTypeFileInputStream(in);
>     OpenTypeFile otf = otfIn.readOpenTypeFile();
> 
> In my opinion this system works really well. You can take any
> InputStream, the reading is decoupled from the OpenType classes
> themselves and you can test pieces of OpenType structure using only the
> individual streams.
> 
> But! My approach has one flaw. I need to seek extensively while reading
> an OpenType file. The whole file format consists of headers with offsets
> and data structures which one has to read from those offsets.
> 
> To get this seeking to work with streams, I use mark(), reset() and
> skip(). My common approach at the beginning of such a structure is to
> mark, then read the header and for every part, reset to the start, mark
> again, skip to the offset and read the part.
> 
> But with this approach I end up holding the whole file in memory.
> 
> To make it worse, this mark(), reset(), skip() interface doesn't support
> hierarchical marking. If I seek inside smaller scale structures the mark
> position of the larger scale structure is overwritten. I don't think
> that it is possible to build hierarchical mark support on top of any
> markable InputStream. (Oh look, I did it later as I wrote this longish
> mail.) I think one has to reimplement BufferedInputStream holding one's
> own byte array. In fact I did this on top of ByteArrayInputStream. The
> key problem is that one can't get a position out of an InputStream, which
> is not surprising, as the concept of streams doesn't have a position.

May I suggest using ImageIO's ImageInputStream? That already has an
implementation that buffers the stream in a temporary file (if allowed)
so you basically have random access. I've used that extensively in the
image loading framework in XML Graphics Commons and it seems to be
ideal for what you need to do. You even get the hierarchical mark/reset.
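
To illustrate the hierarchical mark/reset, here is a minimal sketch. It
assumes an in-memory source wrapped in MemoryCacheImageInputStream; the
names NestedMarkDemo and peekUInt16 are made up for this example and are
not part of FOP or XML Graphics Commons.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

class NestedMarkDemo {

    /**
     * Hypothetical helper: reads a big-endian uint16 located 'offset' bytes
     * after the current position, then returns to the current position.
     */
    static int peekUInt16(ImageInputStream in, long offset) throws IOException {
        in.mark();                  // pushes the current position onto a stack
        try {
            in.skipBytes(offset);
            return in.readUnsignedShort();
        } finally {
            in.reset();             // pops back to the position marked above
        }
    }

    static int[] demo() {
        byte[] data = {0, 1, 0, 2, 0, 3, 0, 4};
        try (ImageInputStream in =
                new MemoryCacheImageInputStream(new ByteArrayInputStream(data))) {
            in.mark();                            // outer mark at position 0
            int first = in.readUnsignedShort();   // position is now 2
            int peeked = peekUInt16(in, 4);       // inner mark/reset, reads at 6
            int second = in.readUnsignedShort();  // continues at position 2
            in.reset();                           // outer mark is still intact
            int again = in.readUnsignedShort();   // reads at position 0 again
            return new int[] {first, peeked, second, again};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The inner mark()/reset() pair in peekUInt16 does not disturb the outer
mark, which is the behaviour plain java.io.InputStream lacks.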

> It is possible to read the parts in offset order. But there are
> duplicated offsets (more than one offset pointing to the same part) and
> parts that have to go into an array in a semantic order which doesn't
> have to be the offset order. So I have to first reorder the offsets to
> read the parts in offset order and then I have to reorder the read parts
> again to get them back into the semantic order. That said - it is still
> possible that the offsets are in fact in the semantic order of the
> parts, but the spec doesn't say this.
> 
> I don't want to depend on RandomAccessFile or FileChannel, because I
> need to be able to test reading of substructures out of byte arrays.

Good decision IMO.

> What I need is an interface from which I can read bytes and which allows
> multiple relative seeks. By multiple relative seeks I mean something
> like multiple marks. As I wrote this, I implemented such a thing inside
> my DataInputStream. There is now a method:
> 
>     public SkipHandle mark();
> 
> and the SkipHandle class looks like this:
> 
>     public class SkipHandle {
>         
>         private final long relativePos;
> 
>         public void skipTo(long offset);
>     }
> 
> SkipHandle is a non-static inner class of DataInputStream.
> DataInputStream counts the bytes read and skipped to get an idea of its
> actual position. The SkipHandle gets the actual stream position on
> creation so that it is able to skip on DataInputStream relative to its
> creation position. If the skip would be negative, SkipHandle resets the
> whole stream to the start (on creation of DataInputStream, a normal mark
> is set) and skips afterwards.
> 
> It works, but I find it a little bit ugly. First I have to set a
> mark(Integer.MAX_VALUE) on DataInputStream creation, because I always
> want to be able to reset the whole stream, but I don't have any
> information about how many bytes are on the road. Then I have to disable
> markSupport on my DataInputStream so that nobody kills my own mark.
> 
> But the biggest problem is that DataInputStream now has a non-standard
> mark(), skipTo() API. It's not like a normal FilterInputStream anymore.
> You can't use normal marking, because it's disabled and you have to
> learn this new API instead.
> 
> Streams simply aren't the right API for reading stuff like OpenType
> files which require massive seeking. But all the seekable APIs are
> tied to files.
> 
> The TTFDataStream API of FontBox is completely custom. I would like to
> avoid such things. 
> 
> So I simply don't know a standard Java API which allows byte reading and
> seeking over an arbitrary source and throws IOExceptions on its methods.
> What about NIO? I don't see any skipping or seeking on channels.

I don't think NIO will help much here. I'd really suggest
ImageInputStream which should have everything you need. You can probably
even reuse some utility code I've written for the image loading
framework:
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/util/ImageUtil.java?view=markup
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/util/ImageInputStreamAdapter.java?view=markup
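
For the relative seeks you describe, getStreamPosition() and seek() may
already be enough. The following is only a sketch: the table layout
(a count followed by offsets relative to the start of the structure) is
invented for illustration and is not an actual OpenType table, and
OffsetTableDemo and readParts are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

class OffsetTableDemo {

    /**
     * Hypothetical layout: uint16 count, then 'count' uint16 offsets relative
     * to the start of the structure, each pointing at a single uint16 value.
     * Parts are returned in semantic (header) order, not in offset order.
     */
    static int[] readParts(ImageInputStream in) throws IOException {
        long base = in.getStreamPosition();   // remembered start of structure
        int count = in.readUnsignedShort();
        int[] offsets = new int[count];
        for (int i = 0; i < count; i++) {
            offsets[i] = in.readUnsignedShort();
        }
        int[] parts = new int[count];
        for (int i = 0; i < count; i++) {
            in.seek(base + offsets[i]);       // backward seeks are allowed
            parts[i] = in.readUnsignedShort();
        }
        return parts;
    }

    static int[] demo() {
        byte[] data = {
            0, 3,     // count = 3
            0, 10,    // part 0 lives at offset 10
            0, 8,     // part 1 lives at offset 8, before part 0
            0, 10,    // part 2 shares the offset of part 0
            0, 7,     // value stored at offset 8
            0, 9      // value stored at offset 10
        };
        try (ImageInputStream in =
                new MemoryCacheImageInputStream(new ByteArrayInputStream(data))) {
            return readParts(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because seek() takes an absolute position, duplicated offsets and
offsets that are not in file order need no reordering step.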

The following class has some code to get an ImageInputStream from a URI.
If it's a file URL it tries to get an ImageInputStream with random
access. In all other cases, the content is buffered by ImageIO's default
buffering implementations (depending on the settings).
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/image/loader/impl/AbstractImageSessionContext.java?view=markup
That could even be extracted to be useful to you.

See also: http://java.sun.com/j2se/1.4.2/docs/api/javax/imageio/ImageIO.html
(methods setUseCache() and setCacheDirectory())
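
As a rough sketch of how those settings come into play when wrapping an
arbitrary InputStream (CacheSettingsDemo and open are hypothetical names;
note that setUseCache() changes a shared ImageIO setting, not a per-stream
one):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;
import javax.imageio.stream.ImageInputStream;

class CacheSettingsDemo {

    /** Hypothetical helper: wraps any InputStream in a seekable stream. */
    static ImageInputStream open(InputStream in, boolean useDiskCache)
            throws IOException {
        // true: ImageIO may buffer in a temporary file; false: in memory.
        ImageIO.setUseCache(useDiskCache);
        // ImageIO.setCacheDirectory(dir) would choose where the temporary
        // file goes; leaving it unset uses the system default.
        ImageInputStream iis = ImageIO.createImageInputStream(in);
        if (iis == null) {
            throw new IOException("No ImageInputStream available for "
                    + in.getClass().getName());
        }
        return iis;
    }

    static long demo() {
        byte[] data = {0, 1, 0, 0};   // four bytes read as one uint32
        try (ImageInputStream iis =
                open(new ByteArrayInputStream(data), false)) {
            iis.skipBytes(2);         // move forward first ...
            iis.seek(0);              // ... then jump back to the start
            return iis.readUnsignedInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```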

> Any idea is welcome.
> 
> 
> Best Regards
> Alex
>  
> -  
> e-mail: alexanderk...@gmx.net
> web:    www.alexanderkiel.net
> 




Jeremias Maerki
