Re: Improving POIFS performance

Chris Nokleberg Fri, 25 Jul 2003 09:44:00 -0700

On Fri, Jul 25, 2003 at 08:12:32AM -0400, Andrew C. Oliver wrote:
> Excellent!  This is exactly what I was looking for.  Now can you write up
> some more detail about the interface?  Take a look at
> org.apache.poi.hssf.usermodel.HSSFWorkbook.  You can find the complete
> coupling with POIFS.  Its like 5 lines of code or so.  What would change if
> we coupled it with POIFS2 just enough to make the two work together?


The basic structure is
  Document                            <=> POIFSFileSystem
  abstract PropertyStorage            <=> Entry
  Directory extends PropertyStorage   <=> DirectoryEntry
  Stream extends PropertyStorage      <=> DocumentEntry

Throughout the API a "Seekable" interface is used instead of an
InputStream. As I've mentioned before, it is essentially a combination
of all InputStream and DataInput methods, with a few additional bits for
random access and endianness:

  public interface Seekable
  extends DataInput
  {
      public static final int BIG_ENDIAN = 0;
      public static final int LITTLE_ENDIAN = 1;

      // InputStream methods
      public int available() throws IOException;
      public void close() throws IOException;
      public int read() throws IOException;
      public int read(byte[] b) throws IOException;
      public int read(byte[] b, int off, int len) throws IOException;
      public long skip(long n) throws IOException;
      public void mark(int readlimit);
      public boolean markSupported();
      public void reset() throws IOException;

      // Random access support
      public void seek(long n) throws IOException;
      public void seek(long n, boolean relative) throws IOException;
      public long position() throws IOException;
      public long size() throws IOException;

      // Endianness
      public int order();
      public void order(int order);

      // Method to complete DataInput's set of unsigned methods
      public long readUnsignedInt() throws IOException;
  }

Then there are a bunch of implementations:
  SeekableInputStream
  SeekableFile
  SeekableFileChannel
  SeekableByteArray
  BufferedSeekable
  etc.
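
To make the random-access and endianness parts of the interface concrete,
here is a minimal sketch of a byte-array-backed reader. The class name
ByteArraySeekableSketch is made up for illustration and this is not the
actual SeekableByteArray; it covers only seek, position, size, order, and
readUnsignedInt, and it assumes that a relative seek is measured from the
current position.

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical, cut-down illustration; not the real SeekableByteArray.
class ByteArraySeekableSketch {
    static final int BIG_ENDIAN = 0;
    static final int LITTLE_ENDIAN = 1;

    private final byte[] data;
    private int pos = 0;
    private int order = BIG_ENDIAN; // DataInput itself reads big-endian

    ByteArraySeekableSketch(byte[] data) {
        this.data = data;
    }

    void seek(long n) throws IOException {
        if (n < 0 || n > data.length) {
            throw new IOException("seek out of range: " + n);
        }
        pos = (int) n;
    }

    // Assumption: "relative" means relative to the current position.
    void seek(long n, boolean relative) throws IOException {
        seek(relative ? pos + n : n);
    }

    long position() { return pos; }
    long size() { return data.length; }
    int order() { return order; }
    void order(int order) { this.order = order; }

    // Reads four bytes as an unsigned 32-bit value in the current byte order.
    long readUnsignedInt() throws IOException {
        if (pos + 4 > data.length) {
            throw new EOFException();
        }
        long b0 = data[pos] & 0xFF;
        long b1 = data[pos + 1] & 0xFF;
        long b2 = data[pos + 2] & 0xFF;
        long b3 = data[pos + 3] & 0xFF;
        pos += 4;
        return order == LITTLE_ENDIAN
                ? (b3 << 24) | (b2 << 16) | (b1 << 8) | b0
                : (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
    }
}
```

OLE files are little-endian, so a caller would typically set
order(LITTLE_ENDIAN) once and then read header fields with
readUnsignedInt() after seeking to their offsets.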

As you might expect, SeekableInputStream needs to buffer the entire
InputStream to support the Seekable interface. It is a little bit smart
in that it will only read as far as you have seeked, but in general
you'll be better off using something that natively supports random
access.
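
The "only read as far as you have seeked" behaviour can be sketched
roughly as follows. This is an illustration under assumptions, not the
actual SeekableInputStream: the class name LazyBufferingSketch, the
int-based offsets, and the bytesBuffered() accessor are all invented for
the example.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch; the real SeekableInputStream may work differently.
class LazyBufferingSketch {
    private final InputStream in;
    private byte[] buf = new byte[16];
    private int filled = 0;      // bytes pulled from the source so far
    private int pos = 0;         // current read position
    private boolean eof = false;

    LazyBufferingSketch(InputStream in) {
        this.in = in;
    }

    // Pulls from the source until `upTo` bytes are buffered or EOF is hit.
    private void fill(int upTo) throws IOException {
        while (!eof && filled < upTo) {
            if (filled == buf.length) {
                buf = Arrays.copyOf(buf, buf.length * 2);
            }
            int want = Math.min(buf.length - filled, upTo - filled);
            int n = in.read(buf, filled, want);
            if (n < 0) {
                eof = true;
            } else {
                filled += n;
            }
        }
    }

    void seek(int n) throws IOException {
        if (n < 0) {
            throw new IOException("negative seek: " + n);
        }
        fill(n);
        if (n > filled) {
            throw new EOFException("seek past end of stream");
        }
        pos = n;
    }

    // Returns the next byte as 0..255, or -1 at end of stream.
    int read() throws IOException {
        fill(pos + 1);
        return pos < filled ? buf[pos++] & 0xFF : -1;
    }

    int bytesBuffered() {
        return filled;
    }
}
```

Seeking backwards costs nothing because the bytes are already buffered,
but everything up to the furthest position ever visited stays in memory,
which is why a source with native random access is preferable.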

Seekable is used as the argument to the Document constructor, and
each individual entry in the OLE structure (Stream) also makes its data
available as a Seekable. Currently HSSFWorkbook just uses an
InputStream, but to reap the full benefits it should eventually
transition to using a Seekable as well (leaving as much unbuffered as
possible).

Now I'll walk through the diff (against the release 2 branch).

Changes in imports:

  - import org.apache.poi.poifs.filesystem.DirectoryEntry;
  - import org.apache.poi.poifs.filesystem.DocumentEntry;
  - import org.apache.poi.poifs.filesystem.DocumentInputStream;
  - import org.apache.poi.poifs.filesystem.Entry;
  - import org.apache.poi.poifs.filesystem.POIFSFileSystem;

  + import com.tonicsystems.generic.io.SeekableInputStream;
  + import com.tonicsystems.generic.ole.Document;
  + import com.tonicsystems.generic.ole.Stream;
  + import com.tonicsystems.generic.ole.StreamWriter;

Find the document/stream named "Workbook" and get an InputStream:

  - InputStream stream = fs.createDocumentInputStream("Workbook");
  + InputStream stream = fs.findStream("Workbook").getInputStream();

New Documents take an object implementing Seekable as an argument. Here
we pass it a SeekableInputStream:

  - this(new POIFSFileSystem(s), preserveNodes);
  + this(new Document(new SeekableInputStream(s)), preserveNodes);

The write method is the most changed. In the existing version, a new
document is always created, and then if preserveNodes is set all the
nodes other than "Workbook" are copied over:

      public void write(OutputStream stream)
              throws IOException
      {
!         byte[] bytes = getBytes();
!         POIFSFileSystem fs = new POIFSFileSystem();
!       
!         fs.createDocument(new ByteArrayInputStream(bytes), "Workbook");
! 
!         if (preserveNodes) { 
!             List excepts = new ArrayList(1);
!             excepts.add("Workbook");
!             copyNodes(this.poifs,fs,excepts);
          }
!         fs.writeFilesystem(stream);
      }

In the patched version, the existing Document object is used (if there
is one), and the contents of the "Workbook" stream are simply replaced
with the new data. As a result the copyNodes, isInList, and
copyNodeRecursively methods disappear.

There are two methods to set the data of a Stream. One is to call
setData(Seekable). The other is to call setStreamWriter(int size,
StreamWriter), which is what is used here. Essentially it is a callback
that will "ask" for you to spit out the stream data at the appropriate
time. You must supply the size ahead of time so that it can do all the
necessary bookkeeping:

      public void write(OutputStream stream)
              throws IOException
      {
!         final byte[] bytes = getBytes();
! 
!         Document fs = poifs;
!         if (fs == null || !preserveNodes) {
!             fs = new Document();
!             fs.getRoot().add(new Stream("Workbook"));
          }
! 
!         fs.findStream("Workbook").setStreamWriter(bytes.length, new StreamWriter() {
!                 public void writeStream(OutputStream os) throws IOException {
!                     os.write(bytes);
!                 }
!             });
!                                                      
!         fs.write(stream);
      }

You could improve memory use here by not building up the byte[] array
ahead of time. Instead, just calculate the size, and then in the
StreamWriter spit out each sheet on the fly.
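
A rough sketch of that idea, assuming each piece of output can report
its serialized size up front. StreamWriterSketch is a stand-in that
mirrors the callback shape described above, and SizedPart and
OnTheFlyWriterSketch are hypothetical names; none of them are the real
classes.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

// Stand-in mirroring the callback shape described above; not the real class.
interface StreamWriterSketch {
    void writeStream(OutputStream os) throws IOException;
}

// Hypothetical unit of output (for example one sheet) that knows its size.
interface SizedPart {
    int size();
    void writeTo(OutputStream os) throws IOException;
}

class OnTheFlyWriterSketch {
    // Sums the sizes without serializing anything.
    static int totalSize(List<SizedPart> parts) {
        int total = 0;
        for (SizedPart p : parts) {
            total += p.size();
        }
        return total;
    }

    // Builds a callback that writes each part only when asked to.
    static StreamWriterSketch writerFor(final List<SizedPart> parts) {
        return new StreamWriterSketch() {
            public void writeStream(OutputStream os) throws IOException {
                for (SizedPart p : parts) {
                    p.writeTo(os);
                }
            }
        };
    }
}
```

The value of totalSize(parts) would be passed as the size argument and
writerFor(parts) as the callback, so no full byte[] copy of the workbook
is ever held. The sizes reported must match the bytes actually written,
since the bookkeeping is done from the declared size.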

> What would change to make it fully take advantage of POIFS2?

Mostly just making HSSF take a Seekable instead of an InputStream, and
only read in the bits that you actually need. I'm sure there are some
pieces (images, etc.) that you don't need to slurp in--instead, just
mark their position in the Seekable and transfer the bytes when it
actually comes time to output the document.
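
As an illustration of "mark the position and transfer later", the
sketch below records only an offset and a length, then copies that
region in small chunks at write time. It uses java.io.RandomAccessFile
as a stand-in for a Seekable, and DeferredRegionSketch is a hypothetical
name, not part of either library.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Hypothetical sketch: remembers where a blob lives instead of loading it.
class DeferredRegionSketch {
    private final long offset;
    private final long length;

    DeferredRegionSketch(long offset, long length) {
        this.offset = offset;
        this.length = length;
    }

    // Copies the recorded region from the source to the output in chunks.
    void transferTo(RandomAccessFile source, OutputStream out) throws IOException {
        byte[] chunk = new byte[8192];
        long remaining = length;
        source.seek(offset);
        while (remaining > 0) {
            int want = (int) Math.min(chunk.length, remaining);
            int n = source.read(chunk, 0, want);
            if (n < 0) {
                throw new EOFException("source shorter than recorded region");
            }
            out.write(chunk, 0, n);
            remaining -= n;
        }
    }
}
```

Memory use stays bounded by the chunk size regardless of how large the
embedded object is, at the cost of keeping the source open until the
document has been written.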

> How is POIFS2 in terms of JavaDoc?  Unit Tests?

Could be better :-)

Chris
