Hi Guys,

I didn't plan to go over this yet, but I may mention it in my talk on
Monday, so it seemed like I should do so here first.

When I came up with the original object model for HSSF I knew it was
very OOP; in fact, I thought of it as the OOP version of a fully
normalized database.  And while I knew that would be memory intensive, I
decided to run with it, and for good reason: IMHO, HSSF is really easy
to debug and add to.

However, HSSF's low-level object model is *very* close to Excel's, and
for this reason it's not wonderfully efficient.  When I profiled HSSF
back near the beginning of the year (BTW, happy birthday to POI, it's
about a year old now), the largest object counts were:

1. CellValues
2. Tree Nodes
3. Strings
4. Chars (probably because of auto-detection)
5. bytes

The CellValues are because every cell value is a record object in HSSF.
The tree nodes are because we store a number of things in trees for fast
lookups.  The Strings are generally due to the SSTRecord.  The Chars
seemed to spike mostly during encoding auto-detection, and the bytes are
because we have the original file in memory as a byte array, then have
to create the new file as a byte array.  If the two overlap... a spike
occurs.

For HSSF 2.0 we should continue to focus on features; however, by midway
into the 3.0 cycle we'll have most of them fleshed out (3.0 will
probably focus on graphics and such).  For HSSF 3.0 we should start
focusing on bringing our memory footprint down to size.  We're not the
only ones with this problem; the same problem exists for PDF APIs, etc.

I think we can do this:

1. Abandon holding CellValues at runtime.  We'll still need "row objects"
that hold these values, but they can do so as some manner of parallel
arrays: the values in an array of doubles, the types in an array of
ints, and the styles in an int array as well.  For formulas, another
array of Strings, with the type array telling us where to look.

2. Based on what we know, we can calculate the size of the resulting
record without actually having an instance.

3. At serialization, we'll use a method version of the record to 
serialize the rows.
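To make point 2 concrete, here's a minimal sketch of computing a row's serialized size purely from the type array, without building any record objects.  The class name, the type codes, and the byte sizes (a 4-byte record header, a 14-byte NUMBER body, a 6-byte BLANK body) are illustrative assumptions, not actual HSSF API or verified BIFF figures:

```java
// Hypothetical size calculator: given per-cell type codes, compute the
// serialized size of a row's cell records without instantiating them.
public class RowSizeCalculator {
    // Assumed BIFF sizes: 4-byte header plus a fixed-size body per record.
    private static final int HEADER      = 4;
    private static final int NUMBER_BODY = 14; // row, col, xf, 8-byte double
    private static final int BLANK_BODY  = 6;  // row, col, xf

    // Illustrative type codes: 1 = number, 0 = blank.
    public static int sizeOf(int[] types) {
        int size = 0;
        for (int t : types) {
            switch (t) {
                case 1: size += HEADER + NUMBER_BODY; break;
                case 0: size += HEADER + BLANK_BODY;  break;
                default: throw new IllegalArgumentException("type " + t);
            }
        }
        return size;
    }
}
```

So a row of two numbers and a blank would come out to 18 + 18 + 10 = 46 bytes under these assumed sizes, computed from the type array alone.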

We'll still have a large chunk of memory taken up by the row objects 
(even primitives take up memory), but by my estimation it will be 
substantially smaller.

RowObject {
  private double[] data;     // 10.0, 0.0, 100.0, 1.0
  private int[] type;        // 1, 3, 1, 2  (1 = number, 2 = string, 3 = formula)
  private int[] style;       // indices into style records, 1-1 basis
  private String[] formula;  // "SUM(blabla)"
}

The data array contains the elements in order on a 1-1 basis with the
cells.  The type array contains the element types (formula, string,
etc.) on a 1-1 basis with the cells.  The style array contains the index
to the HSSFCellStyle on a 1-1 basis.  The formula array contains
formulas on a 1-1 basis with the set of formula-type cells.  So in this
case the first formula is at the second index for type and data but
matches the first formula element.
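That last mapping (a dense formula array over only the formula-type cells) is the subtle part, so here's a runnable sketch of it.  The class and method names are made up for illustration; they're not proposed HSSF API.  Type codes follow the RowObject example: 1 = number, 2 = string, 3 = formula:

```java
// Sketch of the proposed parallel-array row.  The formula array holds one
// entry per formula-type cell, in cell order, so finding the formula for
// cell i means counting the formula cells that precede it.
public class SketchRow {
    private final double[] data;    // cell values, 1-1 with cells
    private final int[]    type;    // 1 = number, 2 = string, 3 = formula
    private final int[]    style;   // style record indices, 1-1 with cells
    private final String[] formula; // dense: one entry per formula cell

    public SketchRow(double[] data, int[] type, int[] style, String[] formula) {
        this.data = data;
        this.type = type;
        this.style = style;
        this.formula = formula;
    }

    // Returns the formula text for a cell, or null for non-formula cells.
    public String formulaAt(int cell) {
        if (type[cell] != 3) return null;
        int n = 0;
        for (int i = 0; i < cell; i++) {
            if (type[i] == 3) n++;   // count formula cells before this one
        }
        return formula[n];
    }

    public double valueAt(int cell) { return data[cell]; }
}
```

With the example data above (data {10.0, 0.0, 100.0, 1.0}, type {1, 3, 1, 2}, formula {"SUM(blabla)"}), formulaAt(1) walks zero preceding formula cells and returns the first formula entry.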

This structure would also allow us to take advantage of MULRK, MULBLANK
and other optimized cell record types.  It would make the implementation
of string formulas much cleaner (though the above needs some more
thought to support that).  And I'm betting this would reduce our memory
consumption by 50-75%.

But this is way, way more cryptic, so it should wait until we've got our
full feature set.

Thoughts?

-Andy

