Hi Andy,

It all sounds good.  Do you have much information (or Marc if you're
listening) on how memory efficient POIFS is when reading?  Is there any room
for optimization at that layer?  Probably not since Marc wrote it. :-)

I guess anything that could get our object counts down would be a good
thing.  With a bit of care we might even be able to get a bit more speed
from things.

Regards,

Glen Stampoultzis  (TriNexus Pty Ltd)
Fixed:+61 3 9753-6850     Mob:+61 (0)402 835 458
ICQ:  62722370    EMail:  [EMAIL PROTECTED]
URL's:  http://jakarta.apache.org/poi, http://www.krysalis.org


> Hi Guys,
>
> Didn't plan to go over this yet but I may mention it in my talk on
> Monday and it seemed like I should do so
> here first.
>
> When I came up with the original object model for HSSF I knew it was
> very OOP, in fact I thought it was
> the OOP version of a fully normalized database.  And while I knew that
> would be memory intensive, I
> decided to run with it.  And for good reason.  IMHO, HSSF is really easy
> to debug and add to.
>
> However, HSSF's low level object model is *very* close to Excel's.  And
> for this reason, its not wonderfully
> efficient.  When I profiled HSSF back near the beginning of the year
> (BTW Happy Birthday to POI, its about
> a year old now), the largest object counts were:
>
> 1. CellValues
> 2. Tree Nodes
> 3. Strings
> 4. Chars (probably because of auto-detection)
> 5. bytes
>
> The CellValues are because every cell value is a record object in HSSF.
>  The treenodes are because we store a number of things in trees for fast
> lookups.  The Strings are generally due to the SSTRecord.  The Chars
> seemed to be mostly durring encoding autodetection (they'd spike), and
> the bytes are because we have the original file in memory as a byte
> array, then have to create the new file as a byte array.  If the two
> overlap...a spike occurs.
>
> For HSSF 2.0 we should continue to focus on features, however by midway
> into the 3.0 cycle we'll have most of them flushed out.  (3.0 will
> probably focus on graphics and the such).  For HSSF 3.0 we should start
> focusing on bringing our memory footprint down to size.  We're not the
> only ones with this problem, the same problem exists for PDF apis, etc.
>
> I think we can do this:
>
> 1. Abandon holding CellValues at runtime, We'll still need "row objects"
> that hold these, but it can do so as some manner of parallel array.
>  Meaning it holds the values in an array of doubles.  It holds the type
> in an array of ints and the style int array as well.  For formulas
> another array of strings with the style array to tell it where to look.
>
> 2. Based on what we know we can calculate the size of the resulting
> record without actually having an instance.
>
> 3. At serialization, we'll use a method version of the record to
> serialize the rows.
>
> We'll still have a large chunk of memory taken up by the row objects
> (even primitives take up memory), but by my estimation it will be
> substantially smaller.
>
> RowObject {
>   private double[] data;          // 10.0, 0.0, 100.0, 1.0
>   private int[] type;                // 1, 3, 1, 2            - 3 is for
> formula, 2 = string, 1 = number
>   private int[] style;                // indicies to style records 1-1
basis
>   private String[] formula;      // "SUM(blabla)"
> }
>
> data array contains the elements in order on a 1-1 bases with the cells.
> the type array contains the element types (formula, string, etc) on a
> 1-1 basis with cells
> the style array contians the index to the HSSFCellStyle on a 1-1 basis
> the formula array contains formulas on a 1-1 basis with the set of
> formula-type cells.  So in this case the first formula is the second
> index for type and data but matches to the first formula element.
>
> This structure would also allow us to take advantage of MULRK, MULBLANK
> and other optimization cell types.  It would make implementation of
> String formulas much cleaner (though the above needs some more thought
> to support that).  And I'm betting this would reduce our memory
> consumption by 50-75%.
>
> But this is way way more cryptic.  So it should wait until we've got our
> full feature set.
>
> Thoughts?
>
> -Andy
>
>
> --
> To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
>
>



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Reply via email to