https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #1 from Javen O'Neal <one...@apache.org> ---
Out of curiosity, how many styles are you creating to get the run times you
reported (several minutes versus 10 seconds)? Are any of the styles identical? I ask
because most workbooks I create use under 100 styles. Unless your workbook is a
styles demo that shows every permutation of font, data format, border,
background and foreground color, it seems difficult to get a style count high
enough in most applications where there wouldn't be duplicates.

I've rolled my own code to manage styles so I can avoid creating millions of
styles. Usually I want to change the data format of an existing cell without
affecting other cells that use the same style. Rather than cloning the cell's
style (that's how you get millions of styles), I temporarily change the cell
style's data format, search the style table for another style that
matches, and if one exists point the cell's style reference at the match; otherwise I
clone the style. Finally, I revert my data format change on the original style.
Because I apply style changes to thousands of cells, I have some extra data
structures to make the style lookup process faster than linearly searching the
style table.
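
A minimal sketch of the change-search-revert workflow described above. It does
not use the POI API: SimpleStyle and StyleReuse are hypothetical stand-ins, a
style is reduced to two properties, and the lookup is a plain linear search
rather than the faster lookup structures mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mutable style, standing in for a workbook cell style.
final class SimpleStyle {
    String font;
    String dataFormat;

    SimpleStyle(String font, String dataFormat) {
        this.font = font;
        this.dataFormat = dataFormat;
    }

    boolean sameAs(SimpleStyle other) {
        return font.equals(other.font) && dataFormat.equals(other.dataFormat);
    }
}

final class StyleReuse {
    // Returns the index of a style equal to styles.get(current) except for
    // its data format. A clone is appended only when no match exists.
    static int withDataFormat(List<SimpleStyle> styles, int current, String newFormat) {
        SimpleStyle original = styles.get(current);
        String savedFormat = original.dataFormat;
        if (savedFormat.equals(newFormat)) {
            return current;                       // nothing to change
        }
        original.dataFormat = newFormat;          // temporary change
        try {
            for (int i = 0; i < styles.size(); i++) {
                if (i != current && styles.get(i).sameAs(original)) {
                    return i;                     // reuse the existing match
                }
            }
            // No match: clone the temporarily modified style.
            styles.add(new SimpleStyle(original.font, original.dataFormat));
            return styles.size() - 1;
        } finally {
            original.dataFormat = savedFormat;    // revert the original style
        }
    }
}
```

Calling withDataFormat twice with the same arguments returns the same index, so
the table grows by at most one style per distinct combination.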

I mention this because it may solve your problem if you don't really need 1
million styles.

POI could benefit from a way to consolidate duplicate styles, either when the
workbook is written to disk, or through an explicit call.

Glancing at your patch, it looks like your change adds some data structures.
The consequences are:
1) higher memory consumption
2) extra processing to keep multiple data structures updated
3) potential for the data structures to get out of sync, especially considering
projects that subclass POI.

My recommendation is to use a single data structure as the container for cell
styles, one that combines the features you need and gives fast by-index
and by-style lookup. If such a data structure isn't available off the shelf,
you may want to write your own. Most trivially, this is just a class that
contains an ArrayList and a HashMap for inverted array lookups, where any call to
the hybrid data structure updates both underlying data structures.
Encapsulating the complexity is the key to solving my 3rd concern, and makes
#2 and #1 easier to solve down the road.
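
As a rough illustration of such a hybrid container, here is a sketch in plain
Java. IndexedList is a hypothetical name, not a POI class, and it assumes the
element type has stable equals() and hashCode() implementations and is not
mutated after being added.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hybrid container: by-index lookup through the list and
// by-value lookup through the map. Both are private, so callers cannot
// update one without the other.
final class IndexedList<T> {
    private final List<T> items = new ArrayList<>();
    private final Map<T, Integer> firstIndex = new HashMap<>();

    // Appends the item and returns its index. The map records only the
    // first index at which an equal item was added.
    int add(T item) {
        items.add(item);
        int index = items.size() - 1;
        firstIndex.putIfAbsent(item, index);
        return index;
    }

    T get(int index) {
        return items.get(index);
    }

    // Returns the first index of an equal item, or -1 if there is none.
    int indexOf(T item) {
        Integer index = firstIndex.get(item);
        return index == null ? -1 : index;
    }

    int size() {
        return items.size();
    }
}
```

Because every mutation goes through add(), the two structures cannot drift
apart, even in subclasses of the owning class.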

Alternatively, you could read the styles from the array into a hashmap, clear
the array, maintain only the hashmap throughout the life of the style table,
and recreate the array when you need to save the style table. That saves
the memory and performance overhead, and also avoids the potential
for out-of-sync data structures, so long as you clearly mark the array as
unmaintained (maybe clear it or set it to null, or don't make it an instance
variable, plus comments).
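
A sketch of this map-only alternative, again in plain Java rather than the POI
API. MapBackedStyles is a hypothetical name, and the element type is assumed to
have stable equals() and hashCode() implementations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical map-only style table. The list read at load time is not
// retained; an ordered list is rebuilt only when saving.
final class MapBackedStyles<T> {
    // Insertion-ordered map from style to its index.
    private final Map<T, Integer> indexByStyle = new LinkedHashMap<>();

    MapBackedStyles(List<T> loaded) {
        for (T style : loaded) {
            register(style);
        }
    }

    // Returns the index of an equal style, adding the style if it is new.
    int register(T style) {
        Integer existing = indexByStyle.get(style);
        if (existing != null) {
            return existing;
        }
        int index = indexByStyle.size();
        indexByStyle.put(style, index);
        return index;
    }

    // Rebuilds the ordered list, for example just before writing to disk.
    List<T> toList() {
        return new ArrayList<>(indexByStyle.keySet());
    }
}
```

Note that duplicates in the loaded list collapse into one entry, so later
indices shift and any cell that referenced an old index would need remapping.
This sketch also gives up fast by-index lookup between load and save.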

