cowwoc opened a new issue, #951:
URL: https://github.com/apache/poi/issues/951
# Feature Request: Add XSSFOptimiser (XSSF equivalent of HSSFOptimiser)
## Summary
HSSFOptimiser provides methods to optimise cell styles and fonts in HSSF
(xls) workbooks by removing duplicates and unused entries. There is no
equivalent for XSSF (xlsx) workbooks. This feature request proposes adding
XSSFOptimiser to provide similar optimization capabilities for XSSF workbooks.
## Problem Description
When programmatically creating or modifying XSSF workbooks, it's easy to
accidentally create "style explosion" - hundreds or thousands of duplicate cell
styles and fonts. This happens when code creates new styles for each cell
instead of reusing existing styles.
For example, this common pattern causes style explosion:
```java
for (int i = 0; i < 1000; i++) {
CellStyle style = workbook.createCellStyle(); // Creates new style each
time!
Font font = workbook.createFont();
font.setBold(true);
style.setFont(font);
cell.setCellStyle(style);
}
```
Style explosion causes:
- Slow file opening in Excel
- Excel hangs or crashes with large workbooks
- Increased file size
- The famous "Too many different cell formats" error
HSSFOptimiser solves this for HSSF workbooks. No equivalent exists for XSSF.
## Proposed Solution
Add `XSSFOptimiser` class in `org.apache.poi.xssf.usermodel` package with
methods:
1. `optimiseCellStyles(XSSFWorkbook)` - Remaps cells using duplicate styles
to canonical (first) occurrence
2. `optimiseFonts(XSSFWorkbook)` - Updates style font references to
canonical fonts
### Implementation Notes
Unlike HSSF, XSSF's StylesTable maintains a separate Java list from the
underlying CTStylesheet XML. Removing entries from CTStylesheet can cause
`XmlValueDisconnectedException` on save because the Java objects become
disconnected from their XML backing.
The proposed implementation takes a **safe remapping approach**:
- Identifies duplicate styles/fonts by comparing all properties
- Remaps cells to use canonical (first occurrence) styles
- Does NOT remove entries from XML (avoids disconnection issues)
- Prevents further style explosion
This approach is safe and effective:
- Workbooks save correctly without XML exceptions
- Cells reference deduplicated styles
- Further modifications won't create additional duplicates if using
canonical styles
## Usage Example
```java
XSSFWorkbook workbook = new XSSFWorkbook(new
FileInputStream("bloated.xlsx"));
// Optimize fonts first, then styles
XSSFOptimiser.optimiseFonts(workbook);
XSSFOptimiser.optimiseCellStyles(workbook);
workbook.write(new FileOutputStream("optimized.xlsx"));
```
## Test Coverage
The implementation includes comprehensive tests:
- `testOptimiseCellStylesRemapsDuplicates` - Verifies duplicate styles are
consolidated
- `testOptimiseCellStylesHandlesUnusedStyles` - Verifies no exception on
unused styles
- `testOptimiseCellStylesPreservesDifferentStyles` - Verifies distinct
styles preserved
- `testOptimiseFontsRemapsDuplicates` - Verifies duplicate fonts are detected
- `testWorkbookSaveableAfterOptimization` - Verifies no save exceptions
- `testOptimizationPreservesFormulas` - Verifies formulas and formatting
preserved
## Contribution
I have a working implementation ready to contribute as a PR. The code:
- Follows POI coding conventions
- Includes Apache License headers
- Has full test coverage
- Is modeled after HSSFOptimiser
Would this contribution be welcome? Happy to submit a PR if so.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]