GitHub user isagi-y0ichi created a discussion: Is There a Better Architecture
Than Per-Formula Ephemeral Workbooks?
We operate a report-generation engine built on top of Apache POI (poi-ooxml
5.x, Java 17). Our workflow is:
1. An Excel (.xlsx) template is parsed ONCE at registration time using the
SAX-based XSSFReader to extract all formula cells and their PTG dependency
graphs (via FormulaParser / Ptg[]). The dependency graph is serialized
and cached. we use hacks like sheet name ending with _ex contains formulas
which are extrapolated through entire columns so special parsing is required
2. At report-generation time, data from external sources is written into
per-sheet CSV files on disk. A CSVAccessor
holds the full CSV content in a List<List<String>> in-memory cache.
3. Formulas are evaluated in topological order. For each formula node we:
a. Create a brand-new XSSFWorkbook.
b. Populate only the cells the formula depends on by reading from the
CSV cache and calling row.createCell().setCellValue().
c. Call workbook.getCreationHelper().createFormulaEvaluator()
.evaluateCellFormula(cell).
d. Write the result back to the CSV cache.
e. Call our clearAllData() method to discard the workbook and all its
internal structures, freeing
memory for the next formula evaluation.
4. After all formula nodes are evaluated, a SXSSFWorkbook (rowsInMemory=1)
is streamed out and uploaded.
For a template with 10 000 formula cells this means 10 000 XSSFWorkbook
allocations with all their internal CTWorkbook/CTStylesheet XML structures,
each discarded immediately after one formula evaluation. GC pressure is
enormous and we regularly see long pause times.
For large-range aggregates (SUM(A1:A10000000)) we wrote a custom
FormulaSplitter that breaks the PTG range into chunks:
SUM(A1:A10000000)
→ SUM(A1:A5000000) evaluated separately, result written to CSV
→ SUM(A5000001:A10000000) evaluated separately, result written to CSV
→ final SUM of the two partial results
This is completely serial, still creates one workbook per chunk, and only
supports a hard-coded set of aggregation functions (SUM, COUNT, SUMIFS,
COUNTIFS, SUBTOTAL). all this to make memory effiecient at the cost of time :( .
Is there a way to evaluate a formula in POI without materialising data into an
XSSFWorkbook at all while still leveraging evaluation provided by poi or
decoupling entire formula evaluation is the way ? was thinking of extracting
the formula ( input will always be an excel cant change that ) convert the
formulas to a map reduce ? then evaluate the same paralelly across instances
too atp idk
Any guidance, pointers to internal POI APIs, or alternative architectural
approaches
GitHub link: https://github.com/apache/poi/discussions/1061
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]