GitHub user isagi-y0ichi created a discussion: Is There a Better Architecture 
Than Per-Formula Ephemeral Workbooks?


We operate a report-generation engine built on top of Apache POI (poi-ooxml
5.x, Java 17). Our workflow is:

  1. An Excel (.xlsx) template is parsed ONCE at registration time using the
     SAX-based XSSFReader to extract all formula cells and their PTG dependency
     graphs (via FormulaParser / Ptg[]).  The dependency graph is serialized
     and cached. we use hacks like sheet name ending with _ex contains formulas 
which are extrapolated through entire columns so special parsing is required

  2. At report-generation time, data from external sources is written into 
per-sheet CSV files on disk. A CSVAccessor
     holds the full CSV content in a List<List<String>> in-memory cache.

  3. Formulas are evaluated in topological order. For each formula node we:
       a. Create a brand-new XSSFWorkbook.
       b. Populate only the cells the formula depends on by reading from the
          CSV cache and calling row.createCell().setCellValue().
       c. Call workbook.getCreationHelper().createFormulaEvaluator()
          .evaluateCellFormula(cell).
       d. Write the result back to the CSV cache.
       e. Call our clearAllData()  method to discard the workbook and all its 
internal structures, freeing
          memory for the next formula evaluation.

  4. After all formula nodes are evaluated, a SXSSFWorkbook (rowsInMemory=1)
     is streamed out and uploaded.

For a template with 10 000 formula cells this means 10 000 XSSFWorkbook
allocations with all their internal CTWorkbook/CTStylesheet XML structures,
each discarded immediately after one formula evaluation.  GC pressure is
enormous and we regularly see long pause times.

For large-range aggregates (SUM(A1:A10000000)) we wrote a custom
FormulaSplitter that breaks the PTG range into chunks:

    SUM(A1:A10000000)
    → SUM(A1:A5000000) evaluated separately, result written to CSV
    → SUM(A5000001:A10000000) evaluated separately, result written to CSV
    → final SUM of the two partial results

This is completely serial, still creates one workbook per chunk, and only
supports a hard-coded set of aggregation functions (SUM, COUNT, SUMIFS,
COUNTIFS, SUBTOTAL). all this to make memory effiecient at the cost of time :( .

Is there a way to evaluate a formula in POI without materialising data into an 
XSSFWorkbook at all while still leveraging evaluation provided by poi or 
decoupling entire formula evaluation is the way ? was thinking of extracting 
the formula ( input will always be an excel cant change that ) convert the 
formulas to a map reduce ? then evaluate the same paralelly across instances 
too atp idk

Any guidance, pointers to internal POI APIs, or alternative architectural
approaches

GitHub link: https://github.com/apache/poi/discussions/1061

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to