Apache POI, .xlsx export filter, and missing used cell iteration APIs in Calc

Damjan Jovanovic Tue, 22 Dec 2015 04:40:31 -0800

Hi

tl;dr: I began investigating what Apache POI can offer us. So far it's
tricky, most parsing is done using DOM parsers so memory usage is high,
support for the various MS Office app documents varies wildly. It would be
good for .xlsx output. Patch started, it loads and XLSX is offered in
Calc's "Save  as", now stuck on serializing data because there is no public
UNO API for iterating over only used cells.


The different POI modules vary a lot in terms of how well they support
OOXML features. Word document support (
http://poi.apache.org/document/index.html) is "moderately functional",
"very strong" for text extraction, but "limited or incomplete" for other
features, and able to write invalid files, with the page claiming no active
committers for it currently exist, no unified API for doc and docx files,
and it only supports DOM parsing. PowerPoint document support (
http://poi.apache.org/slideshow/index.html) is "in early development", has
no unified API for ppt and pptx, and only does DOM parsing. There is no
Access support. Visio: read-only low level API, not even a DOM, publisher:
read-only basic text extraction (do we care?).

Excel document support (http://poi.apache.org/spreadsheet/index.html) is
far more complete, with a unified API for xls and xlsx files, and both DOM
and SAX based parsers. The SAX parser for OOXML is disappointing though, as
it provides almost nothing and requires most XML parsing to be done
manually; its flagship DOM parser isn't built on top of that SAX parser,
but on top of xmlbeans, so it doesn't look like its parsing logic can be
reused to make a low-memory SAX-based AOO import filter. What does work for
us is the forward-only SXSSF incremental streaming output API that can
write very large xlsx files using little memory, and which provides almost
everything that the DOM API does (cell comments are apparently unsupported).

As an aside, we could use DOM parsers to make import filters, but at the
cost of large amounts of memory, since we have to build POI's DOM, and then
build our own DOM from it. It is possible to incrementally destroy POI's
DOM as ours is built, by deleting rows and sheets as they get copied to our
DOM, thus reducing memory usage significantly. The same trick cannot be
used when saving the file, since saving can't destroy the original, but
that's why we have SXSSF.

Anyway I started writing a filter that exports to .xlsx using Apache POI.
Our documentation for both filters and UNO leaves much to be desired, the
error messages produced don't make sense, debugging filters is virtually
impossible, and EXPORT-only filters seem to get ignored, but copying what
other filters did, heavily experimenting, lying that it's also an IMPORT
filter, and with Java being salvation to develop and debug in, I did
eventually get it to the point of offering the .xlsx format in Calc's "Save
as" dialog.

This always fails, as the actual serialization logic is yet to be written,
but that's where I am now stuck. To write cells, I have to iterate over
them. There is 10485768 * 1024 = 2^30 > 1 billion possible cells - per
sheet - to loop over using com.sun.star.table.XCellRange.getCellByPosition()
or equivalent APIs. Spreadsheets that are densely packed towards the top
left corner can benefit from the additional use of
com.sun.star.sheet.XUsedAreaCursor to skip the unused bottommost rows and
rightmost columns, but a simple 2 cell spreadsheet with A1=1 and
AMJ1048576=2 will still require iterating over 2^30 cells just to write
those 2 cells.

This is what several existing export filters do:
* The HTML export filter (ScHTMLExport::WriteTables() in
main/sc/source/filter/html/htmlexp.cxx) iterates over all 2^30 cells,
taking forever and writing an HTML file many gigabytes in size. This is
unavoidable, as (AFAIK) the HTML file format cannot represent sparse tables.
* The RTF export filter (ScRTFExport::WriteTab() in
main/sc/source/filter/rtf/rtfexp.cxx) also seems to iterate over all 2^30
cells.
* The Excel export filter (XclExpCellTable::XclExpCellTable() in
main/sc/source/filter/excel/xetable.cxx) uses an ScUsedAreaIterator to only
iterate over cells that exist.
* The Lotus export filter (main/sc/source/filter/lotus/expop.cxx) uses an
ScDocumentIterator to only iterate over cells that exist.
* The current OOXML filter doesn't export so it never had to face that
problem.
* The reference implementation - the ODF export filter
(ScXMLExport::_ExportContent() in main/sc/source/filter/xml/xmlexprt.cxx)
uses a custom iterator, ScMyNotEmptyCellsIterator, which is based on
ScHorizontalCellIterator, to only iterate over cells that exist.

The problem is that all of these iterators seem to be internal C++ APIs
private to Calc: none are exported over UNO, and none are available to
other languages, nor can they be implemented in higher level languages
since they rely on low-level internal Calc data structures :-(. It is
simply unbelievable that there is no public API in Calc for something as
fundamental as "loop over all used cells". Is there something I am missing?
Should I add one?

Thank you
Damjan

Apache POI, .xlsx export filter, and missing used cell iteration APIs in Calc

Reply via email to