Hi tl;dr: I began investigating what Apache POI can offer us. So far it's tricky, most parsing is done using DOM parsers so memory usage is high, support for the various MS Office app documents varies wildly. It would be good for .xlsx output. Patch started, it loads and XLSX is offered in Calc's "Save as", now stuck on serializing data because there is no public UNO API for iterating over only used cells.
The different POI modules vary a lot in terms of how well they support OOXML features. Word document support ( http://poi.apache.org/document/index.html) is "moderately functional", "very strong" for text extraction, but "limited or incomplete" for other features, and able to write invalid files, with the page claiming no active committers for it currently exist, no unified API for doc and docx files, and it only supports DOM parsing. PowerPoint document support ( http://poi.apache.org/slideshow/index.html) is "in early development", has no unified API for ppt and pptx, and only does DOM parsing. There is no Access support. Visio: read-only low level API, not even a DOM, publisher: read-only basic text extraction (do we care?). Excel document support (http://poi.apache.org/spreadsheet/index.html) is far more complete, with a unified API for xls and xlsx files, and both DOM and SAX based parsers. The SAX parser for OOXML is disappointing though, as it provides almost nothing and requires most XML parsing to be done manually; its flagship DOM parser isn't built on top of that SAX parser, but on top of xmlbeans, so it doesn't look like its parsing logic can be reused to make a low-memory SAX-based AOO import filter. What does work for us is the forward-only SXSSF incremental streaming output API that can write very large xlsx files using little memory, and which provides almost everything that the DOM API does (cell comments are apparently unsupported). As an aside, we could use DOM parsers to make import filters, but at the cost of large amounts of memory, since we have to build POI's DOM, and then build our own DOM from it. It is possible to incrementally destroy POI's DOM as ours is built, by deleting rows and sheets as they get copied to our DOM, thus reducing memory usage significantly. The same trick cannot be used when saving the file, since saving can't destroy the original, but that's why we have SXSSF. Anyway I started writing a filter that exports to .xlsx using Apache POI. Our documentation for both filters and UNO leaves much to be desired, the error messages produced don't make sense, debugging filters is virtually impossible, and EXPORT-only filters seem to get ignored, but copying what other filters did, heavily experimenting, lying that it's also an IMPORT filter, and with Java being salvation to develop and debug in, I did eventually get it to the point of offering the .xlsx format in Calc's "Save as" dialog. This always fails, as the actual serialization logic is yet to be written, but that's where I am now stuck. To write cells, I have to iterate over them. There is 10485768 * 1024 = 2^30 > 1 billion possible cells - per sheet - to loop over using com.sun.star.table.XCellRange.getCellByPosition() or equivalent APIs. Spreadsheets that are densely packed towards the top left corner can benefit from the additional use of com.sun.star.sheet.XUsedAreaCursor to skip the unused bottommost rows and rightmost columns, but a simple 2 cell spreadsheet with A1=1 and AMJ1048576=2 will still require iterating over 2^30 cells just to write those 2 cells. This is what several existing export filters do: * The HTML export filter (ScHTMLExport::WriteTables() in main/sc/source/filter/html/htmlexp.cxx) iterates over all 2^30 cells, taking forever and writing an HTML file many gigabytes in size. This is unavoidable, as (AFAIK) the HTML file format cannot represent sparse tables. * The RTF export filter (ScRTFExport::WriteTab() in main/sc/source/filter/rtf/rtfexp.cxx) also seems to iterate over all 2^30 cells. * The Excel export filter (XclExpCellTable::XclExpCellTable() in main/sc/source/filter/excel/xetable.cxx) uses an ScUsedAreaIterator to only iterate over cells that exist. * The Lotus export filter (main/sc/source/filter/lotus/expop.cxx) uses an ScDocumentIterator to only iterate over cells that exist. * The current OOXML filter doesn't export so it never had to face that problem. * The reference implementation - the ODF export filter (ScXMLExport::_ExportContent() in main/sc/source/filter/xml/xmlexprt.cxx) uses a custom iterator, ScMyNotEmptyCellsIterator, which is based on ScHorizontalCellIterator, to only iterate over cells that exist. The problem is that all of these iterators seem to be internal C++ APIs private to Calc: none are exported over UNO, and none are available to other languages, nor can they be implemented in higher level languages since they rely on low-level internal Calc data structures :-(. It is simply unbelievable that there is no public API in Calc for something as fundamental as "loop over all used cells". Is there something I am missing? Should I add one? Thank you Damjan