On Wed, Dec 23, 2015 at 3:30 PM, Carl Marcum <cmar...@apache.org> wrote:
> Hi Damjan, > > Thanks for taking this on. > > See comments below. > > Thank you. > > On 12/22/2015 07:39 AM, Damjan Jovanovic wrote: > >> Hi >> >> tl;dr: I began investigating what Apache POI can offer us. So far it's >> tricky, most parsing is done using DOM parsers so memory usage is high, >> support for the various MS Office app documents varies wildly. It would be >> good for .xlsx output. Patch started, it loads and XLSX is offered in >> Calc's "Save as", now stuck on serializing data because there is no >> public >> UNO API for iterating over only used cells. >> >> The different POI modules vary a lot in terms of how well they support >> OOXML features. Word document support ( >> http://poi.apache.org/document/index.html) is "moderately functional", >> "very strong" for text extraction, but "limited or incomplete" for other >> features, and able to write invalid files, with the page claiming no >> active >> committers for it currently exist, no unified API for doc and docx files, >> and it only supports DOM parsing. PowerPoint document support ( >> http://poi.apache.org/slideshow/index.html) is "in early development", >> has >> no unified API for ppt and pptx, and only does DOM parsing. There is no >> Access support. Visio: read-only low level API, not even a DOM, publisher: >> read-only basic text extraction (do we care?). >> >> Excel document support (http://poi.apache.org/spreadsheet/index.html) is >> far more complete, with a unified API for xls and xlsx files, and both DOM >> and SAX based parsers. The SAX parser for OOXML is disappointing though, >> as >> it provides almost nothing and requires most XML parsing to be done >> manually; its flagship DOM parser isn't built on top of that SAX parser, >> but on top of xmlbeans, so it doesn't look like its parsing logic can be >> reused to make a low-memory SAX-based AOO import filter. What does work >> for >> us is the forward-only SXSSF incremental streaming output API that can >> write very large xlsx files using little memory, and which provides almost >> everything that the DOM API does (cell comments are apparently >> unsupported). >> >> As an aside, we could use DOM parsers to make import filters, but at the >> cost of large amounts of memory, since we have to build POI's DOM, and >> then >> build our own DOM from it. It is possible to incrementally destroy POI's >> DOM as ours is built, by deleting rows and sheets as they get copied to >> our >> DOM, thus reducing memory usage significantly. The same trick cannot be >> used when saving the file, since saving can't destroy the original, but >> that's why we have SXSSF. >> >> Anyway I started writing a filter that exports to .xlsx using Apache POI. >> Our documentation for both filters and UNO leaves much to be desired, the >> error messages produced don't make sense, debugging filters is virtually >> impossible, and EXPORT-only filters seem to get ignored, but copying what >> other filters did, heavily experimenting, lying that it's also an IMPORT >> filter, and with Java being salvation to develop and debug in, I did >> eventually get it to the point of offering the .xlsx format in Calc's >> "Save >> as" dialog. >> >> This always fails, as the actual serialization logic is yet to be written, >> but that's where I am now stuck. To write cells, I have to iterate over >> them. There is 10485768 * 1024 = 2^30 > 1 billion possible cells - per >> sheet - to loop over using >> com.sun.star.table.XCellRange.getCellByPosition() >> or equivalent APIs. Spreadsheets that are densely packed towards the top >> left corner can benefit from the additional use of >> com.sun.star.sheet.XUsedAreaCursor to skip the unused bottommost rows and >> rightmost columns, but a simple 2 cell spreadsheet with A1=1 and >> AMJ1048576=2 will still require iterating over 2^30 cells just to write >> those 2 cells. >> >> This is what several existing export filters do: >> * The HTML export filter (ScHTMLExport::WriteTables() in >> main/sc/source/filter/html/htmlexp.cxx) iterates over all 2^30 cells, >> taking forever and writing an HTML file many gigabytes in size. This is >> unavoidable, as (AFAIK) the HTML file format cannot represent sparse >> tables. >> * The RTF export filter (ScRTFExport::WriteTab() in >> main/sc/source/filter/rtf/rtfexp.cxx) also seems to iterate over all 2^30 >> cells. >> * The Excel export filter (XclExpCellTable::XclExpCellTable() in >> main/sc/source/filter/excel/xetable.cxx) uses an ScUsedAreaIterator to >> only >> iterate over cells that exist. >> * The Lotus export filter (main/sc/source/filter/lotus/expop.cxx) uses an >> ScDocumentIterator to only iterate over cells that exist. >> * The current OOXML filter doesn't export so it never had to face that >> problem. >> * The reference implementation - the ODF export filter >> (ScXMLExport::_ExportContent() in main/sc/source/filter/xml/xmlexprt.cxx) >> uses a custom iterator, ScMyNotEmptyCellsIterator, which is based on >> ScHorizontalCellIterator, to only iterate over cells that exist. >> >> The problem is that all of these iterators seem to be internal C++ APIs >> private to Calc: none are exported over UNO, and none are available to >> other languages, nor can they be implemented in higher level languages >> since they rely on low-level internal Calc data structures :-(. It is >> simply unbelievable that there is no public API in Calc for something as >> fundamental as "loop over all used cells". Is there something I am >> missing? >> Should I add one? >> > Sorry code is Groovy so casts and semicolons need added > > Could you use a range container to collect used ranges... > > com.sun.star.sheet.XSheetCellRangeContainer xRangeCont > =UnoRuntime.queryInterface( > com.sun.star.sheet.XSheetCellRangeContainer.class, > xDocFactory.createInstance("com.sun.star.sheet.SheetCellRanges")) > > // query addresses of all cells containing text > com.sun.star.sheet.XCellRangesQuery xRangesQuery = > UnoRuntime.queryInterface(com.sun.star.sheet.XCellRangesQuery.class, sheet) > > // com.sun.star.sheet.CellFlags constants can be added together for a > total of 1023 > com.sun.star.sheet.XSheetCellRanges xCellRanges = > xRangesQuery.queryContentCells((short)1023) > println("Cells containing ANY: " + xCellRanges.getRangeAddressesAsString()) > > // add xCellRanges to xRangeCont > xRangeCont.addRangeAddresses(xCellRanges.getRangeAddresses(), false) > > // use Enumeration to iterate over cells > com.sun.star.container.XEnumerationAccess xCellsEA = xRangeCont.getCells() > com.sun.star.container.XEnumeration xEnum = xCellsEA.createEnumeration() > while (xEnum.hasMoreElements()) { > Object aCellObj = xEnum.nextElement() > com.sun.star.sheet.XCellAddressable xAddr = > UnoRuntime.queryInterface(com.sun.star.sheet.XCellAddressable.class, > aCellObj) > com.sun.star.table.CellAddress aAddr = xAddr.getCellAddress() > .... > } > > That will probably work, but reading its implementation in ScCellRangesBase::queryContentCells() of main/sc/source/ui/unoobj/cellsuno.cxx, it seems to iterate over all the cells adding them to a range list which it then returns, which means it uses O(n) memory, unlike the iterators which discover cells while iterating, thus using only O(1). Also the order of cells returned needs to be in order from top to bottom, since Apache POI's SXSSF API is forward-only: you can't read or write data in the sheet earlier than a certain maximum number of rows before the last written one. For now I've implemented it by iterating over the minimum used area returned by XUsedAreaCursor, which still has a worst case of 2^30 cells per sheet which takes over 2 hours to save. Currently, text and value cells (without formatting) are successfully saved to .xlsx :-), but I have to fix that bug where EXPORT-only filters get ignored, as the file can't be reopened from the recently used file list, probably because it's trying my filter which can't import but has to lie it can. > Thanks, > Carl > > Damjan
