Re: Writer and .docx

Damjan Jovanovic Fri, 16 Oct 2020 06:23:24 -0700

On Fri, Oct 16, 2020 at 2:05 PM Dave Fisher <wave4d...@comcast.net> wrote:

> Hi -
>
> Sent from my iPhone
>
> > On Oct 16, 2020, at 4:04 AM, Mechtilde <o...@mechtilde.de> wrote:
> >
> > Hello Joost,
> >
> > I'm very happy to read from you.
> >
> >> Am 16.10.20 um 12:50 schrieb Joost Andrae:
> >> Hi Simon,
> >>
> >> it's an honor to me to see a sign of life of you here. Welcome !
> >>
> >> Instead of user picking here to get users leave from AOO to LO a
> >> developer could create a Java based OOo/LO extension that uses Apache
> >> POI to export OpenDocument type documents to MSXML formats by using the
> >> binary MSO export to export those documents to the MSXML format in
> >> between. Or maybe it's possible to XSL this document format by using
> >> OpenOffice together with Apache POI. Using XSL scripts (in AOO menu item
> >> XML filter settings) to make document conversions is possible within
> OOo.
> >
> > I offer my help to test the implementation. sorry but I'm not a
> > programmer. So we as the project need help from Java programmers to work
> > on it and contribute it.
>
> I’m a PMC Member of Apache POI for over 12 years. My team donated the
> initial PowerPoint support and were involved in the initial support for
> OOXML.
>
> POI is embedded into Apache Solr and Tika along with commercial products.
> The project took over the dormant XMLBeans project and is releasing a 4.0
> that supports modern Java.
>
> An OSGi bundle of POI will be available in the next release if you build
> from source.
>
> The Tika, POI, and PDFBox projects maintain a large regression corpus
> scraped from the internet using CommonCrawl. I’m sure that this could be
> shared in one way or another.
>
> Regards,
> Dave
>
>
Hi

I did start writing a POI-based OOXML export filter for AOO some years ago
(search the dev mailing list), and got it to the point of being able to
save very basic spreadsheets (no formulas, no formatting, just text and
numbers).

There were several major problems with using POI.

Firstly, the code in POI is at various stages of completeness. The legacy
XLS filter is very good, supports SAX parsing, etc. The DOC filter is
minimal and unmaintained. What we would need, the OOXML filter for at least
XLSX, is somewhere in between. AFAIK it only supports DOM parsing, meaning
everything needs to be in memory before it can be written to disk, so a big
spreadsheet could consume gigabytes of RAM during saving, and if you don't
have enough memory free, you can't save!
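
To illustrate the difference between DOM and streaming output, here is a
minimal sketch of a streaming writer. It uses the JDK's own StAX
XMLStreamWriter, not POI. The class and method names are hypothetical, and
the element names (sheetData, row, c, v) are only loosely modelled on
SpreadsheetML: the output has no namespaces and is not packaged in a ZIP,
so it is not a valid XLSX part. The point is that each row goes to the
underlying Writer as it is produced, so no tree of the whole sheet has to
be built first.

```java
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class StreamingRowsSketch {

    /** Writes rows x cols numeric cells, emitting one element at a time. */
    public static void writeSheet(Writer out, int rows, int cols)
            throws XMLStreamException {
        XMLStreamWriter xml =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("sheetData");
        for (int r = 1; r <= rows; r++) {
            xml.writeStartElement("row");
            xml.writeAttribute("r", Integer.toString(r));
            for (int c = 0; c < cols; c++) {
                xml.writeStartElement("c");
                xml.writeStartElement("v");
                // Dummy cell value; a real filter would read it from the document.
                xml.writeCharacters(Integer.toString(r * cols + c));
                xml.writeEndElement(); // v
                xml.writeEndElement(); // c
            }
            xml.writeEndElement(); // row
        }
        xml.writeEndElement(); // sheetData
        xml.writeEndDocument();
        xml.close();
    }

    public static void main(String[] args) throws Exception {
        // A StringWriter keeps the demo small; a file-backed Writer would
        // let the rows leave memory as they are written.
        StringWriter out = new StringWriter();
        writeSheet(out, 3, 2);
        System.out.println(out);
    }
}
```

With a buffered file writer in place of the StringWriter, the writer itself
holds little more than the current element, however many rows are written.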

Also, I do use POI at work, and it's outstanding for parsing spreadsheets
(it can even parse some that AOO can't), but it's very memory hungry. A
spreadsheet with 100000 rows consumed 6 GB of RAM, compared to 200 MB in LO
(30 times less). That isn't really POI's fault: Java has too much
per-object overhead and there are a great many objects in a spreadsheet
that big. So DOM + Java really do not add up to efficient memory usage. By
comparison, our current OOXML reading is not only SAX-based, but converts
XML tags to integers for faster comparisons and lower memory usage.
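
The tag-to-integer idea can be sketched in Java as follows. This is not
AOO's actual code (that is C++); the class name, token constants and
element names below are all made up for illustration. Each element name is
looked up once in a table, and the handler then branches on a small
integer instead of comparing strings repeatedly.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TokenizedSaxSketch {
    // Hypothetical token values; a real filter would have hundreds.
    static final int TOKEN_UNKNOWN = 0;
    static final int TOKEN_ROW = 1;
    static final int TOKEN_CELL = 2;

    static final Map<String, Integer> TOKENS = new HashMap<>();
    static {
        TOKENS.put("row", TOKEN_ROW);
        TOKENS.put("c", TOKEN_CELL);
    }

    /** Returns {rowCount, cellCount} without building a document tree. */
    public static int[] countRowsAndCells(String xml) throws Exception {
        final int[] counts = new int[2];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // One table lookup per tag, then integer comparisons only.
                int token = TOKENS.getOrDefault(qName, TOKEN_UNKNOWN);
                switch (token) {
                    case TOKEN_ROW:
                        counts[0]++;
                        break;
                    case TOKEN_CELL:
                        counts[1]++;
                        break;
                    default:
                        break; // tags we do not care about
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<sheetData><row><c><v>1</v></c><c><v>2</v></c></row>"
                   + "<row><c><v>3</v></c></row></sheetData>";
        int[] counts = countRowsAndCells(xml);
        System.out.println(counts[0] + " rows, " + counts[1] + " cells");
    }
}
```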

Finally, AOO itself had limitations that made developing a filter in Java
difficult. Each sheet in a spreadsheet has about 1 billion cells. Obviously only
a minority of these contain data - most are empty. In C++ there are special
iterators that can be used to access only the non-empty cells, but these
are not exposed to UNO, or through it, to Java. The only way to tell which
cells are in use is to iterate over all 1 billion cells (per sheet), which
is hopelessly slow.
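
A toy model shows why a non-empty-cell iterator matters. The sketch below
is neither AOO's internal storage nor a UNO API: the class and its methods
are hypothetical, and the sheet dimensions (1048576 rows by 1024 columns)
are an assumption chosen so that the product is the roughly 1 billion
cells mentioned above. Cells are kept in a sorted map keyed by a
row-major index, so a walk over the map touches only cells that were
actually set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SparseSheetSketch {
    // Assumed sheet dimensions; together they give 2^30 addresses.
    public static final int ROWS = 1_048_576;
    public static final int COLS = 1_024;
    // What a client without a sparse iterator would have to probe.
    public static final long DENSE_ADDRESSES = (long) ROWS * COLS;

    // Only non-empty cells are stored, keyed by row * COLS + col.
    private final TreeMap<Long, String> cells = new TreeMap<>();

    public void set(int row, int col, String value) {
        cells.put((long) row * COLS + col, value);
    }

    /** Lists non-empty cells as "row,col=value" in row-major order. */
    public List<String> nonEmptyCells() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : cells.entrySet()) {
            long key = e.getKey();
            out.add((key / COLS) + "," + (key % COLS) + "=" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        SparseSheetSketch sheet = new SparseSheetSketch();
        sheet.set(0, 0, "a");
        sheet.set(1_000_000, 1_023, "z");
        sheet.set(5, 2, "m");
        System.out.println("sparse walk visits " + sheet.nonEmptyCells().size()
                + " cells; a dense scan would probe " + DENSE_ADDRESSES);
    }
}
```

The sparse walk costs time proportional to the number of stored cells,
while the dense scan pays for every address whether or not it holds data.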

Some of these problems can be solved. We can expose the cell iterators over
UNO. The memory usage might not matter that much in practice, and we could
patch POI to do SAX parsing/saving at a later stage. But users expect
fonts, styles, charts, images, custom formats, OLE, pivot tables, VBA
macros, form controls, mathematical formulas, change tracking, etc. all
saved losslessly and 100% compatible with Excel, which requires work not
only in the filter but in the rest of AOO too, and POI probably doesn't
support all of those features either.

I might get back into this next month, especially if others want to
collaborate, but don't expect something generally usable, let alone
Excel-quality XLSX saving, any time soon.

Regards
Damjan
