https://bugs.documentfoundation.org/show_bug.cgi?id=99015

            Bug ID: 99015
           Summary: FILESAVE: LO XML grows so complex it's not
                    human-editable
           Product: LibreOffice
           Version: 5.1.1.3 release
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: UNCONFIRMED
          Severity: major
          Priority: medium
         Component: Writer
          Assignee: [email protected]
          Reporter: [email protected]

https://bugs.documentfoundation.org/show_bug.cgi?id=90540 reports the horribly
complex HTML (or XHTML) produced by LibreOffice.  Every version since I
reported the problem has been basically as bad as the previous one.  The HTML
produced, and indeed the XML of the .odt file itself if you unzip it and
examine it, is horribly complex.

Note that the person investigating could not reproduce the complex HTML.  Could
someone please investigate further?  Perhaps it is not every single edit
operation that generates fresh spans and fresh paragraph styles, but something
is certainly causing "text fragmentation" that's reminiscent of Windows'
horrible disk fragmentation which it was famous for (unlike Linux's clever
anti-fragmentation logic).

I'm confident that any LO document which you edit for a while, if you have a
look at the XML used for its save format, will be far, far more complex than
it needs to be.

It appears that every(?) edit breaks text into new spans.  Many of these new
spans are assigned a specially-created new paragraph style that's redundant
(identical to the original). So in a long document you end up with thousands of
paragraph styles, even though there may be genuinely only a handful of
different styles. 
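
As a rough way to check this on a given file, here is a minimal Python sketch
that groups the automatic styles in an .odt's content.xml by everything except
their name, so that redundant copies show up as groups of two or more.  It
assumes the styles of interest are the style:style elements under
office:automatic-styles; the function names are my own, not anything from LO.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard ODF namespace URIs.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
STYLE_NAME = f"{{{STYLE_NS}}}name"


def load_content_xml(odt_path):
    """Read content.xml out of an .odt file (which is a zip archive)."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read("content.xml")


def style_signature(style):
    """Hashable description of a style element that ignores its own name."""
    def sig(el):
        attrs = tuple(sorted(
            (k, v) for k, v in el.attrib.items()
            if not (el is style and k == STYLE_NAME)
        ))
        return (el.tag, attrs, tuple(sig(child) for child in el))
    return sig(style)


def duplicate_automatic_styles(content_xml):
    """Return lists of automatic style names whose definitions are identical."""
    root = ET.fromstring(content_xml)
    auto = root.find(f"{{{OFFICE_NS}}}automatic-styles")
    groups = defaultdict(list)
    if auto is not None:
        for style in auto.findall(f"{{{STYLE_NS}}}style"):
            groups[style_signature(style)].append(style.get(STYLE_NAME))
    return [names for names in groups.values() if len(names) > 1]
```

Something like duplicate_automatic_styles(load_content_xml("document.odt"))
(the file name is a placeholder) would then list the groups of styles that
differ only in name.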

The results of this are that:

- If you wish to go into the XML file to work around some LO bug (e.g. to avoid
https://bugs.documentfoundation.org/show_bug.cgi?id=62603#add_comment),
generally speaking it's not feasible: the text is so broken up into separate
text spans, with paragraph styles defined elsewhere, that you can't do any
useful regexp searches or fixes.

- The files are much bigger than they need to be (wasting bandwidth when
transmitted - especially if LO was used to generate HTML pages) as well as
local storage space.

- I believe this would also cause a performance hit within LO, since instead of
having, say, a single paragraph with all the text upon which you're going to do
some operation, and knowing it's genuinely a single style, the text may be
broken up into dozens of separate spans which must be iterated over, no doubt
often needing to check whether the style has changed (when usually it will not
have).

- It confuses many other systems that expect relatively simple XML, HTML, or
XHTML as input - especially since the auto-generated redundant paragraph styles
used in the spans are defined elsewhere in the document, so if you paste just a
selection of text, strange things can happen.

So please look into this matter.  I think there will be a lot of flow-on
benefits from addressing it; even if it is just to introduce a new edit
operation called "Defragment" (or "optimise" or something else that's less
embarrassing than "Defragment").  Best of all, IMHO, would be if LO used
similar smarts to the ext2 and later filesystems to avoid breaking units of
data into smaller pieces unnecessarily.  E.g., one useful step would be not to
create a new text style until it's needed: set the style to use the current
style until some style change is actually made.  Better still would be not to
split the text into a fresh span unless the style is changed.
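
To make the "Defragment" idea concrete, here is a minimal Python sketch of the
simplest case: joining neighbouring text:span elements that carry exactly the
same attributes (so the same style name) and have no text between them.  This
is only an illustration run against the saved XML, not how LO works
internally, and merge_adjacent_spans is a hypothetical name.  It does not try
to merge spans whose styles are merely identical in content but differently
named.

```python
import xml.etree.ElementTree as ET

# Standard ODF text namespace.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = f"{{{TEXT_NS}}}span"


def merge_adjacent_spans(parent):
    """Merge directly adjacent spans with identical attributes, in place.

    Returns the number of spans that were folded into a neighbour.
    """
    merged = 0
    prev = None
    for child in list(parent):
        mergeable = (
            prev is not None
            and prev.tag == SPAN
            and child.tag == SPAN
            and not prev.tail            # no text between the two spans
            and prev.attrib == child.attrib
        )
        if mergeable:
            # Append the second span's leading text to the end of the first.
            if len(prev):
                last = prev[-1]
                last.tail = (last.tail or "") + (child.text or "")
            else:
                prev.text = (prev.text or "") + (child.text or "")
            prev.extend(list(child))     # carry over any nested elements
            prev.tail = child.tail       # keep the text that followed
            parent.remove(child)
            merged += 1
        else:
            prev = child
    # Recurse afterwards so newly adjacent nested spans are merged too.
    for child in parent:
        merged += merge_adjacent_spans(child)
    return merged
```

Calling this on each text:p element of content.xml leaves the visible text
unchanged while reducing the number of spans.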

Please, I beg someone, look into this!
