https://bugs.documentfoundation.org/show_bug.cgi?id=152143
V Stuart Foote <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |libreoffice-ux-advise@lists.freedesktop.org,
                   |                            |[email protected]
             Status|UNCONFIRMED                 |RESOLVED
           See Also|                            |https://bugs.documentfoundation.org/show_bug.cgi?id=151598,
                   |                            |https://bugs.documentfoundation.org/show_bug.cgi?id=117428
           Keywords|                            |needsUXEval
         Resolution|---                         |DUPLICATE

--- Comment #1 from V Stuart Foote <[email protected]> ---
(In reply to Hossein from comment #0)
> Description:
> Currently it is not possible to export PDF files loaded in LibreOffice
> (Draw) to text.

Not true. LO currently has the 'Consolidate text' feature; see the work done for bug 118370 [1]. It is functional, just inconvenient for moving PDF imported text to the Writer canvas for filter export.

And this is a dupe of bug 32249, or at most of bug 151598 to implement 'Consolidate text' on the Writer canvas.

In a reasonable workflow, we now take an imported PDF (opened via Draw) onto the Draw vcl canvas. The textboxes representing the text streams read out from the PDF structures are placed discretely onto the vcl canvas.

So you can already select entire pages of imported draw shape textboxes (their text resolved by glyph index lookup in a ToUnicode CMAP) and consolidate them into a single draw shape textbox holding a sentence or paragraph of text. Then select that text, copy it and paste it as needed, and correct it as lexically necessary.

Also, because PDF provides no lexical sense to the runs in a document (it is a published presentation format), the discrete imported draw shape text boxes *must be selected in sequence* for a manual merge. That would remain the case when working with draw shape textboxes on the Writer canvas, and is a limitation of the published rendering encoded into PDF.

PDF provides an /ActualText construct that could be used more effectively than index lookup on a Unicode CMAP.

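To illustrate the two points above (text resolved per glyph through a ToUnicode lookup, with /ActualText as a possible override, and merge order following selection order), here is a minimal Python sketch. It is not LibreOffice or poppler code: `TextRun`, `decode_run` and `consolidate` are hypothetical names, and the ToUnicode CMAP is simplified to a plain dict from glyph index to string.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TextRun:
    """Hypothetical stand-in for one imported draw shape textbox."""
    glyph_ids: List[int]
    actual_text: Optional[str] = None  # models a PDF /ActualText entry, if present


def decode_run(run: TextRun, to_unicode: Dict[int, str]) -> str:
    """Prefer /ActualText when available, else look up each glyph index."""
    if run.actual_text is not None:
        return run.actual_text
    # Unmapped glyphs become U+FFFD so gaps stay visible for manual correction.
    return "".join(to_unicode.get(g, "\ufffd") for g in run.glyph_ids)


def consolidate(selected: List[TextRun], to_unicode: Dict[int, str], sep: str = " ") -> str:
    """Join runs strictly in the order given; nothing here recovers reading order."""
    return sep.join(decode_run(r, to_unicode) for r in selected)


# Assumed toy mapping, not taken from any real PDF.
cmap = {1: "H", 2: "e", 3: "l", 4: "o"}
hello = TextRun([1, 2, 3, 3, 4])
world = TextRun([9, 9], actual_text="world")  # glyph 9 is unmapped; /ActualText wins

in_order = consolidate([hello, world], cmap)
out_of_order = consolidate([world, hello], cmap)
```

Since the merge only concatenates in selection order, selecting the boxes out of sequence yields scrambled text, which is the limitation described above.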
For bug 66597, LibreOffice PDF export filter support for the /ActualText construct is already in place [2] for PDF creation, but only to the grapheme cluster run. Bug 117428 is open to refactor PDF export to provide /ActualText at the word boundary.

What is unclear is how our poppler PDF import filter(s) would need to be refactored to use the lexical details to load draw shape textboxes with /ActualText, whether for roundtrip or for import of PDF from other sources.

More efficient, high-fidelity text extraction from PDF into ODF paragraphs is the end goal of bug 32249. Export of a lexically correct word, sentence or paragraph to other document formats then becomes routine export filtering that is already present.

=-ref-=
[1] https://gerrit.libreoffice.org/c/core/+/75043/
[2] https://gerrit.libreoffice.org/c/core/+/53315/

*** This bug has been marked as a duplicate of bug 32249 ***

--
You are receiving this mail because:
You are the assignee for the bug.
