[Bug 165931] Regular expressions must be able to match non-break line endings

bugzilla-daemon Fri, 28 Mar 2025 18:30:53 -0700

https://bugs.documentfoundation.org/show_bug.cgi?id=165931


László Németh <nem...@numbertext.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
           Priority|medium                      |low
             Status|UNCONFIRMED                 |ASSIGNED
           Severity|normal                      |enhancement
           Assignee|libreoffice-b...@lists.free |nem...@numbertext.org
                   |desktop.org                 |

--- Comment #12 from László Németh <nem...@numbertext.org> ---
@Eyal: thanks for your suggestion and persistence! I have found more compelling
examples and an opportunity for extension; see the end of my comment (after my
long hesitation).

(In reply to Eyal Rozenberg from comment #7)
> (In reply to László Németh from comment #5)
> > Regex library search operates on the plain text conversion of the document,
> 
> First, the presumption to treat a document as a sequence of sequences of
> (paragraph-level) plain text characters - is itself a bug. A document is not
> that. Other office suites, like MSO for example, do not presume to limit
> searches to plain-text representations.

Any known implementation helps to prove the legitimacy of an enhancement. I've
tried to check Adobe InDesign's GREP regexes, which are based on the Boost
library. MSO uses (or used) a simpler regex-like pattern matching; I haven't
checked it yet.

There is a clear requirement for matching the end (and start) of lines in
the case of optical margin. I've added something in Linux Libertine G, using
Graphite's regex-like pattern matching, but that was only one possible solution
to a very special problem.

> 
> Would you rather we also filed that as an underlying bug which blocks many
> of the feature requests?
> 
> > Fortunately there are possible solutions or workarounds: etc. etc.
> 
> Laszlo, with respect - suggestions involving complicated procedures,
> certainly exporting PDFs and working outside the app, are not a way that
> LibreOffice "works for you". There are workarounds to lots of bugs - that
> does not invalidate them.

The problem is that even my most difficult solution is still less complicated
than a possible core implementation of the proposed feature, which would break
rules, i.e. the normal behavior of regex search, by mixing different
*standards*/areas of Writer core, as Mike pointed out. Designing and
implementing a new regex search that cannot support replace, or adding a new
layer that handles the layout update during replace, is an unaffordable
price for the importance/interoperability of this feature as known so far
[i.e. before changing my mind :].

On the other hand, UNO's XLineCursor is a perfect API for this and much more. I
have almost given a ready macro solution, which can be the core of an
add-on. An experienced LibreOffice add-on developer can finish it within a few
days.

> 
> Given what you've said, please consider confirming.

My pleasure. Moreover, I just realized that a layout regex with an extended
syntax seems highly useful for my upcoming typography developments: adding four
regex layout boundary marks for 1) line end, 2) column end, 3) page end and
4) spread end, and a fifth mark for hyphenation, e.g. \L, \C, \P, \S and \H
(some of these may already be used by ICU regex). Using them in the regex
pattern in Find&Replace would automatically switch the search to the layout
text export instead of the current one (document model).

For example:

1) Search for too-short last lines of paragraphs (ten or fewer characters, a
real typographical problem):

\L[^\L]{1,10}\n

2) Search for too-short hyphenation at the end of pages (also a real
typographical problem):

\b\w{1,2}\H\P

3) Search for five or more consecutive hyphenated lines (also an issue):

([^\L\C]*\H){5,}

4) Select (and count) the hyphenated lines:

[^\L\P]*\H 
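
(A minimal sketch of how such marks could be tried out outside of Writer, in
Python. Everything in it is an assumption made for illustration, not part of
the proposal or of any LibreOffice API: the helper names layout_text and
compile_layout_regex, the private-use code points standing in for the marks,
and the rule that a hyphenated line end emits \H instead of \L, followed by
\C, \P or \S where a larger boundary also falls there. Note that the
translation shadows the usual regex meaning of \S.)

```python
import re

# Assumption: private-use code points stand in for the proposed layout marks.
MARKS = {"L": "\uE000", "C": "\uE001", "P": "\uE002", "S": "\uE003", "H": "\uE004"}


def layout_text(paragraphs):
    """Flatten laid-out paragraphs into one string carrying layout marks.

    Each paragraph is a list of (text, marks) pairs, one per visual line;
    marks is a string over "LCPSH" naming the boundaries at that line end.
    Every paragraph is terminated by a newline character.
    """
    out = []
    for lines in paragraphs:
        for text, marks in lines:
            out.append(text)
            out.extend(MARKS[m] for m in marks)
        out.append("\n")
    return "".join(out)


def compile_layout_regex(pattern, flags=0):
    """Replace the \\L, \\C, \\P, \\S, \\H tokens by their mark characters.

    Escapes are scanned pairwise, so an escaped backslash followed by a
    letter is left alone and all other escapes (\\b, \\w, \\n) pass through.
    """
    def swap(m):
        return MARKS.get(m.group(1), m.group(0))

    return re.compile(re.sub(r"\\(.)", swap, pattern), flags)


# Hypothetical layout: two paragraphs, the second one broken across a page end.
paragraphs = [
    [("The typo", "H"), ("graphic example wraps", "L"), ("onto the next line.", "")],
    [("A second paragraph is un", "HP"), ("remarkable and ends", "L"), ("so.", "")],
]
text = layout_text(paragraphs)

# Example 1: last paragraph lines of ten or fewer characters.
short_last_lines = [
    m.group().strip(MARKS["L"] + "\n")
    for m in compile_layout_regex(r"\L[^\L]{1,10}\n").finditer(text)
]

# Example 2: one- or two-letter fragment hyphenated at a page end.
short_hyphenation = [
    m.group().rstrip(MARKS["H"] + MARKS["P"])
    for m in compile_layout_regex(r"\b\w{1,2}\H\P").finditer(text)
]

print(short_last_lines)
print(short_hyphenation)
```

With this sample layout, the first pattern finds only the last line of the
second paragraph, and the second finds only the fragment "un" that is
hyphenated at the page end.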

It's not clear yet whether this is the best way to solve my problems, but I
like the freedom it would give us to analyse and adjust typography.

"The best fixes are the ones you get for free by fixing something else :-)" :)
(https://bugs.documentfoundation.org/show_bug.cgi?id=108025#c18).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
