Re: RFR: 8346118: Improve whitespace normalization in preformatted text

Hannes Wallnöfer Wed, 05 Mar 2025 12:49:17 -0800

On Mon, 3 Mar 2025 16:41:18 GMT, Hannes Wallnöfer <[email protected]> wrote:


> Please review an enhancement to make `DocCommentParser` normalize whitespace 
> inside `<pre>` elements. The normalization is conceptually simple and and 
> intended to be minimally invasive. Before parsing, `DocCommentParser` checks 
> whether the text is a traditional doc comment and whether every line starts 
> with a space character, which is commonly the case in traditional doc 
> comments. If so, a single leading space is removed in block content (top 
> level text and `{@code}`/`{@literal}` tags) when parsing within HTML `<pre>` 
> tags.
> 
> This fixes the incidental one-space indentation in the vast majority of JDK 
> code samples using `<pre>` alone or in combination with `<code>` or 
> `{@code}`. In fact, I only found one code sample in JDK code that isn't 
> solved by this change, for which I included a fix in this PR (it's in 
> `String.startsWith(String, int)`, where I replaced the 10 char indentation 
> and trailing line with a `<blockquote>`). 
> 
> The many added `boolean inBlockContent` arguments pased around in 
> `DocCommentParser` are to make sure the removal is not applied to multiline 
> inline content, which is maybe a bit fussy considering there is not a lot of 
> multiline inline content in `<pre>` tags and it usually would not mind about 
> removal of a non-essential space character, but I wanted to keep the change 
> minimal. There are few javadoc tests that had to be adapted, most of the 
> testing is done in `test/langtools/tools/javac/doctree`. 
> 
> If the exact number of leading whitespace in `<pre>` tags is important to any 
> javadoc user the old output can be restored by increasing the indentation by 
> 1. There will be a release note for this of course. 
> 
> Unfortunately, there is another whitespace problem that can't be solved as 
> easily, and that is a leading blank line caused by `<pre><code>\n` open tags. 
> Browsers will [ignore a newline immediately following a `<pre>` tag][1], but 
> not if there is a `<code>` tag in between. There are hundreds of occurrences 
> of this in JDK code, including variants with space characters mixed in. The 
> fix in javadoc proper would be too complex, so I decided to solve it with 3 
> lines of JavaScript and a regex to reverse the order of `<code>\n` at the 
> beginning of `<pre>` tags while removing any intermediary space. Script 
> operation is indiscernible and it solves the problem.
> 
> [1]: https://html.spec.whatwg.org/#the-pre-element:the-pre-element

Thanks for reviewing this, Jon. I agree there are two issues that maybe shold 
be handled separately. 

Regarding the first issue, stripping leading space in `DocCommentParser`, I did 
in fact start out processing the whole doc comment. I was pretty far into 
fixing tests when I realized that this approach breaks our ability to get a 
diagnostic position from a string extracted from the comment. The source 
postion mapping is generated in `JavadocScanner`, and it would be quite complex 
to keep track of whitespace removed after the fact and correctly translate to a 
source position. 

The test that made me realize this problem was [TestDocTreeDiags.java], which 
tests reporting positions inside strings extracted from the comment (added in 
[JDK-8268420]) , but with wholesale whitespace removal also start positions are 
affected. I should have mentioned this in the PR description, but it was 
already long enough as it was.

[TestDocTreeDiags.java]: 
https://github.com/openjdk/jdk/blob/master/test/langtools/jdk/javadoc/doclet/testDocTreeDiags/TestDocTreeDiags.java
[JDK-8268420]: https://bugs.openjdk.org/browse/JDK-8268420

This is why I reverted to the proposed minimalistic solution, limiting space 
removal to text and `{@code}`/`{@literal}` tags within `<pre>` elements. These 
are all doctrees that are passed though without further processing, so it is 
highly unlikely that anybody would ever want to report a source position inside 
of them. The begin postions will be correct since we're parsing the unchanged 
doc comment, but positions inside the string after the first line break would 
report a wrong position. This is why I filed [a CSR to amend the spec for above 
mentioned method][CSR] with a note about this. 

[CSR]: https://bugs.openjdk.org/browse/JDK-8350428

Regarding the second issue, I agree the JavaScript solution is not ideal. But 
it's a small price to pay in order to get readable and good looking 
documentation. We have close to 1000 instances of `<pre>{@code}` in JDK source 
code alone, and they will never get fixed and continue to haunt our otherwise 
great documentation. The same problem affects most Java libraries out there. 
You see I'm quite passionate about this, but I agree to move it to a separate 
issue if you think that's the best way to handle this.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23868#issuecomment-2702033467

Re: RFR: 8346118: Improve whitespace normalization in preformatted text

Reply via email to