Re: Musings on 8232447: The javadoc parser ends the first sentence of a comment too soon

Pavel Rappo Wed, 13 May 2020 13:51:42 -0700

Jon, here's an idea to ponder. A spin-off of the issue in question. What if we 
could mitigate the shortcomings of the {@summary} tag by allowing it to be a 
block tag too? I mean can we make it bimodal?


    /**
     ...
     *
     * @summary Returns sqrt(<i>x</i><sup>2</sup>&nbsp;+<i>y</i><sup>2</sup>)
     * without intermediate overflow or underflow.
     *
     ...
     * @since 1.5
     */
    public static double hypot(double x, double y)

If we do that, it could make @summary a complete solution for any case in 
*new* code, no matter how twisted that case is. Authors would get a better tool 
for structuring doc comments, the ability to use whatever markup or 
formatting they want in a summary section, and accurate and predictable 
parsing. I guess it would've been considered for JDK-8173425, had we had 
bimodal tags back then.

On the other hand, I can imagine inadvertently introducing another sort of 
error, due to unterminated contents:

    /**
     * @summary First sentence and the summary of this doc comment.
     *
     * Second sentence. Third sentence. As you can see, there are no other
     * block tags in that doc comment.
     */
    public void f()

-Pavel

> On 13 May 2020, at 20:01, Jonathan Gibbons <[email protected]> 
> wrote:
> 
> 
> On 5/13/20 11:41 AM, Pavel Rappo wrote:
>> Thanks for chiming in, Roger.
>> 
>>> On 13 May 2020, at 18:30, Roger Riggs <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> The first sentence is not just any old sentence.
>>> It has a very specific role to play in the javadoc both to introduce the 
>>> class, method, field, etc.
>>> AND to stand independently when used in a summary.
>>> That places a responsibility on the author to craft the sentence for those 
>>> purposes.
>>> The author should review their work in the generated javadoc, the summary 
>>> tables, etc.
>>> before feeling satisfied and moving on.
>>> IMHO the first sentence should be short and to the point and not include 
>>> markup or
>>> extra explanatory phrases (such as e.g.).
>> 1. Just to be clear. Does this fall into the "SHOULD" or the "MUST" 
>> category? If the latter, then this MUST be specified. Probably differently 
>> than what we have today in the Documentation Comment Specification for the 
>> Standard Doclet [^1]:
> SHOULD, not MUST.
>> 
>>> The first sentence of the initial description should be a summary sentence 
>>> that contains a concise but complete description of the declared entity. 
>>> Descriptive text may include HTML tags and entities, and inline tags as 
>>> described below.
>> If this is the former, then we need more guidance. Perhaps plenty of 
>> examples, including DOs and DON'Ts, as summarizing a complete doc comment 
>> into a single sentence can be challenging. Especially if we disallow markup, 
>> restrict formatting, and disapprove familiar tools, such as abbreviations, 
>> which are freely used in written language.
>> 
>> Come to think of it, if it is that important then we should think of 
>> teaching doclint (or some other tool) to check that.
> Maybe. doclint was primarily about detecting issues that lead to bad files 
> being generated, and less about the style of the content. That's not to say 
> we can't change/update the focus, but IMO style is better addressed with 
> human processes like reviews and CSR.
>> 
>> 2. We should think about what to do with doc comments not following those 
>> rules (conventions?) in the OpenJDK codebase.
>> 
>>> I don't think the tools should try to be as understanding as
>>> the reader or to compensate for the shortcomings of the author.
>> Neither do I and I believe I made my position clear in that text.
>> 
>> -Pavel
>> 
>> [^1]: 
>> https://docs.oracle.com/en/java/javase/14/docs/specs/javadoc/doc-comment-spec.html
>> 
>>> $.02, Roger
>>> 
>>> 
>>> On 5/13/20 12:20 PM, Jonathan Gibbons wrote:
>>>> Pavel,
>>>> 
>>>> Good write up.   You should link to this from 8232447.
>>>> 
>>>> -- Jon
>>>> 
>>>> On 5/13/20 7:44 AM, Pavel Rappo wrote:
>>>>> The issue:
>>>>> 
>>>>>      https://bugs.openjdk.java.net/browse/JDK-8232447
>>>>> 
>>>>> The more I think about this issue, the less I feel like solving it. On 
>>>>> the one hand, that problem is more complicated than it looks. On the 
>>>>> other hand, solving that problem doesn’t seem to be that important since 
>>>>> it’s about making a best effort to improve presentation. I'm leaning 
>>>>> towards a solution that is good-enough (possibly, the one that we already 
>>>>> have) or reconsidering the problem altogether.
>>>>> 
>>>>> Here's what the problem is about. JavaDoc extracts summaries from doc 
>>>>> comments to place them on documentation pages to assist quick scans by 
>>>>> humans (think Table of Contents with descriptive headings). Since JavaDoc 
>>>>> does not understand the meaning of doc comments, to extract a summary it 
>>>>> relies on a convention [^0] that the first sentence of a doc comment is 
>>>>> that doc comment's summary. The problem is that sometimes JavaDoc gets 
>>>>> that first sentence wrong. For example, according to JavaDoc, the first 
>>>>> sentence of this doc comment for 
>>>>> `GraphicsEnvironment.preferProportionalFonts` [^1]
>>>>> 
>>>>>> Indicates a preference for proportional over non-proportional (e.g. 
>>>>>> dual-spaced CJK fonts) fonts in the mapping of logical fonts to physical 
>>>>>> fonts. If the default mapping contains fonts for which proportional and 
>>>>>> non-proportional variants exist, then calling this method indicates the 
>>>>>> mapping should use a proportional variant.
>>>>> is
>>>>> 
>>>>>> Indicates a preference for proportional over non-proportional (e.g.
>>>>> Now, why does this happen? Unless a more sophisticated mechanism is 
>>>>> requested or the locale's language is not English, JavaDoc uses a simple 
>>>>> "dot-space" algorithm to detect a sentence boundary. That algorithm scans 
>>>>> input from left to right looking for the dot character followed by a 
>>>>> whitespace. While it looks reasonable, in the above case it is clearly 
>>>>> inadequate.
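
To make the failure concrete, here is a minimal sketch of a "dot-space" scan as the message describes it. It is not the actual JavaDoc code; the class and method names (`DotSpaceSketch`, `firstSentence`) are made up for illustration, and the only rule assumed is "first dot followed by whitespace ends the sentence".

```java
// Hypothetical sketch of the "dot-space" rule; not the JavaDoc implementation.
class DotSpaceSketch {

    /**
     * Returns the text up to and including the first '.' that is followed
     * by whitespace, or the whole text if there is no such boundary.
     */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return text.substring(0, i + 1);
            }
        }
        return text;
    }

    public static void main(String[] args) {
        String comment = "Indicates a preference for proportional over "
                + "non-proportional (e.g. dual-spaced CJK fonts) fonts in the "
                + "mapping of logical fonts to physical fonts. If the default "
                + "mapping contains fonts for which proportional and "
                + "non-proportional variants exist, then calling this method "
                + "indicates the mapping should use a proportional variant.";
        // The dot that closes "e.g." is followed by a space, so the scan stops there.
        System.out.println(firstSentence(comment));
    }
}
```

Run on the `preferProportionalFonts` text, the scan stops right after "(e.g.", which is the truncated summary quoted above.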
>>>>> 
>>>>> At this point, the reader might say: "Pfft. I know how to fix this." 
>>>>> Please bear with me and I'll show you that the problem is actually 
>>>>> multilayered. Not only does it include a sentence segmentation algorithm 
>>>>> [^2], but input that the algorithm is fed with, as well as structure and 
>>>>> quality of doc comments the input is created from.
>>>>> 
>>>>> Instead of jumping head-first into augmenting the "dot-space" algorithm 
>>>>> with more heuristics, let's try one more thing. If instructed to do so or 
>>>>> the locale's language is not English, JavaDoc uses `BreakIterator` [^3]. 
>>>>> That `java.text` mechanism is specifically designed to find various 
>>>>> boundaries in text. When `BreakIterator` is turned on (and after 
>>>>> additional tweaking), JavaDoc gets that first sentence about 
>>>>> "proportional fonts" right, however, other issues show up. Consider the 
>>>>> following comment for `FocusTraversalPolicy.getComponentAfter` [^4]:
>>>>> 
>>>>>> Returns the Component that should receive the focus after aComponent. 
>>>>>> aContainer must be a focus cycle root of aComponent or a focus traversal 
>>>>>> policy provider.
>>>>> Here `BreakIterator` thinks that the whole paragraph is a single 
>>>>> sentence. This is because in English sentences begin with capital 
>>>>> letters, whereas "aContainer" begins with a lowercase one. I should 
>>>>> pause here. This is an important moment. While some doc comments may 
>>>>> indeed have typos, irregularities, or quality issues, that doc comment 
>>>>> about "aComponent" has none of those. It's genuine and consists of a 
>>>>> couple of sentences that humans easily recognize but that do not 
>>>>> strictly abide by the rules of English grammar. To me, this 
>>>>> (and other experiments with `BreakIterator` I've done) shows that doc 
>>>>> comments are not your regular prose. Unsurprisingly, even a specialized 
>>>>> text tool doesn't grok it. (Which makes me wonder if that was one of the 
>>>>> reasons why `BreakIterator` is turned off by default.) Add indentation 
>>>>> and markup on top of that and you'll see why the ultimate form that 
>>>>> JavaDoc has to work with is not a string but something like this:
>>>>> 
>>>>>      list size = 10
>>>>>       0 = {DCTree$DCStartElement} "<code>"
>>>>>       1 = {DCTree$DCText} "DOMLocator"
>>>>>       2 = {DCTree$DCEndElement} "</code>"
>>>>>       3 = {DCTree$DCText} " is an interface that describes a location 
>>>>> (e.g.\n where an error occurred).\n "
>>>>>       4 = {DCTree$DCStartElement} "<p>"
>>>>>       5 = {DCTree$DCText} "See also the "
>>>>>       6 = {DCTree$DCStartElement} "<a 
>>>>> href='http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407'>"
>>>>>       7 = {DCTree$DCText} "Document Object Model (DOM) Level 3 Core 
>>>>> Specification"
>>>>>       8 = {DCTree$DCEndElement} "</a>"
>>>>>       9 = {DCTree$DCText} "."
>>>>> 
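
The `BreakIterator` experiment on the "aComponent" comment can be tried outside JavaDoc with a few lines. This is only a sketch using the public `java.text.BreakIterator` API on a plain string; the class name `BreakIteratorSketch` is made up, and it does not reproduce whatever "additional tweaking" JavaDoc applies.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch; uses only the public java.text.BreakIterator API.
class BreakIteratorSketch {

    /** Returns the first sentence of the text as delimited by BreakIterator. */
    static String firstSentence(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        // next() returns DONE when there is no further boundary (e.g. empty text).
        return end == BreakIterator.DONE ? text : text.substring(start, end);
    }

    public static void main(String[] args) {
        String comment = "Returns the Component that should receive the focus "
                + "after aComponent. aContainer must be a focus cycle root of "
                + "aComponent or a focus traversal policy provider.";
        System.out.println(firstSentence(comment, Locale.ENGLISH));
    }
}
```

If the behaviour matches what the message reports, the printed "first sentence" is the whole paragraph, because the word after the first dot starts with a lowercase letter. The exact result may depend on the JDK's break rules.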
>>>>> Continuous text we see on a documentation page [^5] in a browser comes 
>>>>> from a representation such as the above, where the text can be scattered 
>>>>> across various AST nodes. This has interesting implications. Consider the 
>>>>> following doc comment (note the whitespace after `comment.`):
>>>>> 
>>>>>      /** This is the first sentence of this <i>comment. </i> This is the 
>>>>> second sentence. */
>>>>> 
>>>>> Both the simple "dot-space" algorithm and `BreakIterator` fail to extract 
>>>>> the first sentence here, producing the exact same result consisting of both 
>>>>> sentences. When `.` is moved immediately after the closing `</i>`, they 
>>>>> both extract the first sentence correctly. However, the HTML output 
>>>>> breaks (note the absence of closing `</i>`):
>>>>> 
>>>>>      <div class="block">This is the first sentence of this 
>>>>> <i>comment.</div>
>>>>> 
>>>>> This is partly because JavaDoc does not interpret HTML. Instead, it uses 
>>>>> a hybrid approach that applies a sentence segmentation algorithm as an 
>>>>> auxiliary step to individual text nodes (not necessarily the whole text) 
>>>>> while maintaining awareness of the surrounding nodes. The fact that nodes 
>>>>> preserve indentation and formatting of the original doc comment makes 
>>>>> things worse, as whitespace is significant in sentence segmentation. No 
>>>>> wonder JavaDoc hardly sees the forest for the syntax trees! Perhaps, a 
>>>>> more careful way of doing that would be as follows:
>>>>> 
>>>>>    1. Interpret markup as text.
>>>>>    2. Apply sentence segmentation to that text to find the first sentence.
>>>>>    3. Map that first sentence back to markup to accurately extract the 
>>>>> corresponding portion.
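
The three steps above can be sketched as follows. This is an illustration under loud assumptions: `Node` is a made-up stand-in for JavaDoc's doc-tree nodes (not `DCTree`), segmentation is the naive "dot-space" rule, and elements are assumed to be properly nested with no void elements such as `<br>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; Node is a stand-in, not JavaDoc's DCTree.
class MarkupMappingSketch {

    /** A simplified doc-comment node: either a tag such as "<i>" or plain text. */
    static final class Node {
        final boolean isText;
        final String content;

        Node(boolean isText, String content) {
            this.isText = isText;
            this.content = content;
        }
    }

    /** Step 1: interpret markup as text by concatenating the text nodes only. */
    static String plainText(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            if (n.isText) sb.append(n.content);
        }
        return sb.toString();
    }

    /** Step 2: naive "dot-space" segmentation; returns the end offset of the first sentence. */
    static int firstSentenceEnd(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return i + 1;
            }
        }
        return text.length();
    }

    /** Step 3: map that offset back onto the nodes and close any element left open. */
    static String firstSentenceMarkup(List<Node> nodes) {
        int remaining = firstSentenceEnd(plainText(nodes));
        StringBuilder out = new StringBuilder();
        List<String> open = new ArrayList<>(); // names of currently open elements
        for (Node n : nodes) {
            if (remaining == 0) break;
            if (n.isText) {
                int take = Math.min(remaining, n.content.length());
                out.append(n.content, 0, take);
                remaining -= take;
            } else if (n.content.startsWith("</")) {
                out.append(n.content);
                if (!open.isEmpty()) open.remove(open.size() - 1);
            } else {
                out.append(n.content);
                // Element name is whatever follows '<' up to whitespace or '>'.
                open.add(n.content.substring(1, n.content.length() - 1).split("\\s+")[0]);
            }
        }
        for (int i = open.size() - 1; i >= 0; i--) {
            out.append("</").append(open.get(i)).append(">");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Node(true, "This is the first sentence of this "),
                new Node(false, "<i>"),
                new Node(true, "comment. "),
                new Node(false, "</i>"),
                new Node(true, " This is the second sentence."));
        System.out.println(firstSentenceMarkup(nodes));
    }
}
```

On the `<i>comment. </i>` example this yields a first sentence with the `</i>` in place, because the boundary is found in the flattened text and the open element is closed when mapping back. It does nothing for markup whose meaning depends on its context, such as a table cell.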
>>>>> 
>>>>> But even that won't magically solve all the issues as it's not possible 
>>>>> to decompose an arbitrary markup into independent components. Consider 
>>>>> the following doc comment:
>>>>> 
>>>>>      /**
>>>>>       * <table class="comment">
>>>>>       *     <tr>
>>>>>       *        <td><i>Is this the first sentence?</i></td>
>>>>>       *        <td>Is this the second sentence?</td>
>>>>>       *     </tr>
>>>>>       *     <tr>...</tr>
>>>>>       *  </table>
>>>>>       ...
>>>>> 
>>>>> Even if we find that "first sentence", can we safely extract it from its 
>>>>> table-context? And all this is just the structure layer of the problem.
>>>>> 
>>>>> The next layer is ambiguities. Unless extreme measures are taken, those are 
>>>>> only resolvable by a human, sometimes by an expert in the area the 
>>>>> documentation relates to. Using abbreviations such as "etc.", "e.g.", 
>>>>> "i.e.", and "vs." is part of the issue. Early guides [^6] on JavaDoc 
>>>>> advised against using abbreviations. While I can see now one of the 
>>>>> reasons for this advice, people use them anyway. Some might say that 
>>>>> abbreviations can be more succinct and practical. For instance, "etc." is 
>>>>> shorter than "and so on", "and so forth", or "and so on and so forth", 
>>>>> and even pronounced literally as "et cetera" in speech. Non-standard 
>>>>> grammar in abbreviations aggravates the issue. For instance, is "ie" a 
>>>>> misspelt "i.e.", an initialism of Internet Explorer, or a top-level 
>>>>> domain name of The Republic of Ireland? Or is "etc" a misspelt "etc." 
>>>>> or rather that `/etc` directory from the UNIX Filesystem Hierarchy 
>>>>> Standard? (When scanning OpenJDK repo for occurrences of "etc." in 
>>>>> comments, I found that it can be written with the number of dots anywhere 
>>>>> from 0 to 4. The latter could be explained as ellipsis `...` followed by 
>>>>> a dot `.`, faulty keyboard, or perhaps a muscle twitch.)
>>>>> 
>>>>> The final layer is typos and low-quality comments. What proportion of doc 
>>>>> comments follow that convention about the first sentence? What proportion 
>>>>> of comments respect grammar or have a meaningful structure? While we 
>>>>> shouldn't aim for a solution that rights the wrongs of bad comments (i.e. 
>>>>> Garbage In, Garbage Out), this is something to keep in mind:
>>>>> 
>>>>>      /**
>>>>>       * this function draws the border around each tab
>>>>>       * note that this function does now draw the background of the tab.
>>>>>       * that is done elsewhere
>>>>>       ...
>>>>>       */
>>>>>       protected void paintTabBorder(Graphics g, int tabPlacement, ...
>>>>> 
>>>>> There are things we can do to remediate that problem on the doc comments 
>>>>> side of the equation. Reasonable conventions that are adhered to, better 
>>>>> structure of doc comments, or hints. For example, placing a newline or 
>>>>> more than a single whitespace after the first sentence. Or indicating the 
>>>>> summary part of a doc comment with a relatively new `{@summary}` tag. 
>>>>> That said, all of those might have problems of their own. They are 
>>>>> intrusive and require re-documenting the existing code, which is not 
>>>>> always possible. In addition to that, `{@summary}` cannot contain nested 
>>>>> markup, which is quite often used in the summary part. For example
>>>>> 
>>>>>      /**
>>>>>       * Returns the runtime class of this {@code Object}. The returned
>>>>>       * {@code Class} object is the object that is locked by {@code
>>>>>       * static synchronized} methods of the represented class.
>>>>>       ...
>>>>>       */
>>>>>       public final native Class<?> getClass();
>>>>> 
>>>>> or
>>>>> 
>>>>>      /**
>>>>>       * An ordered collection (also known as a <i>sequence</i>).
>>>>>       ...
>>>>>       */
>>>>>      public interface List<E> extends Collection<E> { ...
>>>>> 
>>>>> Whatever solution we choose, there's a risk of playing a 
>>>>> whac-a-mole game. Maybe we should aim for a solution that is good-enough 
>>>>> (possibly, the one that we already have) or reconsider the problem 
>>>>> altogether. For instance, do not extract the first sentence (unless it 
>>>>> can be done reliably). Instead, get the first N characters and indicate 
>>>>> continuation (e.g. using ellipsis `...`), or use the complete 
>>>>> doc-comment, whichever is shorter.
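
The "first N characters" alternative is simple enough to sketch. The names here (`TruncatedSummarySketch`, `summary`, `maxChars`) are hypothetical, and the input is assumed to be plain text with markup already stripped.

```java
// Hypothetical sketch of the "first N characters or the whole comment" idea.
class TruncatedSummarySketch {

    /**
     * Returns the complete comment if it fits in maxChars; otherwise the
     * first maxChars characters followed by an ellipsis.
     */
    static String summary(String docComment, int maxChars) {
        String text = docComment.trim();
        if (text.length() <= maxChars) {
            return text;
        }
        return text.substring(0, maxChars).trim() + "...";
    }

    public static void main(String[] args) {
        System.out.println(summary("An ordered collection (also known as a sequence).", 80));
        System.out.println(summary("this function draws the border around each tab", 13));
    }
}
```

This trades sentence detection for predictability: the cut may land mid-word or mid-sentence, but it never depends on punctuation or capitalization.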
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> To sum up, extracting sentences from a text written in a natural language 
>>>>> is anything but trivial and might require human judgement. When done 
>>>>> programmatically, occasional mistakes are inevitable. Doc comments are 
>>>>> barely text. While they have some structure, they also use formatting, 
>>>>> code, and markup. Hence, without pre-processing text tools might not be 
>>>>> applicable. Though JavaDoc could improve its algorithms and doc comments 
>>>>> could be more friendly, what we have today works surprisingly well on the 
>>>>> OpenJDK codebase. If this is not enough, we could find another way of 
>>>>> extracting a summary or eliminate the need for it completely. That is, 
>>>>> change the presentation in such a way that it won't require summaries.
>>>>> 
>>>>> -Pavel
>>>>> 
>>>>> [^0]: 
>>>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#format
>>>>> [^1]: 
>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/GraphicsEnvironment.html#preferProportionalFonts()
>>>>> [^2]: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
>>>>> [^3]: 
>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/BreakIterator.html
>>>>> [^4]: 
>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/FocusTraversalPolicy.html#getComponentAfter(java.awt.Container,java.awt.Component)
>>>>> [^5]: 
>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.xml/org/w3c/dom/DOMLocator.html
>>>>> [^6]: 
>>>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#styleguide
>>>>> 
>
