Re: Musings on 8232447: The javadoc parser ends the first sentence of a comment too soon

Pavel Rappo Wed, 13 May 2020 14:05:41 -0700

> On 13 May 2020, at 21:59, Jonathan Gibbons <[email protected]> 
> wrote:
> 
> Pavel,
> 
> You can't put block tags before the main body text.  Put another way, each 
> block tag consumes all input that follows up to the next block tag. So, while 
> we could (now) make @summary a bimodal tag, it definitely would NOT work the 
> way you are expecting.
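
To make the point above concrete, here is a rough Java sketch of how a comment body might be split into a main description and block tags. `BlockTagSketch` and `split` are hypothetical names, and the "a line starting with `@` opens a block tag" rule is a simplifying assumption, not the actual javadoc parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the javadoc parser.
final class BlockTagSketch {

    /** Splits comment text into the main description (index 0) followed by block tags. */
    static List<String> split(String comment) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : comment.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("@")) {
                // A new block tag ends whatever came before it.
                parts.add(current.toString().trim());
                current.setLength(0);
            }
            // Everything else is consumed by the current part.
            current.append(trimmed).append('\n');
        }
        parts.add(current.toString().trim());
        return parts;
    }

    public static void main(String[] args) {
        String comment = "@summary First sentence.\n\nSecond sentence. Third sentence.";
        List<String> parts = split(comment);
        System.out.println("main description: [" + parts.get(0) + "]");
        System.out.println("block tag: [" + parts.get(1) + "]");
    }
}
```

For a comment that starts with `@summary`, this sketch yields an empty main description and a single block tag that swallows every paragraph after it.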


Is it different from what I mentioned right after "On the other hand, I can 
imagine inadvertently introducing another sort of error, due to unterminated 
content..."? (Just want to understand if I get that right.) Thanks.

> -- Jon
> 
> On 5/13/20 1:49 PM, Pavel Rappo wrote:
>> Jon, here's an idea to ponder. A spin-off of the issue in question. What if 
>> we could mitigate the shortcomings of the {@summary} tag by allowing it to 
>> be a block tag too? I mean can we make it bimodal?
>> 
>>     /**
>>      ...
>>      *
>>      * @summary Returns sqrt(<i>x</i><sup>2</sup>&nbsp;+<i>y</i><sup>2</sup>)
>>      * without intermediate overflow or underflow.
>>      *
>>      ...
>>      * @since 1.5
>>      */
>>     public static double hypot(double x, double y)
>> 
>> If we do that, it could make @summary a complete solution for any case in 
>> the *new* code, no matter how twisted that case is. Authors would get a 
>> better tool for structuring doc comments, the ability to use whatever 
>> markup or formatting they want in a summary section, and accurate and 
>> predictable parsing. I guess it would've been considered for JDK-8173425, 
>> had we had bimodal tags back then.
>> 
>> On the other hand, I can imagine inadvertently introducing another sort of 
>> error, due to unterminated content:
>> 
>>     /**
>>      * @summary First sentence and the summary of this doc comment.
>>      *
>>      * Second sentence. Third sentence. As you can see, there are no other
>>      * block tags in that doc comment.
>>      */
>>     public void f()
>> 
>> -Pavel
>> 
>>> On 13 May 2020, at 20:01, Jonathan Gibbons <[email protected]> 
>>> wrote:
>>> 
>>> 
>>> On 5/13/20 11:41 AM, Pavel Rappo wrote:
>>>> Thanks for chiming in, Roger.
>>>> 
>>>>> On 13 May 2020, at 18:30, Roger Riggs <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> The first sentence is not just any old sentence.
>>>>> It has a very specific role to play in the javadoc: both to introduce the 
>>>>> class, method, field, etc.
>>>>> AND to stand independently when used in a summary.
>>>>> That places a responsibility on the author to craft the sentence for 
>>>>> those purposes.
>>>>> The author should review their work in the generated javadoc, the summary 
>>>>> tables, etc.
>>>>> before feeling satisfied and moving on.
>>>>> IMHO the first sentence should be short and to the point and not include 
>>>>> markup or
>>>>> extra explanatory phrases (such as e.g.).
>>>> 1. Just to be clear. Does this fall into the "SHOULD" or the "MUST" 
>>>> category? If the latter, then this MUST be specified. Probably differently 
>>>> from what we have today in the Documentation Comment Specification for the 
>>>> Standard Doclet [^1]:
>>> SHOULD, not MUST.
>>>>> The first sentence of the initial description should be a summary 
>>>>> sentence that contains a concise but complete description of the declared 
>>>>> entity. Descriptive text may include HTML tags and entities, and inline 
>>>>> tags as described below.
>>>> If this is the former, then we need more guidance. Perhaps plenty of 
>>>> examples, including DOs and DON'Ts, as summarizing a complete doc comment 
>>>> into a single sentence can be challenging. Especially if we disallow 
>>>> markup, restrict formatting, and disapprove familiar tools, such as 
>>>> abbreviations, which are freely used in written language.
>>>> 
>>>> Come to think of it, if it is that important then we should think of 
>>>> teaching doclint (or some other tool) to check that.
>>> Maybe. doclint was primarily about detecting issues that lead to bad files 
>>> being generated, and less about the style of the content. That's not to say 
>>> we can't change/update the focus, but IMO style is better addressed with 
>>> human processes like reviews and CSR.
>>>> 2. We should think about what to do with doc comments not following those 
>>>> rules (conventions?) in the OpenJDK codebase.
>>>> 
>>>>> I don't think the tools should try to be as understanding as
>>>>> the reader or to compensate for the shortcomings of the author.
>>>> Neither do I and I believe I made my position clear in that text.
>>>> 
>>>> -Pavel
>>>> 
>>>> [^1]: 
>>>> https://docs.oracle.com/en/java/javase/14/docs/specs/javadoc/doc-comment-spec.html
>>>> 
>>>>> $.02, Roger
>>>>> 
>>>>> 
>>>>> On 5/13/20 12:20 PM, Jonathan Gibbons wrote:
>>>>>> Pavel,
>>>>>> 
>>>>>> Good write up.   You should link to this from 8232447.
>>>>>> 
>>>>>> -- Jon
>>>>>> 
>>>>>> On 5/13/20 7:44 AM, Pavel Rappo wrote:
>>>>>>> The issue:
>>>>>>> 
>>>>>>>      https://bugs.openjdk.java.net/browse/JDK-8232447
>>>>>>> 
>>>>>>> The more I think about this issue, the less I feel like solving it. On 
>>>>>>> the one hand, that problem is more complicated than it looks. On the 
>>>>>>> other hand, solving that problem doesn’t seem to be that important 
>>>>>>> since it’s about making our best effort to improve presentation. I'm 
>>>>>>> leaning towards a solution that is good-enough (possibly, the one that 
>>>>>>> we already have) or reconsidering the problem altogether.
>>>>>>> 
>>>>>>> Here's what the problem is about. JavaDoc extracts summaries from doc 
>>>>>>> comments to place them on documentation pages to assist quick scans by 
>>>>>>> humans (think Table of Contents with descriptive headings). Since 
>>>>>>> JavaDoc does not understand the meaning of doc comments, to extract a 
>>>>>>> summary it relies on a convention [^0] that the first sentence of a doc 
>>>>>>> comment is that doc comment's summary. The problem is that sometimes 
>>>>>>> JavaDoc gets that first sentence wrong. For example, according to 
>>>>>>> JavaDoc, the first sentence of this doc comment for 
>>>>>>> `GraphicsEnvironment.preferProportionalFonts` [^1]
>>>>>>> 
>>>>>>>> Indicates a preference for proportional over non-proportional (e.g. 
>>>>>>>> dual-spaced CJK fonts) fonts in the mapping of logical fonts to 
>>>>>>>> physical fonts. If the default mapping contains fonts for which 
>>>>>>>> proportional and non-proportional variants exist, then calling this 
>>>>>>>> method indicates the mapping should use a proportional variant.
>>>>>>> is
>>>>>>> 
>>>>>>>> Indicates a preference for proportional over non-proportional (e.g.
>>>>>>> Now, why does this happen? Unless a more sophisticated mechanism is 
>>>>>>> requested or the locale's language is not English, JavaDoc uses a 
>>>>>>> simple "dot-space" algorithm to detect a sentence boundary. That 
>>>>>>> algorithm scans input from left to right looking for a dot character 
>>>>>>> followed by whitespace. While it looks reasonable, in the above case 
>>>>>>> it is clearly inadequate.
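
For illustration, the "dot-space" rule described above can be sketched in a few lines of Java. This is a minimal sketch with hypothetical names (`DotSpaceSketch`, `firstSentence`), assuming plain text input; it is not JavaDoc's actual code.

```java
// Hypothetical sketch of a "dot-space" first-sentence rule; not JavaDoc's actual code.
final class DotSpaceSketch {

    /** Returns the text up to and including the first dot that is followed by whitespace. */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return text.substring(0, i + 1);
            }
        }
        return text; // no boundary found: treat the whole text as the sentence
    }

    public static void main(String[] args) {
        String comment = "Indicates a preference for proportional over non-proportional (e.g. "
                + "dual-spaced CJK fonts) fonts in the mapping of logical fonts to "
                + "physical fonts. If the default mapping contains fonts for which "
                + "proportional and non-proportional variants exist, then calling this "
                + "method indicates the mapping should use a proportional variant.";
        System.out.println(firstSentence(comment));
    }
}
```

On the `preferProportionalFonts` text, the scan stops right after `(e.g.`, because that is the first dot followed by whitespace.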
>>>>>>> 
>>>>>>> At this point, the reader might say: "Pfft. I know how to fix this." 
>>>>>>> Please bear with me and I'll show you that the problem is actually 
>>>>>>> multilayered. It involves not only a sentence segmentation 
>>>>>>> algorithm [^2], but also the input that the algorithm is fed, as well 
>>>>>>> as the structure and quality of the doc comments that input is created from.
>>>>>>> 
>>>>>>> Instead of jumping head-first into augmenting the "dot-space" algorithm 
>>>>>>> with more heuristics, let's try one more thing. If instructed to do so 
>>>>>>> or the locale's language is not English, JavaDoc uses `BreakIterator` 
>>>>>>> [^3]. That `java.text` mechanism is specifically designed to find 
>>>>>>> various boundaries in text. When `BreakIterator` is turned on (and 
>>>>>>> after additional tweaking), JavaDoc gets that first sentence about 
>>>>>>> "proportional fonts" right; however, other issues show up. Consider the 
>>>>>>> following comment for `FocusTraversalPolicy.getComponentAfter` [^4]:
>>>>>>> 
>>>>>>>> Returns the Component that should receive the focus after aComponent. 
>>>>>>>> aContainer must be a focus cycle root of aComponent or a focus 
>>>>>>>> traversal policy provider.
>>>>>>> Here `BreakIterator` thinks that the whole paragraph is a single 
>>>>>>> sentence. This is because in English sentences begin with capital 
>>>>>>> letters, whereas the text after the first dot begins with a lowercase 
>>>>>>> "a". I should pause here. This is an important moment. While some 
>>>>>>> doc comments may indeed have typos, irregularities, or quality issues, 
>>>>>>> that doc comment about "aComponent" has none of those. It's genuine and 
>>>>>>> consists of a couple of sentences that humans easily recognize but that 
>>>>>>> do not strictly abide by the rules of English grammar. To me, 
>>>>>>> this (and other experiments with `BreakIterator` I've done) shows that 
>>>>>>> doc comments are not your regular prose. Unsurprisingly, even a 
>>>>>>> specialized text tool doesn't grok it. (Which makes me wonder if that 
>>>>>>> was one of the reasons why `BreakIterator` is turned off by default.) 
>>>>>>> Add indentation and markup on top of that and you'll see why the 
>>>>>>> ultimate form that JavaDoc has to work with is not a string but 
>>>>>>> something like this:
>>>>>>> 
>>>>>>>      list size = 10
>>>>>>>       0 = {DCTree$DCStartElement} "<code>"
>>>>>>>       1 = {DCTree$DCText} "DOMLocator"
>>>>>>>       2 = {DCTree$DCEndElement} "</code>"
>>>>>>>       3 = {DCTree$DCText} " is an interface that describes a location (e.g.\n where an error occurred).\n "
>>>>>>>       4 = {DCTree$DCStartElement} "<p>"
>>>>>>>       5 = {DCTree$DCText} "See also the "
>>>>>>>       6 = {DCTree$DCStartElement} "<a href='http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407'>"
>>>>>>>       7 = {DCTree$DCText} "Document Object Model (DOM) Level 3 Core Specification"
>>>>>>>       8 = {DCTree$DCEndElement} "</a>"
>>>>>>>       9 = {DCTree$DCText} "."
>>>>>>> 
>>>>>>> Continuous text we see on a documentation page [^5] in a browser comes 
>>>>>>> from a representation such as the above, where the text can be 
>>>>>>> scattered across various AST nodes. This has interesting implications. 
>>>>>>> Consider the following doc comment (note the whitespace after 
>>>>>>> `comment.`):
>>>>>>> 
>>>>>>>      /** This is the first sentence of this <i>comment. </i> This is the second sentence. */
>>>>>>> 
>>>>>>> Both the simple "dot-space" algorithm and `BreakIterator` fail to extract 
>>>>>>> the first sentence here, producing the exact same result consisting of 
>>>>>>> both sentences. When the whitespace between `.` and the closing `</i>` is removed, 
>>>>>>> they both extract the first sentence correctly. However, the HTML 
>>>>>>> output breaks (note the absence of closing `</i>`):
>>>>>>> 
>>>>>>>      <div class="block">This is the first sentence of this <i>comment.</div>
>>>>>>> 
>>>>>>> This is partly because JavaDoc does not interpret HTML. Instead, it 
>>>>>>> uses a hybrid approach that applies a sentence segmentation algorithm 
>>>>>>> as an auxiliary step to individual text nodes (not necessarily the 
>>>>>>> whole text) while maintaining awareness of the surrounding nodes. The 
>>>>>>> fact that nodes preserve indentation and formatting of the original doc 
>>>>>>> comment makes things worse, as whitespace is significant in sentence 
>>>>>>> segmentation. No wonder JavaDoc hardly sees the forest for the syntax 
>>>>>>> trees! Perhaps, a more careful way of doing that would be as follows:
>>>>>>> 
>>>>>>>    1. Interpret markup as text.
>>>>>>>    2. Apply sentence segmentation to that text to find the first 
>>>>>>> sentence.
>>>>>>>    3. Map that first sentence back to markup to accurately extract the 
>>>>>>> corresponding portion.
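
The three steps above can be roughly sketched as follows. This is only a sketch under loud assumptions: `FirstSentenceSketch` is a hypothetical name, markup is "interpreted" by naively dropping anything between `<` and `>` (no entities, no inline tags), and `BreakIterator.getSentenceInstance` stands in for whatever segmentation algorithm is chosen.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch; names are illustrative and not part of JavaDoc.
final class FirstSentenceSketch {

    /** Returns the prefix of {@code markup} that covers the first sentence of its text. */
    static String firstSentence(String markup) {
        // Step 1: interpret markup as text by dropping tags (naively: anything in <...>),
        // remembering which markup position each kept character came from.
        StringBuilder text = new StringBuilder();
        int[] origin = new int[markup.length()];
        boolean inTag = false;
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                origin[text.length()] = i;
                text.append(c);
            }
        }

        // Step 2: apply sentence segmentation to the plain text.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text.toString());
        it.first();
        int end = it.next();
        if (end == BreakIterator.DONE) {
            return markup; // no text at all
        }
        while (end > 0 && Character.isWhitespace(text.charAt(end - 1))) {
            end--; // drop whitespace trailing the sentence
        }
        if (end == 0) {
            return "";
        }

        // Step 3: map the last character of the sentence back to the markup.
        return markup.substring(0, origin[end - 1] + 1);
    }

    public static void main(String[] args) {
        System.out.println(firstSentence(
                "This is the first sentence of this <i>comment. </i> This is the second sentence."));
    }
}
```

Note that the returned fragment can still end inside an open element, so closing any elements left open would have to be handled separately.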
>>>>>>> 
>>>>>>> But even that won't magically solve all the issues as it's not possible 
>>>>>>> to decompose arbitrary markup into independent components. Consider 
>>>>>>> the following doc comment:
>>>>>>> 
>>>>>>>      /**
>>>>>>>       * <table class="comment">
>>>>>>>       *     <tr>
>>>>>>>       *        <td><i>Is this the first sentence?</i></td>
>>>>>>>       *        <td>Is this the second sentence?</td>
>>>>>>>       *     </tr>
>>>>>>>       *     <tr>...</tr>
>>>>>>>       *  </table>
>>>>>>>       ...
>>>>>>> 
>>>>>>> Even if we find that "first sentence", can we safely extract it from 
>>>>>>> its table-context? And all this is just the structure layer of the 
>>>>>>> problem.
>>>>>>> 
>>>>>>> The next layer is ambiguities. Unless extreme measures are taken, those are 
>>>>>>> only resolvable by a human, sometimes by an expert in the area the 
>>>>>>> documentation relates to. Using abbreviations such as "etc.", "e.g.", 
>>>>>>> "i.e.", and "vs." is part of the issue. Early guides [^6] on JavaDoc 
>>>>>>> advised against using abbreviations. While I can see now one of the 
>>>>>>> reasons for this advice, people use them anyway. Some might say that 
>>>>>>> abbreviations can be more succinct and practical. For instance, "etc." 
>>>>>>> is shorter than "and so on", "and so forth", or "and so on and so 
>>>>>>> forth", and even pronounced literally as "et cetera" in speech. 
>>>>>>> Non-standard grammar in abbreviations aggravates the issue. For 
>>>>>>> instance, is "ie" a misspelt "i.e.", an initialism of Internet 
>>>>>>> Explorer, or a top-level domain name of The Republic of Ireland? Or is 
>>>>>>> "etc" a misspelt "etc." or rather that `/etc` directory from the 
>>>>>>> UNIX Filesystem Hierarchy Standard? (When scanning OpenJDK repo for 
>>>>>>> occurrences of "etc." in comments, I found that it can be written with 
>>>>>>> the number of dots anywhere from 0 to 4. The latter could be explained 
>>>>>>> as ellipsis `...` followed by a dot `.`, faulty keyboard, or perhaps a 
>>>>>>> muscle twitch.)
>>>>>>> 
>>>>>>> The final layer is typos and low-quality comments. What proportion of 
>>>>>>> doc comments follow that convention about the first sentence? What 
>>>>>>> proportion of comments respect grammar or have a meaningful structure? 
>>>>>>> While we shouldn't aim for a solution that rights the wrongs of bad 
>>>>>>> comments (i.e. Garbage In, Garbage Out), this is something to keep in 
>>>>>>> mind:
>>>>>>> 
>>>>>>>      /**
>>>>>>>       * this function draws the border around each tab
>>>>>>>       * note that this function does now draw the background of the tab.
>>>>>>>       * that is done elsewhere
>>>>>>>       ...
>>>>>>>       */
>>>>>>>       protected void paintTabBorder(Graphics g, int tabPlacement, ...
>>>>>>> 
>>>>>>> There are things we can do to remediate that problem on the doc 
>>>>>>> comments side of the equation. Reasonable conventions that are adhered 
>>>>>>> to, better structure of doc comments, or hints. For example, placing a 
>>>>>>> newline or more than a single whitespace after the first sentence. Or 
>>>>>>> indicating the summary part of a doc comment with a relatively new 
>>>>>>> `{@summary}` tag. That said, all of those might have problems of their 
>>>>>>> own. They are intrusive and require re-documenting the existing code, 
>>>>>>> which is not always possible. In addition to that, `{@summary}` cannot 
>>>>>>> contain nested markup, which is quite often used in the summary part. 
>>>>>>> For example
>>>>>>> 
>>>>>>>      /**
>>>>>>>       * Returns the runtime class of this {@code Object}. The returned
>>>>>>>       * {@code Class} object is the object that is locked by {@code
>>>>>>>       * static synchronized} methods of the represented class.
>>>>>>>       ...
>>>>>>>       */
>>>>>>>       public final native Class<?> getClass();
>>>>>>> 
>>>>>>> or
>>>>>>> 
>>>>>>>      /**
>>>>>>>       * An ordered collection (also known as a <i>sequence</i>).
>>>>>>>       ...
>>>>>>>       */
>>>>>>>      public interface List<E> extends Collection<E> { ...
>>>>>>> 
>>>>>>> Whatever solution we choose, there's a risk of playing a 
>>>>>>> whac-a-mole game. Maybe we should aim for a solution that is 
>>>>>>> good-enough (possibly, the one that we already have) or reconsider the 
>>>>>>> problem altogether. For instance, do not extract the first sentence 
>>>>>>> (unless it can be done reliably). Instead, get the first N characters 
>>>>>>> and indicate continuation (e.g. using ellipsis `...`), or use the 
>>>>>>> complete doc-comment, whichever is shorter.
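
The "first N characters" alternative above could look something like this. It is a minimal sketch: `TruncatedSummarySketch` and `summary` are hypothetical names, and cutting at the last space before the limit is my own assumption to avoid splitting a word, not something proposed in the thread.

```java
// Hypothetical sketch of a length-based summary; not part of JavaDoc.
final class TruncatedSummarySketch {

    /** Returns the whole comment if it fits in maxChars, else a truncated prefix plus "...". */
    static String summary(String comment, int maxChars) {
        String text = comment.trim();
        if (text.length() <= maxChars) {
            return text; // the complete doc comment is the shorter option
        }
        // Assumption: prefer to cut at the last space at or before the limit.
        int cut = text.lastIndexOf(' ', maxChars);
        if (cut <= 0) {
            cut = maxChars; // no usable space: cut mid-word
        }
        return text.substring(0, cut) + "...";
    }

    public static void main(String[] args) {
        System.out.println(summary("Returns the runtime class of this Object.", 20));
        System.out.println(summary("Short.", 20));
    }
}
```

This sidesteps sentence detection entirely, at the cost of summaries that may stop mid-thought.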
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> To sum up, extracting sentences from a text written in a natural 
>>>>>>> language is anything but trivial and might require human judgement. 
>>>>>>> When done programmatically, occasional mistakes are inevitable. Doc 
>>>>>>> comments are barely text. While they have some structure, they also use 
>>>>>>> formatting, code, and markup. Hence, without pre-processing, text tools 
>>>>>>> might not be applicable. Though JavaDoc could improve its algorithms 
>>>>>>> and doc comments could be more friendly, what we have today works 
>>>>>>> surprisingly well on the OpenJDK codebase. If this is not enough, we 
>>>>>>> could find another way of extracting a summary or eliminate the need 
>>>>>>> for it completely. That is, change the presentation in such a way that 
>>>>>>> it won't require summaries.
>>>>>>> 
>>>>>>> -Pavel
>>>>>>> 
>>>>>>> [^0]: 
>>>>>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#format
>>>>>>> [^1]: 
>>>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/GraphicsEnvironment.html#preferProportionalFonts()
>>>>>>> [^2]: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
>>>>>>> [^3]: 
>>>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/BreakIterator.html
>>>>>>> [^4]: 
>>>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/FocusTraversalPolicy.html#getComponentAfter(java.awt.Container,java.awt.Component)
>>>>>>> [^5]: 
>>>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.xml/org/w3c/dom/DOMLocator.html
>>>>>>> [^6]: 
>>>>>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#styleguide
>>>>>>> 
>
