> On 13 May 2020, at 21:59, Jonathan Gibbons <[email protected]>
> wrote:
>
> Pavel,
>
> You can't put block tags before the main body text. Put another way, each
> block tag consumes all input that follows up to the next block tag. So, while
> we could (now) make @summary a bimodal tag, it definitely would NOT work the
> way you are expecting.
Is it different from what I mentioned right after "On the other hand, I can
imagine inadvertently introducing another sort of errors, due to unterminated
content..."? (Just want to understand if I get that right.) Thanks.
> -- Jon
>
> On 5/13/20 1:49 PM, Pavel Rappo wrote:
>> Jon, here's an idea to ponder. A spin-off of the issue in question. What if
>> we could mitigate the shortcomings of the {@summary} tag by allowing it to
>> be a block tag too? I mean can we make it bimodal?
>>
>> /**
>> ...
>> *
>> * @summary Returns sqrt(<i>x</i><sup>2</sup> +<i>y</i><sup>2</sup>)
>> * without intermediate overflow or underflow.
>> *
>> ...
>> * @since 1.5
>> */
>> public static double hypot(double x, double y)
>>
>> If we do that, it could make @summary a complete solution for any case in
>> the *new* code, no matter how twisted that case is. Authors would get a
>> better tool for structuring doc comments, the ability to use whatever
>> markup or formatting they want in a summary section, and accurate and
>> predictable parsing. I guess it would've been considered for JDK-8173425,
>> had we had bimodal tags back then.
>>
>> On the other hand, I can imagine inadvertently introducing another sort of
>> errors, due to unterminated contents:
>>
>> /**
>> * @summary First sentence and the summary of this doc comment.
>> *
>> * Second sentence. Third sentence. As you can see, there are no other
>> * block tags in that doc comment.
>> */
>> public void f()
>>
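The unterminated-content case follows from the rule that a block tag consumes
everything up to the next block tag. Below is a toy model of that rule, not
javadoc's actual parser; the class and method names are made up for
illustration, and the input is assumed to be already stripped of the leading
`*` characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of block-tag splitting; NOT javadoc's actual parser. */
class BlockTagSketch {

    /**
     * Splits the lines of a doc comment into the main description
     * (index 0) followed by one entry per block tag.
     */
    static List<String> split(List<String> lines) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            // A line that starts with '@' opens a new block tag and
            // thereby terminates whatever came before it.
            if (trimmed.startsWith("@")) {
                parts.add(current.toString().trim());
                current.setLength(0);
            }
            current.append(trimmed).append('\n');
        }
        // Whatever is left belongs to the last block tag (or to the
        // main description if there were no block tags at all).
        parts.add(current.toString().trim());
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = split(List.of(
                "@summary First sentence and the summary of this doc comment.",
                "",
                "Second sentence. Third sentence. As you can see, there are no other",
                "block tags in that doc comment."));
        // Prints an empty main description, then a single @summary
        // block that contains all three sentences.
        parts.forEach(p -> System.out.println("[" + p + "]"));
    }
}
```

In this model nothing ends the `@summary` block except another block tag or
the end of the comment, so the second and third sentences become part of the
summary.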
>> -Pavel
>>
>>> On 13 May 2020, at 20:01, Jonathan Gibbons <[email protected]>
>>> wrote:
>>>
>>>
>>> On 5/13/20 11:41 AM, Pavel Rappo wrote:
>>>> Thanks for chiming in, Roger.
>>>>
>>>>> On 13 May 2020, at 18:30, Roger Riggs <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> The first sentence is not just any old sentence.
>>>>> It has a very specific role to play in the javadoc both to introduce the
>>>>> class, method, field, etc.
>>>>> AND to stand independently when used in a summary.
>>>>> That places a responsibility on the author to craft the sentence for
>>>>> those purposes.
>>>>> The author should review their work in the generated javadoc, the summary
>>>>> tables, etc.
>>>>> before feeling satisfied and moving on.
>>>>> IMHO the first sentence should be short and to the point and not include
>>>>> markup or
>>>>> extra explanatory phrases (such as e.g.).
>>>> 1. Just to be clear. Does this fall into the "SHOULD" or the "MUST"
>>>> category? If the latter, then this MUST be specified. Probably differently
>>>> than what we have today in the Documentation Comment Specification for the
>>>> Standard Doclet [^1]:
>>> SHOULD, not MUST.
>>>>> The first sentence of the initial description should be a summary
>>>>> sentence that contains a concise but complete description of the declared
>>>>> entity. Descriptive text may include HTML tags and entities, and inline
>>>>> tags as described below.
>>>> If this is the former, then we need more guidance. Perhaps plenty of
>>>> examples, including DOs and DON'Ts, as summarizing a complete doc comment
>>>> into a single sentence can be challenging. Especially if we disallow
>>>> markup, restrict formatting, and disapprove of familiar tools, such as
>>>> abbreviations, which are freely used in written language.
>>>>
>>>> Come to think of it, if it is that important then we should think of
>>>> teaching doclint (or some other tool) to check that.
>>> Maybe. doclint was primarily about detecting issues that lead to bad files
>>> being generated, and less about the style of the content. That's not to say
>>> we can't change/update the focus, but IMO style is better addressed with
>>> human processes like reviews and CSR.
>>>> 2. We should think about what to do with doc comments not following those
>>>> rules (conventions?) in the OpenJDK codebase.
>>>>
>>>>> I don't think the tools should try to be as understanding as
>>>>> the reader or to compensate for the shortcomings of the author.
>>>> Neither do I and I believe I made my position clear in that text.
>>>>
>>>> -Pavel
>>>>
>>>> [^1]:
>>>> https://docs.oracle.com/en/java/javase/14/docs/specs/javadoc/doc-comment-spec.html
>>>>
>>>>> $.02, Roger
>>>>>
>>>>>
>>>>> On 5/13/20 12:20 PM, Jonathan Gibbons wrote:
>>>>>> Pavel,
>>>>>>
>>>>>> Good write up. You should link to this from 8232447.
>>>>>>
>>>>>> -- Jon
>>>>>>
>>>>>> On 5/13/20 7:44 AM, Pavel Rappo wrote:
>>>>>>> The issue:
>>>>>>>
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8232447
>>>>>>>
>>>>>>> The more I think about this issue, the less I feel like solving it. On
>>>>>>> the one hand, that problem is more complicated than it looks. On the
>>>>>>> other hand, solving that problem doesn’t seem to be that important
>>>>>>> since it’s about making a best effort to improve presentation. I'm
>>>>>>> leaning towards a solution that is good-enough (possibly, the one that
>>>>>>> we already have) or reconsidering the problem altogether.
>>>>>>>
>>>>>>> Here's what the problem is about. JavaDoc extracts summaries from doc
>>>>>>> comments to place them on documentation pages to assist quick scans by
>>>>>>> humans (think Table of Contents with descriptive headings). Since
>>>>>>> JavaDoc does not understand the meaning of doc comments, to extract a
>>>>>>> summary it relies on a convention [^0] that the first sentence of a doc
>>>>>>> comment is that doc comment's summary. The problem is that sometimes
>>>>>>> JavaDoc gets that first sentence wrong. For example, according to
>>>>>>> JavaDoc, the first sentence of this doc comment for
>>>>>>> `GraphicsEnvironment.preferProportionalFonts` [^1]
>>>>>>>
>>>>>>>> Indicates a preference for proportional over non-proportional (e.g.
>>>>>>>> dual-spaced CJK fonts) fonts in the mapping of logical fonts to
>>>>>>>> physical fonts. If the default mapping contains fonts for which
>>>>>>>> proportional and non-proportional variants exist, then calling this
>>>>>>>> method indicates the mapping should use a proportional variant.
>>>>>>> is
>>>>>>>
>>>>>>>> Indicates a preference for proportional over non-proportional (e.g.
>>>>>>> Now, why does this happen? Unless a more sophisticated mechanism is
>>>>>>> requested or the locale's language is not English, JavaDoc uses a
>>>>>>> simple "dot-space" algorithm to detect a sentence boundary. That
>>>>>>> algorithm scans input from left to right looking for the dot character
>>>>>>> followed by a whitespace. While it looks reasonable, in the above case
>>>>>>> it is clearly inadequate.
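A minimal sketch of the "dot-space" idea as described above. This is an
illustration only, not javadoc's code; the class and method names are
hypothetical.

```java
/** A sketch of a "dot-space" first-sentence finder; not javadoc's code. */
class DotSpaceSketch {

    /** Returns the text up to and including the first '.' followed by whitespace. */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return text.substring(0, i + 1);
            }
        }
        return text; // no boundary found: treat the whole text as one sentence
    }

    public static void main(String[] args) {
        String comment = "Indicates a preference for proportional over "
                + "non-proportional (e.g. dual-spaced CJK fonts) fonts in the "
                + "mapping of logical fonts to physical fonts. If the default ...";
        System.out.println(firstSentence(comment));
        // prints: Indicates a preference for proportional over non-proportional (e.g.
    }
}
```

The scan stops at the dot that ends "e.g." because that is the first dot
followed by whitespace.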
>>>>>>>
>>>>>>> At this point, the reader might say: "Pfft. I know how to fix this."
>>>>>>> Please bear with me and I'll show you that the problem is actually
>>>>>>> multilayered. Not only does it include a sentence segmentation
>>>>>>> algorithm [^2], but also the input that the algorithm is fed, as well
>>>>>>> as the structure and quality of the doc comments that input is created
>>>>>>> from.
>>>>>>>
>>>>>>> Instead of jumping head-first into augmenting the "dot-space" algorithm
>>>>>>> with more heuristics, let's try one more thing. If instructed to do so
>>>>>>> or the locale's language is not English, JavaDoc uses `BreakIterator`
>>>>>>> [^3]. That `java.text` mechanism is specifically designed to find
>>>>>>> various boundaries in text. When `BreakIterator` is turned on (and
>>>>>>> after additional tweaking), JavaDoc gets that first sentence about
>>>>>>> "proportional fonts" right, however, other issues show up. Consider the
>>>>>>> following comment for `FocusTraversalPolicy.getComponentAfter` [^4]:
>>>>>>>
>>>>>>>> Returns the Component that should receive the focus after aComponent.
>>>>>>>> aContainer must be a focus cycle root of aComponent or a focus
>>>>>>>> traversal policy provider.
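One way to try this outside of JavaDoc is to feed the plain text to
`BreakIterator` directly. The sketch below uses only the public `java.text`
API. It skips the markup handling and the "additional tweaking" mentioned
above, so it is an approximation of the experiment rather than what JavaDoc
does; the class name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** A sketch of first-sentence extraction with java.text.BreakIterator. */
class BreakIteratorSketch {

    /** Returns the first sentence of the text as seen by BreakIterator. */
    static String firstSentence(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        // DONE means there is no boundary after the start (e.g. empty text).
        return end == BreakIterator.DONE ? text : text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        System.out.println(firstSentence(
                "Returns the Component that should receive the focus after aComponent. "
                + "aContainer must be a focus cycle root of aComponent or a focus "
                + "traversal policy provider."));
    }
}
```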
>>>>>>> Here `BreakIterator` thinks that the whole paragraph is a single
>>>>>>> sentence. This is because in English sentences begin with capital
>>>>>>> letters, and "aContainer" does not. I should pause here. This is an
>>>>>>> important moment. While some doc comments may indeed have typos,
>>>>>>> irregularities, or quality issues, that doc comment about "aComponent"
>>>>>>> has none of those. It's genuine and consists of a couple of sentences
>>>>>>> that humans easily recognize but that do not strictly abide by the
>>>>>>> rules of English grammar. To me,
>>>>>>> this (and other experiments with `BreakIterator` I've done) shows that
>>>>>>> doc comments are not your regular prose. Unsurprisingly, even a
>>>>>>> specialized text tool doesn't grok it. (Which makes me wonder if that
>>>>>>> was one of the reasons why `BreakIterator` is turned off by default.)
>>>>>>> Add indentation and markup on top of that and you'll see why the
>>>>>>> ultimate form that JavaDoc has to work with is not a string but
>>>>>>> something like this:
>>>>>>>
>>>>>>> list size = 10
>>>>>>> 0 = {DCTree$DCStartElement} "<code>"
>>>>>>> 1 = {DCTree$DCText} "DOMLocator"
>>>>>>> 2 = {DCTree$DCEndElement} "</code>"
>>>>>>> 3 = {DCTree$DCText} " is an interface that describes a location
>>>>>>> (e.g.\n where an error occurred).\n "
>>>>>>> 4 = {DCTree$DCStartElement} "<p>"
>>>>>>> 5 = {DCTree$DCText} "See also the "
>>>>>>> 6 = {DCTree$DCStartElement} "<a
>>>>>>> href='http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407'>"
>>>>>>> 7 = {DCTree$DCText} "Document Object Model (DOM) Level 3 Core
>>>>>>> Specification"
>>>>>>> 8 = {DCTree$DCEndElement} "</a>"
>>>>>>> 9 = {DCTree$DCText} "."
>>>>>>>
>>>>>>> Continuous text we see on a documentation page [^5] in a browser comes
>>>>>>> from a representation such as the above, where the text can be
>>>>>>> scattered across various AST nodes. This has interesting implications.
>>>>>>> Consider the following doc comment (note the whitespace after
>>>>>>> `comment.`):
>>>>>>>
>>>>>>> /** This is the first sentence of this <i>comment. </i> This is
>>>>>>> the second sentence. */
>>>>>>>
>>>>>>> Both the simple "dot-space" algorithm and `BreakIterator` fail to extract
>>>>>>> the first sentence here, producing the exact same result consisting of
>>>>>>> both sentences. When `.` is moved immediately after the closing `</i>`,
>>>>>>> they both extract the first sentence correctly. However, the HTML
>>>>>>> output breaks (note the absence of closing `</i>`):
>>>>>>>
>>>>>>> <div class="block">This is the first sentence of this
>>>>>>> <i>comment.</div>
>>>>>>>
>>>>>>> This is partly because JavaDoc does not interpret HTML. Instead, it
>>>>>>> uses a hybrid approach that applies a sentence segmentation algorithm
>>>>>>> as an auxiliary step to individual text nodes (not necessarily the
>>>>>>> whole text) while maintaining awareness of the surrounding nodes. The
>>>>>>> fact that nodes preserve indentation and formatting of the original doc
>>>>>>> comment makes things worse, as whitespace is significant in sentence
>>>>>>> segmentation. No wonder JavaDoc hardly sees the forest for the syntax
>>>>>>> trees! Perhaps, a more careful way of doing that would be as follows:
>>>>>>>
>>>>>>> 1. Interpret markup as text.
>>>>>>> 2. Apply sentence segmentation to that text to find the first
>>>>>>> sentence.
>>>>>>> 3. Map that first sentence back to markup to accurately extract the
>>>>>>> corresponding portion.
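The three steps above could look roughly like the following sketch. It makes
several assumptions: every `<...>` run is a tag, tags are properly paired (no
unclosed `<p>` or `<br>`), and the "dot-space" rule stands in for the
segmentation step. All names are hypothetical, and this is not how JavaDoc
works today.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** A sketch of "segment on text, then map back to markup". */
class MarkupAwareSummarySketch {

    static String firstSentence(String markup) {
        // Step 1: interpret markup as text, remembering for each text
        // character where it came from in the markup.
        StringBuilder text = new StringBuilder();
        List<Integer> origin = new ArrayList<>();
        int i = 0;
        while (i < markup.length()) {
            char c = markup.charAt(i);
            int close = (c == '<') ? markup.indexOf('>', i) : -1;
            if (close >= 0) {
                i = close + 1;          // skip the whole tag
            } else {
                text.append(c);
                origin.add(i);
                i++;
            }
        }
        // Step 2: find the end of the first sentence in the plain text
        // (here: the first '.' followed by whitespace).
        int end = -1;
        for (int t = 0; t < text.length() - 1; t++) {
            if (text.charAt(t) == '.' && Character.isWhitespace(text.charAt(t + 1))) {
                end = t;
                break;
            }
        }
        if (end < 0) {
            return markup;              // no boundary: keep everything
        }
        // Step 3: map the boundary back to the markup, cut there, and
        // close any elements that are still open at the cut.
        String prefix = markup.substring(0, origin.get(end) + 1);
        return prefix + closingTagsFor(prefix);
    }

    /** Returns closing tags for elements left open in the prefix. */
    static String closingTagsFor(String prefix) {
        Deque<String> open = new ArrayDeque<>();
        int i = 0;
        while ((i = prefix.indexOf('<', i)) >= 0) {
            int close = prefix.indexOf('>', i);
            if (close < 0) {
                break;
            }
            String tag = prefix.substring(i + 1, close).trim();
            if (tag.startsWith("/")) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (!tag.isEmpty() && !tag.endsWith("/")) {
                open.push(tag.split("\\s+")[0]); // element name without attributes
            }
            i = close + 1;
        }
        StringBuilder closing = new StringBuilder();
        while (!open.isEmpty()) {
            closing.append("</").append(open.pop()).append('>');
        }
        return closing.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstSentence(
                "This is the first sentence of this <i>comment. </i> This is the second sentence."));
        // prints: This is the first sentence of this <i>comment.</i>
    }
}
```

Closing the open elements keeps the extracted fragment balanced, but it does
nothing for markup whose meaning depends on its context, such as a table cell.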
>>>>>>>
>>>>>>> But even that won't magically solve all the issues as it's not possible
>>>>>>> to decompose an arbitrary markup into independent components. Consider
>>>>>>> the following doc comment:
>>>>>>>
>>>>>>> /**
>>>>>>> * <table class="comment">
>>>>>>> * <tr>
>>>>>>> * <td><i>Is this the first sentence?</i></td>
>>>>>>> * <td>Is this the second sentence?</td>
>>>>>>> * </tr>
>>>>>>> * <tr>...</tr>
>>>>>>> * </table>
>>>>>>> ...
>>>>>>>
>>>>>>> Even if we find that "first sentence", can we safely extract it from
>>>>>>> its table-context? And all this is just the structure layer of the
>>>>>>> problem.
>>>>>>>
>>>>>>> The next layer is ambiguities. Unless extreme measures are taken, those are
>>>>>>> only resolvable by a human, sometimes by an expert in the area the
>>>>>>> documentation relates to. Using abbreviations such as "etc.", "e.g.",
>>>>>>> "i.e.", and "vs." is part of the issue. Early guides [^6] on JavaDoc
>>>>>>> advised against using abbreviations. While I can see now one of the
>>>>>>> reasons for this advice, people use them anyway. Some might say that
>>>>>>> abbreviations can be more succinct and practical. For instance, "etc."
>>>>>>> is shorter than "and so on", "and so forth", or "and so on and so
>>>>>>> forth", and even pronounced literally as "et cetera" in speech.
>>>>>>> Non-standard grammar in abbreviations aggravates the issue. For
>>>>>>> instance, is "ie" a misspelt "i.e.", an initialism of Internet
>>>>>>> Explorer, or a top-level domain name of The Republic of Ireland? Or is
>>>>>>> "etc" a misspelt "etc." or rather that `/etc` directory from the
>>>>>>> UNIX Filesystem Hierarchy Standard? (When scanning OpenJDK repo for
>>>>>>> occurrences of "etc." in comments, I found that it can be written with
>>>>>>> the number of dots anywhere from 0 to 4. The latter could be explained
>>>>>>> as ellipsis `...` followed by a dot `.`, faulty keyboard, or perhaps a
>>>>>>> muscle twitch.)
>>>>>>>
>>>>>>> The final layer is typos and low-quality comments. What proportion of
>>>>>>> doc comments follow that convention about the first sentence? What
>>>>>>> proportion of comments respect grammar or have a meaningful structure?
>>>>>>> While we shouldn't aim for a solution that rights the wrongs of bad
>>>>>>> comments (i.e. Garbage In, Garbage Out), this is something to keep in
>>>>>>> mind:
>>>>>>>
>>>>>>> /**
>>>>>>> * this function draws the border around each tab
>>>>>>> * note that this function does now draw the background of the tab.
>>>>>>> * that is done elsewhere
>>>>>>> ...
>>>>>>> */
>>>>>>> protected void paintTabBorder(Graphics g, int tabPlacement, ...
>>>>>>>
>>>>>>> There are things we can do to remediate that problem on the doc
>>>>>>> comments side of the equation. Reasonable conventions that are adhered
>>>>>>> to, better structure of doc comments, or hints. For example, placing a
>>>>>>> newline or more than a single whitespace after the first sentence. Or
>>>>>>> indicating the summary part of a doc comment with a relatively new
>>>>>>> `{@summary}` tag. That said, all of those might have problems of their
>>>>>>> own. They are intrusive and require re-documenting the existing code,
>>>>>>> which is not always possible. In addition to that, `{@summary}` cannot
>>>>>>> contain nested markup, which is quite often used in the summary part.
>>>>>>> For example
>>>>>>>
>>>>>>> /**
>>>>>>> * Returns the runtime class of this {@code Object}. The returned
>>>>>>> * {@code Class} object is the object that is locked by {@code
>>>>>>> * static synchronized} methods of the represented class.
>>>>>>> ...
>>>>>>> */
>>>>>>> public final native Class<?> getClass();
>>>>>>> or
>>>>>>>
>>>>>>> /**
>>>>>>> * An ordered collection (also known as a <i>sequence</i>).
>>>>>>> ...
>>>>>>> */
>>>>>>> public interface List<E> extends Collection<E> { ...
>>>>>>> Whatever solution we choose, there's a risk of playing a
>>>>>>> whac-a-mole game. Maybe we should aim for a solution that is
>>>>>>> good-enough (possibly, the one that we already have) or reconsider the
>>>>>>> problem altogether. For instance, do not extract the first sentence
>>>>>>> (unless it can be done reliably). Instead, get the first N characters
>>>>>>> and indicate continuation (e.g. using ellipsis `...`), or use the
>>>>>>> complete doc-comment, whichever is shorter.
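The "first N characters" alternative is simple enough to sketch. The names
and the choice of `...` as the continuation marker are illustrative only, and
the sketch assumes plain text; real doc comments would need their markup
handled before truncation.

```java
/** A sketch of "first N characters, or the whole comment if shorter". */
class TruncatedSummarySketch {

    static String summary(String comment, int limit) {
        if (comment.length() <= limit) {
            return comment;             // the complete comment is shorter
        }
        // Note: substring counts UTF-16 code units, so a cut may split a
        // surrogate pair; a real implementation would have to avoid that.
        return comment.substring(0, limit) + "...";
    }

    public static void main(String[] args) {
        System.out.println(summary("An ordered collection (also known as a sequence).", 20));
        // prints: An ordered collectio...
    }
}
```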
>>>>>>>
>>>>>>> To sum up, extracting sentences from a text written in a natural
>>>>>>> language is anything but trivial and might require human judgement.
>>>>>>> When done programmatically, occasional mistakes are inevitable. Doc
>>>>>>> comments are barely text. While they have some structure, they also use
>>>>>>> formatting, code, and markup. Hence, without pre-processing, text tools
>>>>>>> might not be applicable. Though JavaDoc could improve its algorithms
>>>>>>> and doc comments could be more friendly, what we have today works
>>>>>>> surprisingly well on the OpenJDK codebase. If this is not enough, we
>>>>>>> could find another way of extracting a summary or eliminate the need
>>>>>>> for it completely. That is, change the presentation in such a way that
>>>>>>> it won't require summaries.
>>>>>>>
>>>>>>> -Pavel
>>>>>>>
>>>>>>> [^0]:
>>>>>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#format
>>>>>>> [^1]:
>>>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/GraphicsEnvironment.html#preferProportionalFonts()
>>>>>>> [^2]: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
>>>>>>> [^3]:
>>>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/BreakIterator.html
>>>>>>> [^4]:
>>>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/FocusTraversalPolicy.html#getComponentAfter(java.awt.Container,java.awt.Component)
>>>>>>> [^5]:
>>>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.xml/org/w3c/dom/DOMLocator.html
>>>>>>> [^6]:
>>>>>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#styleguide
>>>>>>>
>