Re: Musings on 8232447: The javadoc parser ends the first sentence of a comment too soon

Jonathan Gibbons Wed, 13 May 2020 14:02:42 -0700

Agreed that we should have a "Guidelines for writing good doc comments" document, somewhere. I'll leave it to others to decide if it is in scope for the proposed Developers Guide project.


-- Jon


On 5/13/20 12:50 PM, Roger Riggs wrote:

Hi Pavel,
I'd suggest that it is in the scope of the proposed Developers Guide project
to describe how to write specs and documentation for OpenJDK.
Personally, I lean toward the "should" side of things, giving developers
leeway to communicate effectively about their APIs.

Roger


On 5/13/20 2:41 PM, Pavel Rappo wrote:
Thanks for chiming in, Roger.
On 13 May 2020, at 18:30, Roger Riggs <[email protected]> wrote:

Hi,

The first sentence is not just any old sentence.
It has a very specific role to play in the javadoc: both to introduce the class, method, field, etc.
AND to stand independently when used in a summary.
That places a responsibility on the author to craft the sentence for those purposes. The author should review their work in the generated javadoc, the summary tables, etc.
before feeling satisfied and moving on.
IMHO the first sentence should be short and to the point and not include markup or
extra explanatory phrases (such as e.g.).
1. Just to be clear: does this fall into the "SHOULD" or the "MUST" category? If the latter, then this MUST be specified, probably differently than what we have today in the Documentation Comment Specification for the Standard Doclet [^1]:
The first sentence of the initial description should be a summary sentence that contains a concise but complete description of the declared entity. Descriptive text may include HTML tags and entities, and inline tags as described below.
If this is the former, then we need more guidance. Perhaps plenty of examples, including DOs and DON'Ts, as summarizing a complete doc comment into a single sentence can be challenging. Especially if we disallow markup, restrict formatting, and disapprove of familiar tools, such as abbreviations, which are freely used in written language.
Come to think of it, if it is that important then we should think of teaching doclint (or some other tool) to check that.
2. We should think about what to do with doc comments not following those rules (conventions?) in the OpenJDK codebase.
I don't think the tools should try to be as understanding as
the reader or to compensate for the shortcomings of the author.
Neither do I and I believe I made my position clear in that text.

-Pavel
[^1]: https://docs.oracle.com/en/java/javase/14/docs/specs/javadoc/doc-comment-spec.html
$.02, Roger


On 5/13/20 12:20 PM, Jonathan Gibbons wrote:
Pavel,

Good write up. You should link to this from 8232447.

-- Jon

On 5/13/20 7:44 AM, Pavel Rappo wrote:
The issue:

      https://bugs.openjdk.java.net/browse/JDK-8232447
The more I think about this issue, the less I feel like solving it. On the one hand, that problem is more complicated than it looks. On the other hand, solving that problem doesn’t seem to be that important since it’s about making our best effort to improve presentation. I'm leaning towards a solution that is good enough (possibly, the one that we already have) or reconsidering the problem altogether.
Here's what the problem is about. JavaDoc extracts summaries from doc comments to place them on documentation pages to assist quick scans by humans (think Table of Contents with descriptive headings). Since JavaDoc does not understand the meaning of doc comments, to extract a summary it relies on a convention [^0] that the first sentence of a doc comment is that doc comment's summary. The problem is that sometimes JavaDoc gets that first sentence wrong. For example, according to JavaDoc, the first sentence of this doc comment for `GraphicsEnvironment.preferProportionalFonts` [^1]
Indicates a preference for proportional over non-proportional (e.g. dual-spaced CJK fonts) fonts in the mapping of logical fonts to physical fonts. If the default mapping contains fonts for which proportional and non-proportional variants exist, then calling this method indicates the mapping should use a proportional variant.
is
Indicates a preference for proportional over non-proportional (e.g.
Now, why does this happen? Unless a more sophisticated mechanism is requested or the locale's language is not English, JavaDoc uses a simple "dot-space" algorithm to detect a sentence boundary. That algorithm scans input from left to right looking for the dot character followed by whitespace. While it looks reasonable, in the above case it is clearly inadequate.
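To make the failure concrete, here is a minimal sketch of a "dot-space" scan applied to the comment quoted above. It is not JavaDoc's actual implementation; the class and method names (`DotSpaceDemo`, `firstSentence`) are made up for illustration, and the only rule it applies is "a dot followed by a whitespace character ends the sentence".

```java
class DotSpaceDemo {

    /** Returns the text up to and including the first '.' that is followed by whitespace. */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return text.substring(0, i + 1);
            }
        }
        return text; // no boundary found: treat the whole text as one sentence
    }

    public static void main(String[] args) {
        String comment = "Indicates a preference for proportional over non-proportional "
                + "(e.g. dual-spaced CJK fonts) fonts in the mapping of logical "
                + "fonts to physical fonts. If the default mapping contains fonts "
                + "for which proportional and non-proportional variants exist, then "
                + "calling this method indicates the mapping should use a "
                + "proportional variant.";
        // prints: Indicates a preference for proportional over non-proportional (e.g.
        System.out.println(firstSentence(comment));
    }
}
```

The scan stops at the dot that closes "e.g." because that dot happens to be followed by a space, which is exactly the truncated summary shown above.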
At this point, the reader might say: "Pfft. I know how to fix this." Please bear with me and I'll show you that the problem is actually multilayered. Not only does it include a sentence segmentation algorithm [^2], but also the input that the algorithm is fed, as well as the structure and quality of the doc comments that input is created from.
Instead of jumping head-first into augmenting the "dot-space" algorithm with more heuristics, let's try one more thing. If instructed to do so or the locale's language is not English, JavaDoc uses `BreakIterator` [^3]. That `java.text` mechanism is specifically designed to find various boundaries in text. When `BreakIterator` is turned on (and after additional tweaking), JavaDoc gets that first sentence about "proportional fonts" right; however, other issues show up. Consider the following comment for `FocusTraversalPolicy.getComponentAfter` [^4]:
Returns the Component that should receive the focus after aComponent. aContainer must be a focus cycle root of aComponent or a focus traversal policy provider.
Here `BreakIterator` thinks that the whole paragraph is a single sentence. This is because in English sentences begin with capital letters, whereas the second sentence here starts with the lowercase parameter name `aContainer`. I should pause here. This is an important moment. While some doc comments may indeed have typos, irregularities, or quality issues, that doc comment about "aComponent" has none of those. It's genuine and consists of a couple of sentences that humans recognize easily but that do not strictly abide by the rules of English grammar. To me, this (and other experiments with `BreakIterator` I've done) shows that doc comments are not your regular prose. Unsurprisingly, even a specialized text tool doesn't grok it. (Which makes me wonder if that was one of the reasons why `BreakIterator` is turned off by default.) Add indentation and markup on top of that and you'll see why the ultimate form that JavaDoc has to work with is not a string but something like this:
      list size = 10
       0 = {DCTree$DCStartElement} "<code>"
       1 = {DCTree$DCText} "DOMLocator"
       2 = {DCTree$DCEndElement} "</code>"
       3 = {DCTree$DCText} " is an interface that describes a location (e.g.\n where an error occurred).\n "
       4 = {DCTree$DCStartElement} "<p>"
       5 = {DCTree$DCText} "See also the "
       6 = {DCTree$DCStartElement} "<a href='http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407'>"
       7 = {DCTree$DCText} "Document Object Model (DOM) Level 3 Core Specification"
       8 = {DCTree$DCEndElement} "</a>"
       9 = {DCTree$DCText} "."
Continuous text we see on a documentation page [^5] in a browser comes from a representation such as the above, where the text can be scattered across various AST nodes. This has interesting implications. Consider the following doc comment (note the whitespace after `comment.`):
      /** This is the first sentence of this <i>comment. </i> This is the second sentence. */
Both the simple "dot-space" algorithm and `BreakIterator` fail to extract the first sentence here, producing the exact same result consisting of both sentences. When `.` is moved immediately after the closing `</i>`, they both extract the first sentence correctly. However, the HTML output breaks (note the absence of closing `</i>`):
      <div class="block">This is the first sentence of this <i>comment.</div>
This is partly because JavaDoc does not interpret HTML. Instead, it uses a hybrid approach that applies a sentence segmentation algorithm as an auxiliary step to individual text nodes (not necessarily the whole text) while maintaining awareness of the surrounding nodes. The fact that nodes preserve indentation and formatting of the original doc comment makes things worse, as whitespace is significant in sentence segmentation. No wonder JavaDoc hardly sees the forest for the syntax trees! Perhaps a more careful way of doing that would be as follows:
    1. Interpret markup as text.
    2. Apply sentence segmentation to that text to find the first sentence.
    3. Map that first sentence back to markup to accurately extract the corresponding portion.
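The three steps above could be sketched roughly as follows. This is an illustration under strong assumptions, not a proposal for JavaDoc itself: it works on a raw string rather than on `DCTree` nodes, it treats anything between `<` and `>` as a tag, it ignores inline tags such as `{@code}`, and it reuses the naive "dot-space" rule for step 2. The names `MarkupAwareSummary` and `firstSentenceWithMarkup` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class MarkupAwareSummary {

    static String firstSentenceWithMarkup(String html) {
        // Step 1: interpret markup as text, remembering where each
        // plain-text character came from in the original string.
        StringBuilder plain = new StringBuilder();
        List<Integer> sourceIndex = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                plain.append(c);
                sourceIndex.add(i);
            }
        }

        // Step 2: segment the plain text (here: first dot followed by whitespace).
        int end = -1;
        for (int i = 0; i < plain.length() - 1; i++) {
            if (plain.charAt(i) == '.' && Character.isWhitespace(plain.charAt(i + 1))) {
                end = i;
                break;
            }
        }
        if (end < 0) {
            return html; // no boundary found: keep everything
        }

        // Step 3: map the boundary back to the markup and also take any
        // closing tags that follow it (optionally after whitespace), so the
        // extracted fragment does not leave those elements open.
        int cut = sourceIndex.get(end) + 1;
        while (true) {
            int j = cut;
            while (j < html.length() && Character.isWhitespace(html.charAt(j))) {
                j++;
            }
            if (!html.startsWith("</", j)) {
                break;
            }
            int close = html.indexOf('>', j);
            if (close < 0) {
                break;
            }
            cut = close + 1;
        }
        return html.substring(0, cut);
    }

    public static void main(String[] args) {
        System.out.println(firstSentenceWithMarkup(
                "This is the first sentence of this <i>comment. </i> This is the second sentence."));
    }
}
```

For the `<i>comment. </i>` example, the sketch returns the first sentence together with its closing `</i>`. It only handles closing tags that directly follow the boundary, so it is a demonstration of the idea rather than a general solution.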
But even that won't magically solve all the issues as it's not possible to decompose arbitrary markup into independent components. Consider the following doc comment:
      /**
       * <table class="comment">
       *     <tr>
       *        <td><i>Is this the first sentence?</i></td>
       *        <td>Is this the second sentence?</td>
       *     </tr>
       *     <tr>...</tr>
       *  </table>
       ...
Even if we find that "first sentence", can we safely extract it from its table context? And all this is just the structure layer of the problem.
The next layer is ambiguities. Unless extreme measures are taken, those are only resolvable by a human, sometimes by an expert in the area the documentation relates to. Using abbreviations such as "etc.", "e.g.", "i.e.", and "vs." is part of the issue. Early guides [^6] on JavaDoc advised against using abbreviations. While I can now see one of the reasons for this advice, people use them anyway. Some might say that abbreviations can be more succinct and practical. For instance, "etc." is shorter than "and so on", "and so forth", or "and so on and so forth", and even pronounced literally as "et cetera" in speech. Non-standard grammar in abbreviations aggravates the issue. For instance, is "ie" a misspelt "i.e.", an initialism of Internet Explorer, or the top-level domain name of the Republic of Ireland? Or is "etc" a misspelt "etc." or rather that `/etc` directory from the UNIX Filesystem Hierarchy Standard? (When scanning the OpenJDK repo for occurrences of "etc." in comments, I found that it can be written with the number of dots anywhere from 0 to 4. The latter could be explained as an ellipsis `...` followed by a dot `.`, a faulty keyboard, or perhaps a muscle twitch.)
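A small sketch shows why an abbreviation list does not settle the matter. It extends the "dot-space" rule with a skip-list; the list contents and the names `AbbreviationAwareDotSpace`, `firstSentence`, and `endsWithAbbreviation` are assumptions made for this illustration only, and the second sample sentence is invented.

```java
import java.util.Set;

class AbbreviationAwareDotSpace {

    // Illustrative skip-list; any real list would be longer and still incomplete.
    private static final Set<String> ABBREVIATIONS = Set.of("e.g.", "i.e.", "etc.", "vs.");

    /** Dot-space scan that ignores a dot when it ends a known abbreviation. */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.'
                    && Character.isWhitespace(text.charAt(i + 1))
                    && !endsWithAbbreviation(text, i + 1)) {
                return text.substring(0, i + 1);
            }
        }
        return text;
    }

    /** Checks whether the whitespace-delimited token ending at {@code end} is in the skip-list. */
    private static boolean endsWithAbbreviation(String text, int end) {
        int start = end;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        // Drop leading punctuation such as an opening parenthesis.
        String token = text.substring(start, end).replaceFirst("^\\W+", "");
        return ABBREVIATIONS.contains(token);
    }

    public static void main(String[] args) {
        // The skip-list fixes the "(e.g." case ...
        System.out.println(firstSentence("Indicates a preference for proportional over "
                + "non-proportional (e.g. dual-spaced CJK fonts) fonts in the mapping of "
                + "logical fonts to physical fonts. If the default mapping contains fonts."));
        // ... but merges two sentences when the first one really ends in "etc."
        System.out.println(firstSentence("Supports tabs, spaces, etc. Use with care."));
    }
}
```

The first call now yields the full first sentence, while the second returns both sentences as one, because the scan cannot tell an abbreviation in the middle of a sentence from one that ends it.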
The final layer is typos and low-quality comments. What proportion of doc comments follow that convention about the first sentence? What proportion of comments respect grammar or have a meaningful structure? While we shouldn't aim for a solution that rights the wrongs of bad comments (i.e. Garbage In, Garbage Out), this is something to keep in mind:
      /**
       * this function draws the border around each tab
       * note that this function does now draw the background of the tab.
       * that is done elsewhere
       ...
       */
      protected void paintTabBorder(Graphics g, int tabPlacement,...
There are things we can do to remediate that problem on the doc comments side of the equation: reasonable conventions that are adhered to, better structure of doc comments, or hints. For example, placing a newline or more than a single whitespace after the first sentence. Or indicating the summary part of a doc comment with a relatively new `{@summary}` tag. That said, all of those might have problems of their own. They are intrusive and require re-documenting the existing code, which is not always possible. In addition to that, `{@summary}` cannot contain nested markup, which is quite often used in the summary part. For example
      /**
       * Returns the runtime class of this {@code Object}. The returned
       * {@code Class} object is the object that is locked by {@code
       * static synchronized} methods of the represented class.
       ...
       */
      public final native Class<?> getClass();
or

      /**
       * An ordered collection (also known as a <i>sequence</i>).
       ...
       */
      public interface List<E> extends Collection<E> { ...
Whatever solution we choose, there's a risk of playing a whac-a-mole game. Maybe we should aim for a solution that is good enough (possibly, the one that we already have) or reconsider the problem altogether. For instance, do not extract the first sentence (unless it can be done reliably). Instead, get the first N characters and indicate continuation (e.g. using an ellipsis `...`), or use the complete doc comment, whichever is shorter.
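The "first N characters" alternative is simple enough to sketch. The following is an assumption-laden illustration: it operates on plain text with markup already removed, collapses runs of whitespace first, and counts `char` values rather than user-perceived characters. The names `TruncatedSummary` and `summarize` are hypothetical.

```java
class TruncatedSummary {

    /**
     * Returns the whole comment if it fits within maxChars; otherwise the
     * first maxChars characters followed by "..." to indicate continuation.
     */
    static String summarize(String docComment, int maxChars) {
        // Collapse line breaks and indentation so they do not count towards N.
        String text = docComment.replaceAll("\\s+", " ").trim();
        if (text.length() <= maxChars) {
            return text;
        }
        return text.substring(0, maxChars) + "...";
    }

    public static void main(String[] args) {
        System.out.println(summarize("An ordered collection (also known as a sequence).", 20));
        System.out.println(summarize("Short comment.", 20));
    }
}
```

No sentence detection is involved, so the result is predictable, at the cost of summaries that may stop in the middle of a word.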
To sum up, extracting sentences from a text written in a natural language is anything but trivial and might require human judgement. When done programmatically, occasional mistakes are inevitable. Doc comments are barely text. While they have some structure, they also use formatting, code, and markup. Hence, without pre-processing, text tools might not be applicable. Though JavaDoc could improve its algorithms and doc comments could be more friendly, what we have today works surprisingly well on the OpenJDK codebase. If this is not enough, we could find another way of extracting a summary or eliminate the need for it completely. That is, change the presentation in such a way that it won't require summaries.
-Pavel
[^0]: https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#format
[^1]: https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/GraphicsEnvironment.html#preferProportionalFonts()
[^2]: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
[^3]: https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/BreakIterator.html
[^4]: https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/FocusTraversalPolicy.html#getComponentAfter(java.awt.Container,java.awt.Component)
[^5]: https://docs.oracle.com/en/java/javase/14/docs/api/java.xml/org/w3c/dom/DOMLocator.html
[^6]: https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#styleguide
