Musings on 8232447: The javadoc parser ends the first sentence of a comment too soon

Pavel Rappo Wed, 13 May 2020 07:45:21 -0700

The issue:

    https://bugs.openjdk.java.net/browse/JDK-8232447


The more I think about this issue, the less I feel like solving it. On the one 
hand, the problem is more complicated than it looks. On the other hand, 
solving it doesn't seem to be that important, since it's about making a 
best effort to improve presentation. I'm leaning towards a solution that is 
good enough (possibly, the one that we already have) or reconsidering the 
problem altogether.

Here's what the problem is about. JavaDoc extracts summaries from doc comments 
to place them on documentation pages to assist quick scans by humans (think 
Table of Contents with descriptive headings). Since JavaDoc does not understand 
the meaning of doc comments, to extract a summary it relies on a convention 
[^0] that the first sentence of a doc comment is that doc comment's summary. 
The problem is that sometimes JavaDoc gets that first sentence wrong. For 
example, according to JavaDoc, the first sentence of this doc comment for 
`GraphicsEnvironment.preferProportionalFonts` [^1]

> Indicates a preference for proportional over non-proportional (e.g. 
> dual-spaced CJK fonts) fonts in the mapping of logical fonts to physical 
> fonts. If the default mapping contains fonts for which proportional and 
> non-proportional variants exist, then calling this method indicates the 
> mapping should use a proportional variant.

is

> Indicates a preference for proportional over non-proportional (e.g.

Now, why does this happen? Unless a more sophisticated mechanism is requested 
or the locale's language is not English, JavaDoc uses a simple "dot-space" 
algorithm to detect a sentence boundary. That algorithm scans the input from 
left to right, looking for a dot character followed by whitespace. While it 
looks reasonable, in the above case it is clearly inadequate: the dot that ends 
the abbreviation "e.g." is followed by a space and is mistaken for the end of 
the sentence.
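
To make the failure concrete, here is a minimal sketch of a "dot-space" scan. 
It is not JavaDoc's actual code; the class and method names are made up for 
illustration, and the comment text is shortened.

```java
public class DotSpace {
    /** Returns the text up to and including the first '.' followed by whitespace. */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return text.substring(0, i + 1);
            }
        }
        return text; // no boundary found: treat the whole text as one sentence
    }

    public static void main(String[] args) {
        String comment = "Indicates a preference for proportional over "
                + "non-proportional (e.g. dual-spaced CJK fonts) fonts in the "
                + "mapping of logical fonts to physical fonts. If the default ...";
        System.out.println(firstSentence(comment));
        // prints: Indicates a preference for proportional over non-proportional (e.g.
    }
}
```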

At this point, the reader might say: "Pfft. I know how to fix this." Please 
bear with me and I'll show you that the problem is actually multilayered. It 
involves not only a sentence segmentation algorithm [^2], but also the input 
that the algorithm is fed, as well as the structure and quality of the doc 
comments that input is created from.

Instead of jumping head-first into augmenting the "dot-space" algorithm with 
more heuristics, let's try one more thing. If instructed to do so, or if the 
locale's language is not English, JavaDoc uses `BreakIterator` [^3]. That 
`java.text` mechanism is specifically designed to find various boundaries in 
text. When `BreakIterator` is turned on (and after additional tweaking), 
JavaDoc gets that first sentence about "proportional fonts" right; however, 
other issues show up. Consider the following comment for 
`FocusTraversalPolicy.getComponentAfter` [^4]:

> Returns the Component that should receive the focus after aComponent. 
> aContainer must be a focus cycle root of aComponent or a focus traversal 
> policy provider.
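
For readers who want to try such comments outside of JavaDoc, a minimal sketch 
of driving `BreakIterator` directly might look like this. The class and method 
names are made up for illustration, and the sketch does not include the 
additional tweaking JavaDoc applies, so its results may differ from JavaDoc's.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class FirstSentenceDemo {
    /** Returns the first sentence of the text as seen by BreakIterator. */
    static String firstSentence(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        // A boundary includes trailing whitespace, hence the trim().
        return end == BreakIterator.DONE ? text : text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String fonts = "Indicates a preference for proportional over "
                + "non-proportional (e.g. dual-spaced CJK fonts) fonts in the "
                + "mapping of logical fonts to physical fonts. If the default ...";
        String focus = "Returns the Component that should receive the focus "
                + "after aComponent. aContainer must be a focus cycle root of "
                + "aComponent or a focus traversal policy provider.";
        System.out.println(firstSentence(fonts, Locale.ENGLISH));
        System.out.println(firstSentence(focus, Locale.ENGLISH));
    }
}
```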

For this comment, `BreakIterator` thinks that the whole paragraph is a single 
sentence. This is because in English sentences begin with capital letters, and 
"aContainer", which follows the dot, does not. I should pause here. This is an 
important moment. While some doc comments may indeed have typos, 
irregularities, or quality issues, that doc comment about "aComponent" has none 
of those. It's genuine and consists of a couple of sentences that humans easily 
recognize but that do not strictly abide by the rules of English grammar. To 
me, this (and other experiments with `BreakIterator` I've done) shows that doc 
comments are not your regular prose. Unsurprisingly, even a specialized text 
tool doesn't grok them. (Which makes me wonder if that was one of the reasons 
why `BreakIterator` is turned off by default.) Add indentation and markup on 
top of that and you'll see why the ultimate form that JavaDoc has to work with 
is not a string but something like this:

    list size = 10
     0 = {DCTree$DCStartElement} "<code>"
     1 = {DCTree$DCText} "DOMLocator"
     2 = {DCTree$DCEndElement} "</code>"
     3 = {DCTree$DCText} " is an interface that describes a location (e.g.\n where an error occurred).\n "
     4 = {DCTree$DCStartElement} "<p>"
     5 = {DCTree$DCText} "See also the "
     6 = {DCTree$DCStartElement} "<a href='http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407'>"
     7 = {DCTree$DCText} "Document Object Model (DOM) Level 3 Core Specification"
     8 = {DCTree$DCEndElement} "</a>"
     9 = {DCTree$DCText} "."

The continuous text we see on a documentation page [^5] in a browser comes from 
a representation such as the above, where the text can be scattered across 
various AST nodes. This has interesting implications. Consider the following 
doc comment (note the whitespace after `comment.`):

    /** This is the first sentence of this <i>comment. </i> This is the second sentence. */

Both the simple "dot-space" algorithm and `BreakIterator` fail to extract the 
first sentence here, producing the exact same result consisting of both 
sentences. When `.` is moved immediately after the closing `</i>`, they both 
extract the first sentence correctly. However, the HTML output breaks (note the 
absence of the closing `</i>`):

    <div class="block">This is the first sentence of this <i>comment.</div>

This is partly because JavaDoc does not interpret HTML. Instead, it uses a 
hybrid approach that applies a sentence segmentation algorithm as an auxiliary 
step to individual text nodes (not necessarily the whole text) while 
maintaining awareness of the surrounding nodes. The fact that nodes preserve 
indentation and formatting of the original doc comment makes things worse, as 
whitespace is significant in sentence segmentation. No wonder JavaDoc hardly 
sees the forest for the syntax trees! Perhaps a more careful way of doing that 
would be as follows:

  1. Interpret markup as text.
  2. Apply sentence segmentation to that text to find the first sentence.
  3. Map that first sentence back to markup to accurately extract the 
     corresponding portion.
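
The three steps above can be sketched roughly as follows. This is only an 
illustration under simplifying assumptions: `Node` and `Kind` are hypothetical 
stand-ins for the real AST nodes, markup is "interpreted" by simply dropping 
tags, and the segmentation step is the plain "dot-space" scan.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class SummarySketch {
    enum Kind { START, END, TEXT }

    /** Hypothetical node: an element name for START/END, characters for TEXT. */
    static final class Node {
        final Kind kind;
        final String value;
        Node(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    // Step 1: interpret markup as text (here: keep text nodes, drop tags).
    static String flatten(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            if (n.kind == Kind.TEXT) sb.append(n.value);
        }
        return sb.toString();
    }

    // Step 2: find the end offset (exclusive) of the first sentence.
    static int firstSentenceEnd(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return i + 1;
            }
        }
        return text.length();
    }

    // Step 3: map the offset back to markup, closing any elements left open.
    static String extract(List<Node> nodes, int end) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();
        int consumed = 0; // number of text characters emitted so far
        for (Node n : nodes) {
            if (consumed >= end) break;
            switch (n.kind) {
                case START: out.append('<').append(n.value).append('>'); open.push(n.value); break;
                case END:   out.append("</").append(n.value).append('>'); open.poll(); break;
                default: {
                    int take = Math.min(n.value.length(), end - consumed);
                    out.append(n.value, 0, take);
                    consumed += take;
                }
            }
        }
        while (!open.isEmpty()) out.append("</").append(open.pop()).append('>');
        return out.toString();
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                new Node(Kind.TEXT, "This is the first sentence of this "),
                new Node(Kind.START, "i"),
                new Node(Kind.TEXT, "comment. "),
                new Node(Kind.END, "i"),
                new Node(Kind.TEXT, " This is the second sentence."));
        System.out.println(extract(nodes, firstSentenceEnd(flatten(nodes))));
        // prints: This is the first sentence of this <i>comment.</i>
    }
}
```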

But even that won't magically solve all the issues, as it's not possible to 
decompose arbitrary markup into independent components. Consider the 
following doc comment:

    /**
     * <table class="comment">
     *     <tr>
     *        <td><i>Is this the first sentence?</i></td>
     *        <td>Is this the second sentence?</td>
     *     </tr>
     *     <tr>...</tr>
     *  </table>
     ...

Even if we find that "first sentence", can we safely extract it from its 
table-context? And all this is just the structure layer of the problem. 

The next layer is ambiguity. Unless extreme measures are taken, ambiguities can 
be resolved only by a human, sometimes only by an expert in the area the 
documentation relates to. Using abbreviations such as "etc.", "e.g.", "i.e.", 
and "vs." is part of the issue. Early guides [^6] on JavaDoc advised against 
using abbreviations. While I can now see one of the reasons for this advice, 
people use them anyway. Some might say that abbreviations can be more succinct 
and practical. For instance, "etc." is shorter than "and so on", "and so 
forth", or "and so on and so forth", and is even pronounced literally as "et 
cetera" in speech. Non-standard spelling of abbreviations aggravates the issue. 
For instance, is "ie" a misspelt "i.e.", an initialism of Internet Explorer, or 
the top-level domain of the Republic of Ireland? Is "etc" a misspelt "etc." or 
rather the `/etc` directory from the UNIX Filesystem Hierarchy Standard? (When 
scanning the OpenJDK repo for occurrences of "etc." in comments, I found that 
it can be written with anywhere from 0 to 4 dots. The latter could be explained 
as an ellipsis `...` followed by a dot `.`, a faulty keyboard, or perhaps a 
muscle twitch.)
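
To illustrate why abbreviations are ambiguous rather than merely inconvenient, 
here is a sketch of a "dot-space" scan that skips a small list of known 
abbreviations. The list, the class, and the second example comment are invented 
for illustration; this is not a proposal for what JavaDoc should do.

```java
import java.util.Locale;
import java.util.Set;

public class AbbreviationHeuristic {
    // Assumed, deliberately tiny list; any real list would be a judgement call.
    static final Set<String> ABBREVIATIONS = Set.of("etc.", "e.g.", "i.e.", "vs.");

    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) != '.' || !Character.isWhitespace(text.charAt(i + 1))) {
                continue;
            }
            // Find the start of the word that ends with this dot.
            int start = i;
            while (start > 0
                    && !Character.isWhitespace(text.charAt(start - 1))
                    && text.charAt(start - 1) != '(') {
                start--;
            }
            String word = text.substring(start, i + 1).toLowerCase(Locale.ROOT);
            if (!ABBREVIATIONS.contains(word)) {
                return text.substring(0, i + 1);
            }
        }
        return text;
    }

    public static void main(String[] args) {
        // The abbreviation is skipped, so the real boundary is found.
        System.out.println(firstSentence(
                "A preference for proportional (e.g. dual-spaced) fonts. If the default ..."));
        // prints: A preference for proportional (e.g. dual-spaced) fonts.

        // Here "etc." really does end the sentence, but the heuristic skips it.
        System.out.println(firstSentence(
                "Handles tabs, spaces, etc. Returns null on failure."));
        // prints: Handles tabs, spaces, etc. Returns null on failure.
    }
}
```

The same rule that fixes the first comment breaks the second one, and telling 
the two cases apart requires understanding the text.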

The final layer is typos and low-quality comments. What proportion of doc 
comments follow that convention about the first sentence? What proportion of 
comments respect grammar or have a meaningful structure? While we shouldn't aim 
for a solution that rights the wrongs of bad comments (i.e. Garbage In, Garbage 
Out), this is something to keep in mind:

    /**
     * this function draws the border around each tab
     * note that this function does now draw the background of the tab.
     * that is done elsewhere
     ...
     */
     protected void paintTabBorder(Graphics g, int tabPlacement, ...

There are things we can do to remediate that problem on the doc comments side 
of the equation: reasonable conventions that are adhered to, better structure 
of doc comments, or hints. For example, placing a newline or more than a single 
whitespace after the first sentence, or indicating the summary part of a doc 
comment with the relatively new `{@summary}` tag. That said, all of those might 
have problems of their own. They are intrusive and require re-documenting 
existing code, which is not always possible. In addition to that, `{@summary}` 
cannot contain nested markup, which is quite often used in the summary part. 
For example:

    /**
     * Returns the runtime class of this {@code Object}. The returned
     * {@code Class} object is the object that is locked by {@code
     * static synchronized} methods of the represented class.
     ...
     */
     public final native Class<?> getClass();
     
or

    /**
     * An ordered collection (also known as a <i>sequence</i>).
     ...
     */
    public interface List<E> extends Collection<E> { ...
    
Whatever solution we choose, there's a risk of playing whack-a-mole. Maybe we 
should aim for a solution that is good enough (possibly, the one that we 
already have) or reconsider the problem altogether. For instance, do not 
extract the first sentence (unless it can be done reliably). Instead, take the 
first N characters and indicate continuation (e.g. using an ellipsis `...`), or 
use the complete doc comment, whichever is shorter.
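
A rough sketch of that last alternative, assuming the doc comment has already 
been reduced to plain text. The names are made up, and cutting at a word 
boundary is my own addition to keep the output readable.

```java
public class TruncatedSummary {
    /** Returns the whole comment if it fits in n characters, else a shortened form. */
    static String summary(String comment, int n) {
        // Collapse indentation and line breaks left over from the source.
        String text = comment.trim().replaceAll("\\s+", " ");
        if (text.length() <= n) {
            return text; // the complete doc comment is the shorter option
        }
        // Prefer cutting at the last space at or before position n.
        int cut = text.lastIndexOf(' ', n);
        if (cut <= 0) {
            cut = n; // no space found: cut mid-word
        }
        return text.substring(0, cut) + "...";
    }

    public static void main(String[] args) {
        String focus = "Returns the Component that should receive the focus "
                + "after aComponent. aContainer must be a focus cycle root of "
                + "aComponent or a focus traversal policy provider.";
        System.out.println(summary(focus, 40));
        // prints: Returns the Component that should...
        System.out.println(summary("An ordered collection.", 40));
        // prints: An ordered collection.
    }
}
```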




To sum up, extracting sentences from a text written in a natural language is 
anything but trivial and might require human judgement. When done 
programmatically, occasional mistakes are inevitable. Doc comments are barely 
text. While they have some structure, they also use formatting, code, and 
markup. Hence, without pre-processing, text tools might not be applicable. 
Though JavaDoc could improve its algorithms and doc comments could be more 
friendly, what we have today works surprisingly well on the OpenJDK codebase. 
If this is not enough, we could find another way of extracting a summary or 
eliminate the need for it completely. That is, change the presentation in such 
a way that it won't require summaries.

-Pavel

[^0]: https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#format
[^1]: https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/GraphicsEnvironment.html#preferProportionalFonts()
[^2]: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
[^3]: https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/BreakIterator.html
[^4]: https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/FocusTraversalPolicy.html#getComponentAfter(java.awt.Container,java.awt.Component)
[^5]: https://docs.oracle.com/en/java/javase/14/docs/api/java.xml/org/w3c/dom/DOMLocator.html
[^6]: https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#styleguide
