Agreed that we should have a "Guidelines for writing good doc comments"
document, somewhere.
I'll leave it to others to decide if it is in scope for the proposed
Developers Guide project.
-- Jon
On 5/13/20 12:50 PM, Roger Riggs wrote:
Hi Pavel,
I'd suggest that it is in the scope of the proposed Developers Guide
project
to describe how to write specs and documentation for OpenJDK.
Personally, I lean toward the "should" side of things, giving developers
leeway to communicate effectively about their APIs.
Roger
On 5/13/20 2:41 PM, Pavel Rappo wrote:
Thanks for chiming in, Roger.
On 13 May 2020, at 18:30, Roger Riggs <[email protected]> wrote:
Hi,
The first sentence is not just any old sentence.
It has a very specific role to play in the javadoc both to introduce
the class, method, field, etc.
AND to stand independently when used in a summary.
That places a responsibility on the author to craft the sentence for
those purposes.
The author should review their work in the generated javadoc, the
summary tables, etc.
before feeling satisfied and moving on.
IMHO the first sentence should be short and to the point and not
include markup or
extra explanatory phrases (such as e.g.).
1. Just to be clear. Does this fall into the "SHOULD" or the "MUST"
category? If the latter, then this MUST be specified. Probably
differently than what we have today in the Documentation Comment
Specification for the Standard Doclet [^1]:
The first sentence of the initial description should be a summary
sentence that contains a concise but complete description of the
declared entity. Descriptive text may include HTML tags and
entities, and inline tags as described below.
If this is the former, then we need more guidance. Perhaps plenty of
examples, including DOs and DON'Ts, as summarizing a complete doc
comment into a single sentence can be challenging. Especially if we
disallow markup, restrict formatting, and disapprove of familiar tools,
such as abbreviations, which are freely used in written language.
Come to think of it, if it is that important then we should think of
teaching doclint (or some other tool) to check that.
2. We should think about what to do with doc comments not following
those rules (conventions?) in the OpenJDK codebase.
I don't think the tools should try to be as understanding as
the reader or to compensate for the shortcomings of the author.
Neither do I and I believe I made my position clear in that text.
-Pavel
[^1]:
https://docs.oracle.com/en/java/javase/14/docs/specs/javadoc/doc-comment-spec.html
$.02, Roger
On 5/13/20 12:20 PM, Jonathan Gibbons wrote:
Pavel,
Good write up. You should link to this from 8232447.
-- Jon
On 5/13/20 7:44 AM, Pavel Rappo wrote:
The issue:
https://bugs.openjdk.java.net/browse/JDK-8232447
The more I think about this issue, the less I feel like solving
it. On the one hand, that problem is more complicated than it
looks. On the other hand, solving that problem doesn’t seem to be
that important since it’s about making our best-effort to improve
presentation. I'm leaning towards a solution that is good-enough
(possibly, the one that we already have) or reconsidering the
problem altogether.
Here's what the problem is about. JavaDoc extracts summaries from
doc comments to place them on documentation pages to assist quick
scans by humans (think Table of Contents with descriptive
headings). Since JavaDoc does not understand the meaning of doc
comments, to extract a summary it relies on a convention [^0] that
the first sentence of a doc comment is that doc comment's summary.
The problem is that sometimes JavaDoc gets that first sentence
wrong. For example, according to JavaDoc, the first sentence of
this doc comment for `GraphicsEnvironment.preferProportionalFonts`
[^1]
Indicates a preference for proportional over non-proportional
(e.g. dual-spaced CJK fonts) fonts in the mapping of logical
fonts to physical fonts. If the default mapping contains fonts
for which proportional and non-proportional variants exist, then
calling this method indicates the mapping should use a
proportional variant.
is
Indicates a preference for proportional over non-proportional (e.g.
Now, why does this happen? Unless a more sophisticated mechanism
is requested or the locale's language is not English, JavaDoc uses
a simple "dot-space" algorithm to detect a sentence boundary. That
algorithm scans input from left to right looking for the dot
character followed by a whitespace. While it looks reasonable, in
the above case it is clearly inadequate.
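As a rough illustration, the "dot-space" rule described above could be sketched as follows. This is not JavaDoc's actual code: the class and method names are made up for the example, and the input is a plain string rather than a parsed doc comment.

```java
final class DotSpaceDemo {

    /** Returns the text up to and including the first '.' that is followed by whitespace. */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return text.substring(0, i + 1);
            }
        }
        return text; // no boundary found: treat the whole text as one sentence
    }

    public static void main(String[] args) {
        String comment = "Indicates a preference for proportional over non-proportional "
                + "(e.g. dual-spaced CJK fonts) fonts in the mapping of logical "
                + "fonts to physical fonts. If the default mapping contains fonts "
                + "for which proportional and non-proportional variants exist, then "
                + "calling this method indicates the mapping should use a "
                + "proportional variant.";
        // The first dot followed by whitespace is the one that ends "e.g.",
        // so the result is cut off in the middle of the parenthetical.
        System.out.println(firstSentence(comment));
    }
}
```

Running `main` prints the same truncated text as quoted above, ending in `(e.g.`.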
At this point, the reader might say: "Pfft. I know how to fix
this." Please bear with me and I'll show you that the problem is
actually multilayered. It involves not only a sentence
segmentation algorithm [^2], but also the input that the algorithm is
fed, as well as the structure and quality of the doc comments that
input is created from.
Instead of jumping head-first into augmenting the "dot-space"
algorithm with more heuristics, let's try one more thing. If
instructed to do so or the locale's language is not English,
JavaDoc uses `BreakIterator` [^3]. That `java.text` mechanism is
specifically designed to find various boundaries in text. When
`BreakIterator` is turned on (and after additional tweaking),
JavaDoc gets that first sentence about "proportional fonts" right,
however, other issues show up. Consider the following comment for
`FocusTraversalPolicy.getComponentAfter` [^4]:
Returns the Component that should receive the focus after
aComponent. aContainer must be a focus cycle root of aComponent
or a focus traversal policy provider.
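For anyone who wants to try this outside JavaDoc, here is a minimal sketch that asks `BreakIterator` for the first sentence of a plain string. The class and method names are invented for the example, and JavaDoc itself feeds `BreakIterator` differently (per text node, with the original line breaks), so results need not match what the tool produces. The boundaries reported also depend on the rules shipped with the JDK in use.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BreakIteratorDemo {

    /** Returns the first sentence of the text as reported by BreakIterator. */
    static String firstSentence(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        // next() returns DONE when there is no further boundary (e.g. empty text)
        return end == BreakIterator.DONE ? text : text.substring(start, end);
    }

    public static void main(String[] args) {
        String comment = "Returns the Component that should receive the focus after "
                + "aComponent. aContainer must be a focus cycle root of aComponent "
                + "or a focus traversal policy provider.";
        System.out.println(firstSentence(comment, Locale.ENGLISH));
    }
}
```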
Here `BreakIterator` thinks that the whole paragraph is a single
sentence. This is because in English sentences begin with capital
letters, and "aContainer" begins with a lowercase one. I should pause
here. This is an important moment. While
some doc comments may indeed have typos, irregularities, or
quality issues, that doc comment about "aComponent" has none of
those. It's genuine and consists of a couple of sentences that are
easily recognizable by humans but do not strictly abide by the
rules of English Grammar. To me, this (and other experiments with
`BreakIterator` I've done) shows that doc comments are not your
regular prose. Unsurprisingly, even a specialized text tool
doesn't grok it. (Which makes me wonder if that was one of the
reasons why `BreakIterator` is turned off by default.) Add
indentation and markup on top of that and you'll see why the
ultimate form that JavaDoc has to work with is not a string but
something like this:
list size = 10
0 = {DCTree$DCStartElement} "<code>"
1 = {DCTree$DCText} "DOMLocator"
2 = {DCTree$DCEndElement} "</code>"
3 = {DCTree$DCText} " is an interface that describes a
location (e.g.\n where an error occurred).\n "
4 = {DCTree$DCStartElement} "<p>"
5 = {DCTree$DCText} "See also the "
6 = {DCTree$DCStartElement} "<a
href='http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407'>"
7 = {DCTree$DCText} "Document Object Model (DOM) Level 3
Core Specification"
8 = {DCTree$DCEndElement} "</a>"
9 = {DCTree$DCText} "."
Continuous text we see on a documentation page [^5] in a browser
comes from a representation such as the above, where the text can
be scattered across various AST nodes. This has interesting
implications. Consider the following doc comment (note the
whitespace after `comment.`):
/** This is the first sentence of this <i>comment. </i> This
is the second sentence. */
Both the simple "dot-space" algorithm and `BreakIterator` fail to
extract the first sentence here, producing the exact same result
consisting of both sentences. When `.` is moved immediately after
the closing `</i>`, they both extract the first sentence
correctly. However, the HTML output breaks (note the absence of
closing `</i>`):
<div class="block">This is the first sentence of this
<i>comment.</div>
This is partly because JavaDoc does not interpret HTML. Instead,
it uses a hybrid approach that applies a sentence segmentation
algorithm as an auxiliary step to individual text nodes (not
necessarily the whole text) while maintaining awareness of the
surrounding nodes. The fact that nodes preserve indentation and
formatting of the original doc comment makes things worse, as
whitespace is significant in sentence segmentation. No wonder
JavaDoc hardly sees the forest for the syntax trees! Perhaps, a
more careful way of doing that would be as follows:
1. Interpret markup as text.
2. Apply sentence segmentation to that text to find the first
sentence.
3. Map that first sentence back to markup to accurately
extract the corresponding portion.
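The three steps above could be sketched roughly like this. Everything here is an assumption made for illustration: the class name, the naive `<...>` tag scanning (no entities, no inline tags such as `{@code}`, no void elements like `<br>`), and the use of the "dot-space" rule as a stand-in for step 2. The one thing it tries to show is that mapping back to markup lets us close tags that are still open at the cut point.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class SummaryExtractor {

    static String firstSentenceMarkup(String markup) {
        // Step 1: interpret markup as text, remembering where each character came from.
        StringBuilder plain = new StringBuilder();
        int[] origin = new int[markup.length()]; // origin[p] = index in markup of plain char p
        boolean inTag = false;
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                origin[plain.length()] = i;
                plain.append(c);
            }
        }

        // Step 2: find the end of the first sentence in the plain text
        // (stand-in rule: a dot followed by whitespace or by the end of the text).
        int end = -1;
        for (int p = 0; p < plain.length(); p++) {
            boolean last = p == plain.length() - 1;
            if (plain.charAt(p) == '.' && (last || Character.isWhitespace(plain.charAt(p + 1)))) {
                end = p;
                break;
            }
        }
        if (end < 0) {
            return markup; // no boundary found: keep everything
        }

        // Step 3: map the boundary back to the markup and close any tags left open.
        int cut = origin[end] + 1;
        Deque<String> open = new ArrayDeque<>();
        int i = 0;
        while (i < cut) {
            if (markup.charAt(i) == '<') {
                int close = markup.indexOf('>', i);
                if (close < 0) {
                    break;
                }
                String body = markup.substring(i + 1, close).trim();
                if (body.startsWith("/")) {
                    if (!open.isEmpty()) {
                        open.pop();
                    }
                } else if (!body.isEmpty() && !body.endsWith("/")) {
                    open.push(body.split("\\s+")[0]); // tag name without attributes
                }
                i = close + 1;
            } else {
                i++;
            }
        }
        StringBuilder out = new StringBuilder(markup.substring(0, cut));
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstSentenceMarkup(
                "This is the first sentence of this <i>comment. </i> This is the second sentence."));
    }
}
```

For the `<i>comment. </i>` example this yields `This is the first sentence of this <i>comment.</i>`, that is, the first sentence with its `</i>` restored.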
But even that won't magically solve all the issues as it's not
possible to decompose arbitrary markup into independent
components. Consider the following doc comment:
/**
* <table class="comment">
* <tr>
* <td><i>Is this the first sentence?</i></td>
* <td>Is this the second sentence?</td>
* </tr>
* <tr>...</tr>
* </table>
...
Even if we find that "first sentence", can we safely extract it
from its table-context? And all this is just the structure layer
of the problem.
The next layer is ambiguities. Unless extreme measures are taken, those
are only resolvable by a human, sometimes by an expert in the area
the documentation relates to. Using abbreviations such as "etc.",
"e.g.", "i.e.", and "vs." is part of the issue. Early guides [^6]
on JavaDoc advised against using abbreviations. While I can see
now one of the reasons for this advice, people use them anyway.
Some might say that abbreviations can be more succinct and
practical. For instance, "etc." is shorter than "and so on", "and
so forth", or "and so on and so forth", and even pronounced
literally as "et cetera" in speech. Non-standard grammar in
abbreviations aggravates the issue. For instance, is "ie" a
misspelt "i.e.", an initialism of Internet Explorer, or a
top-level domain name of The Republic of Ireland? Or is "etc" a
misspelt "etc." or rather that `/etc` directory from the UNIX
Filesystem Hierarchy Standard? (When scanning the OpenJDK repo for
occurrences of "etc." in comments, I found that it can be written
with the number of dots anywhere from 0 to 4. The latter could be
explained as ellipsis `...` followed by a dot `.`, faulty
keyboard, or perhaps a muscle twitch.)
The final layer is typos and low-quality comments. What proportion
of doc comments follow that convention about the first sentence?
What proportion of comments respect grammar or have a meaningful
structure? While we shouldn't aim for a solution that rights the
wrongs of bad comments (i.e. Garbage In, Garbage Out), this is
something to keep in mind:
/**
* this function draws the border around each tab
* note that this function does now draw the background of the tab.
* that is done elsewhere
...
*/
protected void paintTabBorder(Graphics g, int tabPlacement,
...
There are things we can do to remediate that problem on the doc
comments side of the equation: reasonable conventions that are
adhered to, better structure of doc comments, or hints. For
example, placing a newline or more than a single whitespace after
the first sentence. Or indicating the summary part of a doc
comment with a relatively new `{@summary}` tag. That said, all of
those might have problems of their own. They are intrusive and
require re-documenting the existing code, which is not always
possible. In addition to that, `{@summary}` cannot contain nested
markup, which is quite often used in the summary part. For example
/**
* Returns the runtime class of this {@code Object}. The returned
* {@code Class} object is the object that is locked by {@code
* static synchronized} methods of the represented class.
...
*/
public final native Class<?> getClass();
or
/**
* An ordered collection (also known as a <i>sequence</i>).
...
*/
public interface List<E> extends Collection<E> { ...
Whatever solution we choose, there's a risk of playing a
whac-a-mole game. Maybe we should aim for a solution that is
good-enough (possibly, the one that we already have) or reconsider
the problem altogether. For instance, do not extract the first
sentence (unless it can be done reliably). Instead, get the first
N characters and indicate continuation (e.g. using ellipsis
`...`), or use the complete doc-comment, whichever is shorter.
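A minimal sketch of that last alternative, under the assumption that the doc comment has already been reduced to plain text (markup is not handled here). The class name, method name, and the choice of collapsing whitespace are mine, not anything JavaDoc does today.

```java
final class SummaryFallback {

    /**
     * Returns the whole text if it fits into maxChars, otherwise its first
     * maxChars characters followed by "...".
     */
    static String summarize(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        // Collapse the indentation and line breaks of the original comment.
        String normalized = text.trim().replaceAll("\\s+", " ");
        if (normalized.length() <= maxChars) {
            return normalized;
        }
        int cut = maxChars;
        // Avoid cutting a surrogate pair in half.
        if (Character.isHighSurrogate(normalized.charAt(cut - 1))) {
            cut--;
        }
        return normalized.substring(0, cut) + "...";
    }

    public static void main(String[] args) {
        String comment = "this function draws the border around each tab\n"
                + " note that this function does now draw the background of the tab.\n"
                + " that is done elsewhere";
        System.out.println(summarize(comment, 60));
        System.out.println(summarize("An ordered collection.", 60));
    }
}
```

No sentence detection is involved, so abbreviations and lowercase identifiers cannot confuse it. The price is that the cut may land mid-word.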
To sum up, extracting sentences from a text written in a natural
language is anything but trivial and might require human
judgement. When done programmatically, occasional mistakes are
inevitable. Doc comments are barely text. While they have some
structure, they also use formatting, code, and markup. Hence,
without pre-processing, text tools might not be applicable. Though
JavaDoc could improve its algorithms and doc comments could be
more friendly, what we have today works surprisingly well on the
OpenJDK codebase. If this is not enough, we could find another way
of extracting a summary or eliminate the need for it completely.
That is, change the presentation in such a way that it won't
require summaries.
-Pavel
[^0]:
https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#format
[^1]:
https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/GraphicsEnvironment.html#preferProportionalFonts()
[^2]: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
[^3]:
https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/BreakIterator.html
[^4]:
https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/FocusTraversalPolicy.html#getComponentAfter(java.awt.Container,java.awt.Component)
[^5]:
https://docs.oracle.com/en/java/javase/14/docs/api/java.xml/org/w3c/dom/DOMLocator.html
[^6]:
https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#styleguide