Hi Jim,

I started off using \R, but that unnecessarily matches \n as well as \r and \r\n.   In this context, we're only looking to replace newlines containing \r. Do you still think using \R would be better?

-- Jon

On 12/17/19 5:00 PM, James Laskey wrote:
Note: you can use the regular expression \R to match line terminators. A bit faster than your pattern. I found this out when testing String::lines() (which was faster still.)

Cheers,

— Jim

On the road.

On Dec 17, 2019, at 8:29 PM, Jonathan Gibbons <[email protected]> wrote:



Updated webrev. This version uses a regular expression and .replaceAll to replace \r\n? with \n in RawHtml. No other changes.

http://cr.openjdk.java.net/~jjg/8236048/webrev.01/

-- Jon


On 12/17/2019 10:44 AM, Jonathan Gibbons wrote:

Inline...


On 12/17/2019 05:51 AM, Pavel Rappo wrote:
Jon,

Could you please explain why "generated files should only contain \n"?

I don't have a robust spec-based answer, so the answer is more along the lines of ...

* because we've always done it this way ... related, back in the day, Apache was the    gold-standard webserver, and was typically running on Linux systems, so favoring that
   config seemd a good idea
* for consistency in the output between user-provided newlines and generated newlines
    ... I did find a spec reference that said to not be inconsistent
* for self-defence in our self-tests, maybe predating back to before Mercurial and before we
  were so rigorous about enforcing Unix-newlines only in our source code


Is it possible to perform the `rawHtml.indexOf('\r') == -1` optimization check 
first thing in
normalizeNewlines?

Are you suggesting it would be better to always call normalizeNewlines and so move the check into the method?  I guess I was thinking it was worth checking to detect whether it was
necessary to call the method in the first place.

New mechanics of string normalization is more readable, albeit is probably 
slower in a typical case.
Old mechanics used bulk append, whereas the new one goes char by char. Not sure 
if it is of any
practical significance though.

I don't see why it will be slower in any case.   In all previous cases, we were unconditionally scanning the string character-by-character in Utils.normalizeNewlines and building up a
new StringBuilder, which was then converted to a String.

Now, in StringContent, the scan is merged with the scan to escape the HTML characters. In RawHtml content, the scan/copy is only done if a \r is found.  Yes, we have to do a read-scan but we only copy/update the string if needed, which is going to be faster than the always-copy
that was done before.

One possible suggestion/improvement, ... but I don't know the performance cost

* remove RawHtml.normalizeNewlines completely
* change the RawHtml constructor to a regex, as in
52 rawHtmlContent = rawHtml.indexOf('\r') == -1 ? rawHtml : rawHtml.replaceAll("\\R", "\n"); I must admit, I sorta like this, even if it might be a bit slower, to use a regex.

Note, as an aside, we use RawHtml way more than we should. The design intent was that it should be a Content-of-last-resort.  I have ideas on how to significantly reduce the usage, which is the general direction and motivation for a lot of this round of cleanup work.

I wonder if normalization should be deferred until the latest possible moment. 
The reason is that
this operation doesn't seem to be additive, i.e.

     normalizeNewlines(A) + normalizeNewlines(B) != normalizeNewlines(A + B).

My initial thought had been to do the normalization on the fly while writing out the characters, but that would mean writing the file character at a time, instead of bulk-write.  It would also mean a bigger semantic change to leave un-normalized newlines in the data members ... note that there is code to check if items "end with newline" which is done by checking if the last character is DocletConstants.NL.  It also seemed inconsistent to be scanning for HTML characters while leaving newlines alone.   If we were to defer normalization until writing out Content objects, I would consider deferring the <>& translation as well. That might be a good idea but could
be done separately as a later more-localized  changeset.

Note that there is a hidden assumption that newlines are completely represented in each call to a method or constructor for any subtype of Content.   Even if we deferred normalization to the very last possible moment, (i.e. writing out) we would not be able to correctly handle a \r\n pair split across a pair of consecutive Content nodes. I don't
think this is a case worth worrying about.



-Pavel

On 17 Dec 2019, at 02:30, Jonathan Gibbons<[email protected]>  wrote:

Please review a moderately simple change for javadoc, to improve the way that 
newlines in user input (doc comments and values for command-line options) are 
handled, reducing the amount of overall string copying.  The underlying issue 
is that user newlines may be \n, \r or \r\n, but the generated files should 
only contain \n.

Instead of using Utils.normalizeNewlines (which does an unconditional copy) at a number 
of sites, the normalization is now done in StringContent and RawHtml. In the case of 
StringContent, the normalization is done while checking and handling the escape sequences 
for < > &; in the case of RawHtml, it is done when constructing the object, but 
only if the argument contains \r.

There are no good pre-existing tests for testing newline behavior, and so a new 
test is provided that runs javadoc on input containing the different kinds of 
newline.  The test revealed a previously unknown bug, that non-standard 
newslines in command-line option values were not handled correctly.

The output is also the same (before-and-after) for the generated JDK API docs.

-- Jon

JBS:https://bugs.openjdk.java.net/browse/JDK-8236048
Webrev:http://cr.openjdk.java.net/~jjg/8236048/webrev.00/index.html



Reply via email to