Re: RFR: [15] JDK-8236048: Cleanup use of Utils.normalizeNewlines

Jonathan Gibbons Wed, 18 Dec 2019 06:40:03 -0800

Hi Jim,

I started off using \R, but that unnecessarily matches \n as well as \rand \r\n. In this context, we're only looking to replace newlinescontaining \r. Do you still think using \R would be better?


-- Jon

On 12/17/19 5:00 PM, James Laskey wrote:

Note: you can use the regular expression \R to match line terminators.A bit faster than your pattern. I found this out when testingString::lines() (which was faster still.)
Cheers,

— Jim

On the road.
On Dec 17, 2019, at 8:29 PM, Jonathan Gibbons<[email protected]> wrote:
Updated webrev. This version uses a regular expression and.replaceAll to replace \r\n? with \n in RawHtml. No other changes.
http://cr.openjdk.java.net/~jjg/8236048/webrev.01/

-- Jon


On 12/17/2019 10:44 AM, Jonathan Gibbons wrote:
Inline...


On 12/17/2019 05:51 AM, Pavel Rappo wrote:
Jon,

Could you please explain why "generated files should only contain \n"?
I don't have a robust spec-based answer, so the answer is more alongthe lines of ...
* because we've always done it this way ... related, back in theday, Apache was the gold-standard webserver, and was typically running on Linuxsystems, so favoring that
   config seemd a good idea
* for consistency in the output between user-provided newlines andgenerated newlines
    ... I did find a spec reference that said to not be inconsistent
* for self-defence in our self-tests, maybe predating back to beforeMercurial and before we
  were so rigorous about enforcing Unix-newlines only in our source code
Is it possible to perform the `rawHtml.indexOf('\r') == -1` optimization check 
first thing in
normalizeNewlines?
Are you suggesting it would be better to always callnormalizeNewlines and so move thecheck into the method? I guess I was thinking it was worth checkingto detect whether it was
necessary to call the method in the first place.
New mechanics of string normalization is more readable, albeit is probably 
slower in a typical case.
Old mechanics used bulk append, whereas the new one goes char by char. Not sure 
if it is of any
practical significance though.
I don't see why it will be slower in any case. In all previouscases, we were unconditionallyscanning the string character-by-character inUtils.normalizeNewlines and building up a
new StringBuilder, which was then converted to a String.
Now, in StringContent, the scan is merged with the scan to escapethe HTML characters.In RawHtml content, the scan/copy is only done if a \r is found. Yes, we have to do a read-scanbut we only copy/update the string if needed, which is going to befaster than the always-copy
that was done before.
One possible suggestion/improvement, ... but I don't know theperformance cost
* remove RawHtml.normalizeNewlines completely
* change the RawHtml constructor to a regex, as in
52 rawHtmlContent = rawHtml.indexOf('\r') == -1 ? rawHtml :rawHtml.replaceAll("\\R", "\n");I must admit, I sorta like this, even if it might be a bit slower,to use a regex.
Note, as an aside, we use RawHtml way more than we should. Thedesign intent was that itshould be a Content-of-last-resort. I have ideas on how tosignificantly reduce the usage,which is the general direction and motivation for a lot of thisround of cleanup work.
I wonder if normalization should be deferred until the latest possible moment. 
The reason is that
this operation doesn't seem to be additive, i.e.

     normalizeNewlines(A) + normalizeNewlines(B) != normalizeNewlines(A + B).
My initial thought had been to do the normalization on the fly whilewriting out the characters,but that would mean writing the file character at a time, instead ofbulk-write. It would also meana bigger semantic change to leave un-normalized newlines in the datamembers ... notethat there is code to check if items "end with newline" which isdone by checking if the lastcharacter is DocletConstants.NL. It also seemed inconsistent to bescanning for HTMLcharacters while leaving newlines alone. If we were to defernormalization until writing out Contentobjects, I would consider deferring the <>& translation as well.That might be a good idea but could
be done separately as a later more-localized  changeset.
Note that there is a hidden assumption that newlines are completelyrepresented ineach call to a method or constructor for any subtype of Content. Even if we deferrednormalization to the very last possible moment, (i.e. writing out)we would not be ableto correctly handle a \r\n pair split across a pair of consecutiveContent nodes. I don't
think this is a case worth worrying about.
-Pavel
On 17 Dec 2019, at 02:30, Jonathan Gibbons<[email protected]>  wrote:

Please review a moderately simple change for javadoc, to improve the way that 
newlines in user input (doc comments and values for command-line options) are 
handled, reducing the amount of overall string copying.  The underlying issue 
is that user newlines may be \n, \r or \r\n, but the generated files should 
only contain \n.

Instead of using Utils.normalizeNewlines (which does an unconditional copy) at a number 
of sites, the normalization is now done in StringContent and RawHtml. In the case of 
StringContent, the normalization is done while checking and handling the escape sequences 
for < > &; in the case of RawHtml, it is done when constructing the object, but 
only if the argument contains \r.

There are no good pre-existing tests for testing newline behavior, and so a new 
test is provided that runs javadoc on input containing the different kinds of 
newline.  The test revealed a previously unknown bug, that non-standard 
newslines in command-line option values were not handled correctly.

The output is also the same (before-and-after) for the generated JDK API docs.

-- Jon

JBS:https://bugs.openjdk.java.net/browse/JDK-8236048
Webrev:http://cr.openjdk.java.net/~jjg/8236048/webrev.00/index.html

Re: RFR: [15] JDK-8236048: Cleanup use of Utils.normalizeNewlines

Reply via email to