Possibly, since the match is likely more expensive than the replace, which is just appending to the sb in both cases.
> On Dec 18, 2019, at 10:38 AM, Jonathan Gibbons <[email protected]> > wrote: > > Hi Jim, > > I started off using \R, but that unnecessarily matches \n as well as \r and > \r\n. In this context, we're only looking to replace newlines containing > \r. Do you still think using \R would be better? > > -- Jon > > On 12/17/19 5:00 PM, James Laskey wrote: >> Note: you can use the regular expression \R to match line terminators. A bit >> faster than your pattern. I found this out when testing String::lines() >> (which was faster still.) >> >> Cheers, >> >> — Jim >> >> On the road. >> >>> On Dec 17, 2019, at 8:29 PM, Jonathan Gibbons <[email protected]> >>> <mailto:[email protected]> wrote: >>> >>> >>> Updated webrev. This version uses a regular expression and .replaceAll to >>> replace \r\n? with \n in RawHtml. No other changes. >>> >>> http://cr.openjdk.java.net/~jjg/8236048/webrev.01/ >>> <http://cr.openjdk.java.net/~jjg/8236048/webrev.01/> >>> -- Jon >>> >>> >>> On 12/17/2019 10:44 AM, Jonathan Gibbons wrote: >>>> Inline... >>>> >>>> >>>> On 12/17/2019 05:51 AM, Pavel Rappo wrote: >>>>> Jon, >>>>> >>>>> Could you please explain why "generated files should only contain \n"? >>>> >>>> I don't have a robust spec-based answer, so the answer is more along the >>>> lines of ... >>>> >>>> * because we've always done it this way ... related, back in the day, >>>> Apache was the >>>> gold-standard webserver, and was typically running on Linux systems, so >>>> favoring that >>>> config seemd a good idea >>>> * for consistency in the output between user-provided newlines and >>>> generated newlines >>>> ... I did find a spec reference that said to not be inconsistent >>>> * for self-defence in our self-tests, maybe predating back to before >>>> Mercurial and before we >>>> were so rigorous about enforcing Unix-newlines only in our source code >>>> >>>> >>>>> Is it possible to perform the `rawHtml.indexOf('\r') == -1` optimization >>>>> check first thing in >>>>> normalizeNewlines? >>>> >>>> Are you suggesting it would be better to always call normalizeNewlines and >>>> so move the >>>> check into the method? I guess I was thinking it was worth checking to >>>> detect whether it was >>>> necessary to call the method in the first place. >>>> >>>>> New mechanics of string normalization is more readable, albeit is >>>>> probably slower in a typical case. >>>>> Old mechanics used bulk append, whereas the new one goes char by char. >>>>> Not sure if it is of any >>>>> practical significance though. >>>> >>>> I don't see why it will be slower in any case. In all previous cases, we >>>> were unconditionally >>>> scanning the string character-by-character in Utils.normalizeNewlines and >>>> building up a >>>> new StringBuilder, which was then converted to a String. >>>> >>>> Now, in StringContent, the scan is merged with the scan to escape the HTML >>>> characters. >>>> In RawHtml content, the scan/copy is only done if a \r is found. Yes, we >>>> have to do a read-scan >>>> but we only copy/update the string if needed, which is going to be faster >>>> than the always-copy >>>> that was done before. >>>> >>>> One possible suggestion/improvement, ... but I don't know the performance >>>> cost >>>> >>>> * remove RawHtml.normalizeNewlines completely >>>> * change the RawHtml constructor to a regex, as in >>>> 52 rawHtmlContent = rawHtml.indexOf('\r') == -1 ? rawHtml : >>>> rawHtml.replaceAll("\\R", "\n"); >>>> I must admit, I sorta like this, even if it might be a bit slower, to use >>>> a regex. >>>> >>>> Note, as an aside, we use RawHtml way more than we should. The design >>>> intent was that it >>>> should be a Content-of-last-resort. I have ideas on how to significantly >>>> reduce the usage, >>>> which is the general direction and motivation for a lot of this round of >>>> cleanup work. >>>> >>>>> I wonder if normalization should be deferred until the latest possible >>>>> moment. The reason is that >>>>> this operation doesn't seem to be additive, i.e. >>>>> >>>>> normalizeNewlines(A) + normalizeNewlines(B) != normalizeNewlines(A + >>>>> B). >>>> >>>> My initial thought had been to do the normalization on the fly while >>>> writing out the characters, >>>> but that would mean writing the file character at a time, instead of >>>> bulk-write. It would also mean >>>> a bigger semantic change to leave un-normalized newlines in the data >>>> members ... note >>>> that there is code to check if items "end with newline" which is done by >>>> checking if the last >>>> character is DocletConstants.NL. It also seemed inconsistent to be >>>> scanning for HTML >>>> characters while leaving newlines alone. If we were to defer >>>> normalization until writing out Content >>>> objects, I would consider deferring the <>& translation as well. That >>>> might be a good idea but could >>>> be done separately as a later more-localized changeset. >>>> >>>> Note that there is a hidden assumption that newlines are completely >>>> represented in >>>> each call to a method or constructor for any subtype of Content. Even if >>>> we deferred >>>> normalization to the very last possible moment, (i.e. writing out) we >>>> would not be able >>>> to correctly handle a \r\n pair split across a pair of consecutive Content >>>> nodes. I don't >>>> think this is a case worth worrying about. >>>> >>>> >>>> >>>>> -Pavel >>>>> >>>>>> On 17 Dec 2019, at 02:30, Jonathan Gibbons <[email protected]> >>>>>> <mailto:[email protected]> wrote: >>>>>> >>>>>> Please review a moderately simple change for javadoc, to improve the way >>>>>> that newlines in user input (doc comments and values for command-line >>>>>> options) are handled, reducing the amount of overall string copying. >>>>>> The underlying issue is that user newlines may be \n, \r or \r\n, but >>>>>> the generated files should only contain \n. >>>>>> >>>>>> Instead of using Utils.normalizeNewlines (which does an unconditional >>>>>> copy) at a number of sites, the normalization is now done in >>>>>> StringContent and RawHtml. In the case of StringContent, the >>>>>> normalization is done while checking and handling the escape sequences >>>>>> for < > &; in the case of RawHtml, it is done when constructing the >>>>>> object, but only if the argument contains \r. >>>>>> >>>>>> There are no good pre-existing tests for testing newline behavior, and >>>>>> so a new test is provided that runs javadoc on input containing the >>>>>> different kinds of newline. The test revealed a previously unknown bug, >>>>>> that non-standard newslines in command-line option values were not >>>>>> handled correctly. >>>>>> >>>>>> The output is also the same (before-and-after) for the generated JDK API >>>>>> docs. >>>>>> >>>>>> -- Jon >>>>>> >>>>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8236048 >>>>>> <https://bugs.openjdk.java.net/browse/JDK-8236048> >>>>>> Webrev: http://cr.openjdk.java.net/~jjg/8236048/webrev.00/index.html >>>>>> <http://cr.openjdk.java.net/%7Ejjg/8236048/webrev.00/index.html> >>>>>> >>>> >>>
