Re: RFR: [15] JDK-8236048: Cleanup use of Utils.normalizeNewlines

Jim Laskey Wed, 18 Dec 2019 06:48:06 -0800

Possibly, since the match is likely more expensive than the replace, which is 
just appending to the sb in both cases.



> On Dec 18, 2019, at 10:38 AM, Jonathan Gibbons <[email protected]> 
> wrote:
> 
> Hi Jim,
> 
> I started off using \R, but that unnecessarily matches \n as well as \r and 
> \r\n.   In this context, we're only looking to replace newlines containing 
> \r. Do you still think using \R would be better?
> 
> -- Jon
> 
> On 12/17/19 5:00 PM, James Laskey wrote:
>> Note: you can use the regular expression \R to match line terminators. A bit 
>> faster than your pattern. I found this out when testing String::lines() 
>> (which was faster still.)
>> 
>> Cheers,
>> 
>> — Jim
>> 
>> On the road.
>> 
>>> On Dec 17, 2019, at 8:29 PM, Jonathan Gibbons <[email protected]> 
>>> <mailto:[email protected]> wrote:
>>> 
>>> 
>>> Updated webrev. This version uses a regular expression and .replaceAll to 
>>> replace \r\n? with \n in RawHtml. No other changes.
>>> 
>>> http://cr.openjdk.java.net/~jjg/8236048/webrev.01/ 
>>> <http://cr.openjdk.java.net/~jjg/8236048/webrev.01/>
>>> -- Jon
>>> 
>>> 
>>> On 12/17/2019 10:44 AM, Jonathan Gibbons wrote:
>>>> Inline...
>>>> 
>>>> 
>>>> On 12/17/2019 05:51 AM, Pavel Rappo wrote:
>>>>> Jon,
>>>>> 
>>>>> Could you please explain why "generated files should only contain \n"?
>>>> 
>>>> I don't have a robust spec-based answer, so the answer is more along the 
>>>> lines of ...
>>>> 
>>>> * because we've always done it this way ... related, back in the day, 
>>>> Apache was the
>>>>    gold-standard webserver, and was typically running on Linux systems, so 
>>>> favoring that
>>>>    config seemd a good idea
>>>> * for consistency in the output between user-provided newlines and 
>>>> generated newlines
>>>>     ... I did find a spec reference that said to not be inconsistent
>>>> * for self-defence in our self-tests, maybe predating back to before 
>>>> Mercurial and before we
>>>>   were so rigorous about enforcing Unix-newlines only in our source code
>>>> 
>>>> 
>>>>> Is it possible to perform the `rawHtml.indexOf('\r') == -1` optimization 
>>>>> check first thing in
>>>>> normalizeNewlines?
>>>> 
>>>> Are you suggesting it would be better to always call normalizeNewlines and 
>>>> so move the
>>>> check into the method?  I guess I was thinking it was worth checking to 
>>>> detect whether it was
>>>> necessary to call the method in the first place.
>>>> 
>>>>> New mechanics of string normalization is more readable, albeit is 
>>>>> probably slower in a typical case.
>>>>> Old mechanics used bulk append, whereas the new one goes char by char. 
>>>>> Not sure if it is of any
>>>>> practical significance though.
>>>> 
>>>> I don't see why it will be slower in any case.   In all previous cases, we 
>>>> were unconditionally 
>>>> scanning the string character-by-character in Utils.normalizeNewlines and 
>>>> building up a 
>>>> new StringBuilder, which was then converted to a String.  
>>>> 
>>>> Now, in StringContent, the scan is merged with the scan to escape the HTML 
>>>> characters.
>>>> In RawHtml content, the scan/copy is only done if a \r is found.  Yes, we 
>>>> have to do a read-scan
>>>> but we only copy/update the string if needed, which is going to be faster 
>>>> than the always-copy
>>>> that was done before.
>>>> 
>>>> One possible suggestion/improvement, ... but I don't know the performance 
>>>> cost
>>>> 
>>>> * remove RawHtml.normalizeNewlines completely
>>>> * change the RawHtml constructor to a regex, as in
>>>>   52         rawHtmlContent = rawHtml.indexOf('\r') == -1 ? rawHtml : 
>>>> rawHtml.replaceAll("\\R", "\n");
>>>> I must admit, I sorta like this, even if it might be a bit slower, to use 
>>>> a regex.
>>>> 
>>>> Note, as an aside, we use RawHtml way more than we should. The design 
>>>> intent was that it
>>>> should be a Content-of-last-resort.  I have ideas on how to significantly 
>>>> reduce the usage,
>>>> which is the general direction and motivation for a lot of this round of 
>>>> cleanup work.
>>>> 
>>>>> I wonder if normalization should be deferred until the latest possible 
>>>>> moment. The reason is that
>>>>> this operation doesn't seem to be additive, i.e.
>>>>> 
>>>>>     normalizeNewlines(A) + normalizeNewlines(B) != normalizeNewlines(A + 
>>>>> B).
>>>> 
>>>> My initial thought had been to do the normalization on the fly while 
>>>> writing out the characters, 
>>>> but that would mean writing the file character at a time, instead of 
>>>> bulk-write.  It would also mean
>>>> a bigger semantic change to leave un-normalized newlines in the data 
>>>> members ... note
>>>> that there is code to check if items "end with newline" which is done by 
>>>> checking if the last
>>>> character is DocletConstants.NL.  It also seemed inconsistent to be 
>>>> scanning for HTML
>>>> characters while leaving newlines alone.   If we were to defer 
>>>> normalization until writing out Content
>>>> objects, I would consider deferring the <>& translation as well. That 
>>>> might be a good idea but could 
>>>> be done separately as a later more-localized  changeset.
>>>> 
>>>> Note that there is a hidden assumption that newlines are completely 
>>>> represented in
>>>> each call to a method or constructor for any subtype of Content.   Even if 
>>>> we deferred
>>>> normalization to the very last possible moment, (i.e. writing out) we 
>>>> would not be able
>>>> to correctly handle a \r\n pair split across a pair of consecutive Content 
>>>> nodes. I don't
>>>> think this is a case worth worrying about.
>>>> 
>>>> 
>>>> 
>>>>> -Pavel
>>>>> 
>>>>>> On 17 Dec 2019, at 02:30, Jonathan Gibbons <[email protected]> 
>>>>>> <mailto:[email protected]> wrote:
>>>>>> 
>>>>>> Please review a moderately simple change for javadoc, to improve the way 
>>>>>> that newlines in user input (doc comments and values for command-line 
>>>>>> options) are handled, reducing the amount of overall string copying.  
>>>>>> The underlying issue is that user newlines may be \n, \r or \r\n, but 
>>>>>> the generated files should only contain \n.
>>>>>> 
>>>>>> Instead of using Utils.normalizeNewlines (which does an unconditional 
>>>>>> copy) at a number of sites, the normalization is now done in 
>>>>>> StringContent and RawHtml. In the case of StringContent, the 
>>>>>> normalization is done while checking and handling the escape sequences 
>>>>>> for < > &; in the case of RawHtml, it is done when constructing the 
>>>>>> object, but only if the argument contains \r.
>>>>>> 
>>>>>> There are no good pre-existing tests for testing newline behavior, and 
>>>>>> so a new test is provided that runs javadoc on input containing the 
>>>>>> different kinds of newline.  The test revealed a previously unknown bug, 
>>>>>> that non-standard newslines in command-line option values were not 
>>>>>> handled correctly.
>>>>>> 
>>>>>> The output is also the same (before-and-after) for the generated JDK API 
>>>>>> docs.
>>>>>> 
>>>>>> -- Jon
>>>>>> 
>>>>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8236048 
>>>>>> <https://bugs.openjdk.java.net/browse/JDK-8236048>
>>>>>> Webrev: http://cr.openjdk.java.net/~jjg/8236048/webrev.00/index.html 
>>>>>> <http://cr.openjdk.java.net/%7Ejjg/8236048/webrev.00/index.html>
>>>>>> 
>>>> 
>>>

Re: RFR: [15] JDK-8236048: Cleanup use of Utils.normalizeNewlines

Reply via email to