On Sun, 14 Jan 2024 at 05:38, Claude Warren wrote:
>
> The reason it did not rise linearly with more SPDX checks is that the SPDX
> check works differently from most of the other matchers. Because the same
> regex call is used by all SPDX matchers, the system only runs the match
> once, and every SPDX matcher then looks to see whether it was matched. So
> there will be only a very slight increase as the number of SPDX matchers
> is increased.
>
> However, the number of regex calls in 0.16 is significantly higher than in
> 0.15, and your metering code surrounds the issue nicely.
> If you change the first "if" statement in the check method to read
>
>     if ((lastLine == null || !lastLine.equals(line)) && line.contains(
>             "SPDX-License-Identifier")) {
>
> performance will increase dramatically and return to approximately the
> same level as v0.15. This is the change I have put forward in pull
> request 192. What this change does is verify that the line contains
> something that looks like the SPDX identifier before attempting the slow
> regex extraction. This regex would otherwise fire on every line read from
> every file.
>
> However, this change has uncovered a latent issue somewhere in the code.
> When I run the code using Maven on my local machine, all the tests pass;
> when the CI system runs it, some tests fail. When I run the entire test
> suite in my IDE (Eclipse), 3 tests fail. When I run those test classes in
> my IDE individually, they pass.
>
> The errors appear to indicate that the wrong licenses are firing, as it
> reports GPL3 for all the non-binary, non-archive, non-notice files. And it
> occurs early enough that the XML report generation does not see any Apache
> licenses triggering.
>
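For readers following along, the idea behind the quoted guard can be sketched in isolation. This is only an illustration, not the RAT source: the class name, the method name and the regular expression below are assumptions made up for the example. The point is simply that a plain substring test runs before the regex and rejects most lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpdxGuardSketch {
    // Marker string taken from the quoted "if" statement.
    private static final String MARKER = "SPDX-License-Identifier";

    // Hypothetical extraction pattern; the expression used by RAT may differ.
    private static final Pattern SPDX =
            Pattern.compile("SPDX-License-Identifier:\\s+([A-Za-z0-9.\\-]+)");

    /** Returns the SPDX identifier found in the line, or null if there is none. */
    public static String extract(String line) {
        // Cheap substring test first: lines without the marker never reach the regex.
        if (!line.contains(MARKER)) {
            return null;
        }
        Matcher m = SPDX.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints Apache-2.0
        System.out.println(extract("// SPDX-License-Identifier: Apache-2.0"));
        // prints null
        System.out.println(extract("int x = 42;"));
    }
}
```

The guard does not change which lines match; it only avoids running the regex on lines that cannot match.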
Can you reproduce my findings regarding the increased (doubled) time when
the list of files with unapproved licenses needs to be printed?

> Claude
>
> On Sat, Jan 13, 2024 at 9:56 PM Jochen Wiedmann wrote:
>
> > Hi, I'd like to discuss RAT-325 here, as I think that this is the
> > proper place for such discussions.
> >
> > Let me start by giving an outline of the information that I currently
> > have:
> >
> > In RAT-325, the original reporter claimed to see extremely different
> > results in terms of performance between 0.15 and 0.16. This claim was
> > later confirmed by another user, who also described how to
> > reproduce the issue on the source code of Apache Openmeetings. Using
> > that description, I was able to confirm that there is, indeed, a
> > massive gap.
> >
> > The discussion quickly concentrated on the SPDX support (more
> > precisely: the RegExp handling) as the most likely suspect. My
> > understanding is that this feature was introduced in 0.16, so
> > the assumption appears to be natural. On the other hand, as far as I
> > can tell, no evidence has been given so far that nails down the fact.
> >
> > In order to get some hard data, I did an experiment by changing the
> > source code of SPDXMatcherFactory.check(String,Match) as follows:
> >
> >     private long totalCalls = 0;
> >     private long totalTime = 0;
> >     private boolean check(String line, Match caller) {
> >         final long startTime = System.currentTimeMillis();
> >         /* Real code follows here, creating a boolean variable result. */
> >         final long endTime = System.currentTimeMillis();
> >         totalTime += (endTime-startTime);
> >         ++totalCalls;
> >         System.out.println("check: totalCalls="
> >             + totalCalls + ", totalTime="
> >             + totalTime);
> >         return result;
> >     }
> >
> > My assumption was: if the RegExp code (which is used in that method) is
> > the problem, then I would see the variable totalTime rise very
> > quickly, and roughly linearly with the variable totalCalls.
> > However, that is not the case. Quoting from the output of "mvn clean
> > apache-rat:0.16.1-SNAPSHOT:check" in openmeetings/openmeetings-web, I
> > see
> >
> >     check: totalCalls=377961, totalTime=6018
> >     check: totalCalls=377962, totalTime=6018
> >     check: totalCalls=377963, totalTime=6018
> >     check: totalCalls=377964, totalTime=53385
> >     check: totalCalls=377965, totalTime=97949
> >     check: totalCalls=377966, totalTime=151063
> >     check: totalCalls=377967, totalTime=197750
> >
> > In summary, over the first 377963 calls, the performance is just fine,
> > with totalTime growing much slower than totalCalls. However, beginning
> > with totalCalls=377964, the picture changes completely.
> >
> > These results are, of course, strictly local to my machine (a rather
> > limited Chromebook), and perhaps not reproducible elsewhere. However,
> > if they are, then there is something going on that I do not really
> > understand.
> >
> > So, please try to reproduce this, and let me know your results and/or
> > ideas.
> >
> > Thanks,
> >
> > Jochen
> >
> > --
> > The woman was born in a full-blown thunderstorm. She probably told it
> > to be quiet. It probably did. (Robert Jordan, Winter's heart)
>
> --
> LinkedIn: http://www.linkedin.com/in/claudewarren

--
Best regards,
Maxim
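
In the quoted output, totalTime stays at 6018 ms through call 377963 and then grows by roughly 45 to 53 seconds on each of the next four calls, so a handful of individual calls dominate the total. A meter that also remembers the slowest single call would make such outliers easier to spot than printing on every call. The following is only a sketch under that assumption: the class and its method names are hypothetical and not part of RAT, and it uses System.nanoTime() instead of System.currentTimeMillis() for finer resolution.

```java
import java.util.function.Supplier;

public class CallMeterSketch {
    private long totalCalls;
    private long totalNanos;
    private long maxNanos;
    private long slowestCall; // 1-based index of the slowest call seen so far

    /** Records the duration of one call and updates the running maximum. */
    public void record(long nanos) {
        totalCalls++;
        totalNanos += nanos;
        if (nanos > maxNanos) {
            maxNanos = nanos;
            slowestCall = totalCalls;
        }
    }

    /** Runs the body, records its elapsed time, and returns its result. */
    public <T> T time(Supplier<T> body) {
        final long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            record(System.nanoTime() - start);
        }
    }

    public long getTotalCalls() { return totalCalls; }
    public long getTotalNanos() { return totalNanos; }
    public long getMaxNanos() { return maxNanos; }
    public long getSlowestCall() { return slowestCall; }

    /** One-line summary, meant to be printed once at the end of a run. */
    public String summary() {
        return "totalCalls=" + totalCalls
                + ", totalMillis=" + (totalNanos / 1_000_000)
                + ", slowestCall=" + slowestCall
                + ", slowestMillis=" + (maxNanos / 1_000_000);
    }

    public static void main(String[] args) {
        CallMeterSketch meter = new CallMeterSketch();
        // Stand-in for the real check; the predicate here is illustrative only.
        for (String line : new String[] {"a", "SPDX-License-Identifier: MIT", "b"}) {
            meter.time(() -> line.contains("SPDX-License-Identifier"));
        }
        System.out.println(meter.summary());
    }
}
```

Wrapping the body of check(...) in meter.time(...) and printing summary() once at the end would also keep the console output itself out of the measurement. Whether the slow calls correspond to a particular file or line is not established in this thread.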
