The reason it did not rise linearly with more SPDX checks is that the SPDX
check works differently from most of the other matchers.  Because the same
regex call is used by all SPDX matchers, the system runs the match only
once, and every SPDX matcher then looks to see whether it was the one
matched.  So there will be only a very slight increase as the number of
SPDX matchers is increased.
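
As a rough illustration of that sharing, here is a minimal Java sketch.  It
is not the actual RAT code: the class name, the pattern, and the run counter
are assumptions made up for the example.  Only the idea of running the regex
once and letting each matcher compare against the cached result comes from
the description above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the shared SPDX check; not the real RAT class.
class SharedSpdxMatchSketch {
    // Assumed pattern for illustration; the real expression may differ.
    private static final Pattern SPDX =
            Pattern.compile("SPDX-License-Identifier:\\s*([A-Za-z0-9.+-]+)");

    private String lastLine; // last line the regex was run on
    private String lastId;   // identifier found on that line, or null
    private int runs;        // counts regex executions, for the demo only

    // Called by every SPDX matcher; the regex runs only when the line changes.
    boolean check(String line, String callerId) {
        if (lastLine == null || !lastLine.equals(line)) {
            runs++;
            Matcher m = SPDX.matcher(line);
            lastId = m.find() ? m.group(1) : null;
            lastLine = line;
        }
        return callerId.equals(lastId);
    }

    int regexRuns() {
        return runs;
    }

    public static void main(String[] args) {
        SharedSpdxMatchSketch shared = new SharedSpdxMatchSketch();
        String line = "// SPDX-License-Identifier: Apache-2.0";
        String[] matcherIds = {"Apache-2.0", "MIT", "GPL-3.0-only"};
        for (String id : matcherIds) {
            System.out.println(id + " matched: " + shared.check(line, id));
        }
        System.out.println("regex runs: " + shared.regexRuns());
    }
}
```

In this sketch, adding more matcher ids adds only a string comparison per
line, which is why the cost should grow very slowly with the matcher count.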

However, the number of regex calls in 0.16 is significantly higher than in
0.15, and your metering code surrounds the issue nicely.
If you change the first "if" statement in the check method to read

    if ((lastLine == null || !lastLine.equals(line))
            && line.contains("SPDX-License-Identifier")) {

performance will increase dramatically and return to approximately the same
level as v0.15.  This is the change I have put forward in pull request 192.
The change verifies that the line contains something that looks like the
SPDX identifier before attempting the slow regex extraction.  This regex
would otherwise fire on every line read from every file.
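
To show the effect of the guard in isolation, here is a small, simplified
Java sketch.  It is an assumption-laden illustration rather than the code in
the pull request: the class name, the pattern, and the counter are
hypothetical, and the lastLine caching is left out to keep it short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of guarding a regex with a cheap substring test.
class SpdxGuardSketch {
    private static final String MARKER = "SPDX-License-Identifier";
    // Assumed pattern for illustration; the real expression may differ.
    private static final Pattern SPDX =
            Pattern.compile(MARKER + ":\\s*([A-Za-z0-9.+-]+)");

    private int runs; // counts regex executions, for the demo only

    // Returns the SPDX id on the line, or null if there is none.
    String extractId(String line) {
        if (!line.contains(MARKER)) {
            return null; // cheap rejection: the regex is never started
        }
        runs++;
        Matcher m = SPDX.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    int regexRuns() {
        return runs;
    }

    public static void main(String[] args) {
        SpdxGuardSketch guard = new SpdxGuardSketch();
        String[] lines = {
            "package org.example;",
            "// SPDX-License-Identifier: Apache-2.0",
            "public class Demo {",
            "}"
        };
        for (String line : lines) {
            System.out.println(guard.extractId(line));
        }
        System.out.println("regex runs: " + guard.regexRuns()
                + " of " + lines.length + " lines");
    }
}
```

The substring test is a plain scan of the line, so lines without the marker
are rejected without involving the regex engine at all.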

However, this change has uncovered a latent issue somewhere in the code.
When I run the code using Maven on my local machine all the tests pass, but
when the CI system runs it some tests fail.  When I run the entire test
suite in my IDE (Eclipse) 3 tests fail.  When I run those test classes in
my IDE individually they pass.

The errors appear to indicate that the wrong licenses are firing, as it
reports GPL3 for all the non-binary, non-archive, non-notice files.  And it
occurs early enough that the XML report generation does not see any Apache
licenses triggering.

Claude

On Sat, Jan 13, 2024 at 9:56 PM Jochen Wiedmann <jochen.wiedm...@gmail.com>
wrote:

> Hi, I'd like to discuss RAT-325 here, as I think, that this is the
> proper place for such discussions.
>
> Let me start by giving an outline of the information, that I currently
> have:
>
> In RAT-325, the original reporter claimed to see extremely different
> results in terms of performance between 0.15, and 0.16. This claim has
> later on been confirmed by another user who also described how to
> reproduce the issue on the source code of Apache Openmeetings. Using
> that description, I was able to confirm that there is, indeed, a
> massive gap.
>
> The discussion quickly concentrated on the SPDX support (more
> precisely: The RegExp handling) as the most likely suspect. My
> understanding is, that this feature has been introduced in 0.16, so
> the assumption appears to be natural. On the other hand, as far as I
> can tell, no evidence has been given so far, that nails down the fact.
>
> In order to get some hard data, I did an experiment by changing the
> source code of SPDXMatcherFactory.check(String,Match) as follows:
>
>     private long totalCalls = 0;
>     private long totalTime = 0;
>     private boolean check(String line, Match caller) {
>          final long startTime = System.currentTimeMillis();
>          /* Real code follows here, creating a boolean variable result. */
>          final long endTime = System.currentTimeMillis();
>          totalTime += (endTime-startTime);
>          ++totalCalls;
>          System.out.println("check: totalCalls="
>                                          + totalCalls + ", totalTime="
> + totalTime);
>           return result;
>     }
>
> My assumption was: If the RegExp code (which is used in that method) is
> the problem, then I would see the variable totalTime rise very
> quickly, and roughly linear with the variable totalCalls. However,
> that is not the case. Quoting from the output of "mvn clean
> apache-rat:0.16.1-SNAPSHOT:check" in openmeetings/openmeetings-web, I
> see
>
>     check: totalCalls=377961, totalTime=6018
>     check: totalCalls=377962, totalTime=6018
>     check: totalCalls=377963, totalTime=6018
>     check: totalCalls=377964, totalTime=53385
>     check: totalCalls=377965, totalTime=97949
>     check: totalCalls=377966, totalTime=151063
>     check: totalCalls=377967, totalTime=197750
>
> In summary, over the first 377963 calls, the performance is just fine,
> with totalTime growing much slower than totalCalls. However, beginning
> with totalCalls=377964, the picture changes completely.
>
> These results are, of course, strictly local on my machine (a rather
> limited Chromebook), and perhaps not reproducible elsewhere. However,
> if they are, then there is something going on, that I do not really
> understand.
>
> So, please try to reproduce this, and let me know your results, and/or
> ideas.
>
> Thanks,
>
> Jochen
>
> --
> The woman was born in a full-blown thunderstorm. She probably told it
> to be quiet. It probably did. (Robert Jordan, Winter's heart)
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren
