On Sun, 14 Jan 2024 at 05:38, Claude Warren <[email protected]> wrote:
>
> The reason it did not rise linearly with more SPDX checks is that the SPDX
> check works differently from most of the other matchers.  Because the same
> regex call is used by all SPDX matchers, the system only runs the match
> once and every SPDX matcher then looks to see if it was matched.  So
> there will be only a very slight increase as the number of SPDX matchers is
> increased.
>
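
To make the shared-regex idea above concrete, here is a minimal sketch. It is
not the actual RAT code: the class name SharedSpdxScanner, the wantedId
parameter, the regexRuns counter and the regular expression itself are all
assumptions made for illustration. It only shows how one regex run per line
can be shared by any number of matchers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not the real SPDXMatcherFactory. */
class SharedSpdxScanner {
    // Assumed pattern; the regex used by RAT may differ.
    private static final Pattern SPDX =
            Pattern.compile("SPDX-License-Identifier:\\s*([A-Za-z0-9.+-]+)");

    private String lastLine; // line that the cached result belongs to
    private String lastId;   // identifier found on lastLine, or null
    int regexRuns = 0;       // number of times the regex was really executed

    /** Returns true if the line declares the SPDX id the caller asks about. */
    boolean check(String line, String wantedId) {
        if (lastLine == null || !lastLine.equals(line)) {
            // First matcher to see this line: run the regex once and cache.
            regexRuns++;
            Matcher m = SPDX.matcher(line);
            lastId = m.find() ? m.group(1) : null;
            lastLine = line;
        }
        // Every other matcher asking about the same line reuses the cache.
        return wantedId.equals(lastId);
    }
}
```

Asking about the same line for three different identifiers leaves regexRuns
at 1, which is why adding SPDX matchers costs very little.
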
> However, the number of regex calls in 0.16 is significantly higher than in
> 0.15.  And your metering code surrounds the issue nicely.
> If you change the first "if" statement in the check method to read
>
> if ((lastLine == null || !lastLine.equals(line)) && line.contains(
> "SPDX-License-Identifier")) {
>
> performance will increase dramatically and return to approximately the same
> level as v0.15.  This is the change I have put forward in pull request 192.
> The change verifies that the line contains something that looks like the
> SPDX identifier before attempting the slow regex extraction.  This regex
> would otherwise fire on every line read from every file.
>
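
The effect of the substring guard can be sketched like this. It is
hypothetical code: GuardedSpdxExtractor, extract, regexRuns and the pattern
are illustrative names, and the lastLine caching from the quoted condition is
left out to keep the example short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a cheap pre-check in front of a regex. */
class GuardedSpdxExtractor {
    static final String MARKER = "SPDX-License-Identifier";
    // Assumed pattern; the regex used by RAT may differ.
    private static final Pattern SPDX =
            Pattern.compile(MARKER + ":\\s*([A-Za-z0-9.+-]+)");

    int regexRuns = 0; // number of times the regex was really executed

    /** Returns the SPDX id declared on the line, or null if there is none. */
    String extract(String line) {
        if (!line.contains(MARKER)) {
            return null; // ordinary line: plain substring test, no regex
        }
        regexRuns++;
        Matcher m = SPDX.matcher(line);
        return m.find() ? m.group(1) : null;
    }
}
```

With this shape the regex runs only for lines that mention the marker, so
regexRuns stays far below the number of lines read.
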
> However, this change has uncovered a latent issue somewhere in the code.
> When I run the code using Maven on my local machine all the tests pass,
> but when the CI system runs it some tests fail.  When I run the entire test
> suite in my IDE (Eclipse) 3 tests fail.  When I run those test classes in
> my IDE individually they pass.
>
> The errors appear to indicate that the wrong licenses are firing, as it
> reports GPL3 for all the non-binary, non-archive, non-notice files.  And it
> occurs early enough that the XML report generation does not see any Apache
> licenses triggering.
>

Can you reproduce my findings regarding the increased (doubled) time when the
list of files with unapproved licenses needs to be printed?


> Claude
>
> On Sat, Jan 13, 2024 at 9:56 PM Jochen Wiedmann <[email protected]>
> wrote:
>
> > Hi, I'd like to discuss RAT-325 here, as I think that this is the
> > proper place for such discussions.
> >
> > Let me start by giving an outline of the information that I currently
> > have:
> >
> > In RAT-325, the original reporter claimed to see extremely different
> > results in terms of performance between 0.15 and 0.16. This claim has
> > later on been confirmed by another user who also described how to
> > reproduce the issue on the source code of Apache Openmeetings. Using
> > that description, I was able to confirm that there is, indeed, a
> > massive gap.
> >
> > The discussion quickly concentrated on the SPDX support (more
> > precisely: the RegExp handling) as the most likely suspect. My
> > understanding is that this feature was introduced in 0.16, so
> > the assumption appears to be natural. On the other hand, as far as I
> > can tell, no evidence has been given so far that nails this down.
> >
> > In order to get some hard data, I did an experiment by changing the
> > source code of SPDXMatcherFactory.check(String,Match) as follows:
> >
> >     private long totalCalls = 0;
> >     private long totalTime = 0;
> >     private boolean check(String line, Match caller) {
> >          final long startTime = System.currentTimeMillis();
> >          /* Real code follows here, creating a boolean variable result. */
> >          final long endTime = System.currentTimeMillis();
> >          totalTime += (endTime-startTime);
> >          ++totalCalls;
> >          System.out.println("check: totalCalls=" + totalCalls
> >                  + ", totalTime=" + totalTime);
> >          return result;
> >     }
> >
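
A possible variant of the metering above, as a sketch only. It uses
System.nanoTime() instead of System.currentTimeMillis() and remembers the
slowest single call rather than printing on every call, so the console output
does not add to the overall run time. CallMeter and its fields are
hypothetical names, not part of RAT.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper that times a boolean-returning piece of code. */
class CallMeter {
    long totalCalls = 0;
    long totalNanos = 0;
    long slowestNanos = 0; // duration of the slowest call seen so far
    long slowestCall = 0;  // 1-based index of that call

    /** Runs the body, records its duration and returns its result. */
    boolean measure(BooleanSupplier body) {
        final long start = System.nanoTime();
        try {
            return body.getAsBoolean();
        } finally {
            final long elapsed = System.nanoTime() - start;
            totalNanos += elapsed;
            ++totalCalls;
            if (elapsed > slowestNanos) {
                slowestNanos = elapsed;
                slowestCall = totalCalls;
            }
        }
    }

    /** One-line summary, intended to be printed once at the end of a run. */
    String summary() {
        return "check: totalCalls=" + totalCalls
                + ", totalMillis=" + (totalNanos / 1_000_000)
                + ", slowestCall=" + slowestCall
                + ", slowestMillis=" + (slowestNanos / 1_000_000);
    }
}
```

Inside check(String,Match) the real code would go into the body passed to
measure(...), and summary() would be printed once when the run finishes.
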
> > My assumption was: If the RegExp code (which is used in that method) is
> > the problem, then I would see the variable totalTime rise very
> > quickly, and roughly linearly with the variable totalCalls. However,
> > that is not the case. Quoting from the output of "mvn clean
> > apache-rat:0.16.1-SNAPSHOT:check" in openmeetings/openmeetings-web, I
> > see
> >
> >     check: totalCalls=377961, totalTime=6018
> >     check: totalCalls=377962, totalTime=6018
> >     check: totalCalls=377963, totalTime=6018
> >     check: totalCalls=377964, totalTime=53385
> >     check: totalCalls=377965, totalTime=97949
> >     check: totalCalls=377966, totalTime=151063
> >     check: totalCalls=377967, totalTime=197750
> >
> > In summary, over the first 377963 calls, the performance is just fine,
> > with totalTime growing much more slowly than totalCalls. However, beginning
> > with totalCalls=377964, the picture changes completely.
> >
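
The jump is easier to see as a per-call difference. The following sketch
parses lines in the format printed by the metering code and reports the first
call whose increase in totalTime exceeds a threshold. MeterLogDeltas,
firstJump and the threshold are hypothetical choices for this illustration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the "check: ..." lines quoted above. */
class MeterLogDeltas {
    private static final Pattern LINE =
            Pattern.compile("check: totalCalls=(\\d+), totalTime=(\\d+)");

    /**
     * Returns {callNumber, deltaMillis} for the first line whose totalTime
     * grew by more than thresholdMillis compared with the previous parsed
     * line, or null if no such line exists.
     */
    static long[] firstJump(List<String> lines, long thresholdMillis) {
        long previousTime = -1; // -1 means: no line parsed yet
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) {
                continue; // skip unrelated output
            }
            long call = Long.parseLong(m.group(1));
            long time = Long.parseLong(m.group(2));
            if (previousTime >= 0 && time - previousTime > thresholdMillis) {
                return new long[] {call, time - previousTime};
            }
            previousTime = time;
        }
        return null;
    }
}
```

Fed the seven lines quoted above with a threshold of 1000 ms, it reports call
377964 with an increase of 47367 ms (53385 - 6018).
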
> > These results are, of course, strictly local to my machine (a rather
> > limited Chromebook), and perhaps not reproducible elsewhere. However,
> > if they are, then there is something going on that I do not really
> > understand.
> >
> > So, please try to reproduce this, and let me know your results and/or
> > ideas.
> >
> > Thanks,
> >
> > Jochen
> >
> > --
> > The woman was born in a full-blown thunderstorm. She probably told it
> > to be quiet. It probably did. (Robert Jordan, Winter's heart)
> >
>
>
> --
> LinkedIn: http://www.linkedin.com/in/claudewarren



-- 
Best regards,
Maxim
