Re: CLC, inclusive language, and Apache

sebb Tue, 31 Aug 2021 16:48:11 -0700

On Wed, 1 Sept 2021 at 00:30, Matt Sicker <[email protected]> wrote:
>
> Security scanners, compilers, linters, etc., are all incredibly noisy when 
> first enabled on an existing codebase. I’d expect similarly for tools that 
> attempt to lint the English language. If the tool were smart enough to avoid 
> false positives, it might also be superintelligent. Remember, naming things 
> is one of the hardest parts of programming!


That is not the case here.
The analogy with a compiler is particularly inappropriate: if a compiler
reports an error, it has to be fixed, whereas a hit from this checker
may well be a false positive.

The way words are detected currently is bound to cause false positives.
As I already wrote, the way to handle this is to run the checker on a
substantial codebase, and look for projects with an abnormal number of
hits.
Those are likely to be false positives. Fix those and try again.
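
As a rough sketch of that triage step in Python: given per-project hit
counts, flag the projects whose count is far above the median. The
function name, the factor of 10, and the project names and counts other
than the roughly 1800 hits for commons-csv are invented for illustration;
this is not how the actual tool works.

```python
from statistics import median

def abnormal_projects(hit_counts, factor=10):
    """Return the names of projects whose hit count exceeds
    `factor` times the median hit count (hypothetical heuristic)."""
    typical = median(hit_counts.values())
    # Guard against a median of zero so the threshold stays meaningful.
    threshold = max(typical, 1) * factor
    return sorted(name for name, hits in hit_counts.items() if hits > threshold)

# Invented counts, apart from commons-csv (over 1800 reports of 'he').
counts = {
    "project-a": 12,
    "project-b": 30,
    "project-c": 7,
    "commons-csv": 1800,
}

for name in abnormal_projects(counts):
    print(f"{name}: inspect for false positives before publishing")
```

Anything this flags is a candidate for a matching bug rather than a
genuine finding, so it should be examined by hand first.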

Again, if the first run does produce lots of hits, then be more
conservative in matching for the initial run.
For example, look for words which are bracketed by white-space and
perhaps quotes, nothing more.
If that produces no hits, gradually widen the matching.
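
A minimal sketch of such a conservative first pass, in Python. It only
reports a word when it is surrounded by white-space (or the start/end of
the text), optionally wrapped in quotes. The word list here is just the
two terms mentioned in this thread, and the function names are
hypothetical; the real checker's list and matching rules are not known
to me.

```python
import re

# Illustrative word list only; not the real checker's configuration.
FLAGGED = ["master", "he"]

def build_pattern(words):
    """Match a flagged word only when bracketed by white-space
    (or start/end of text), optionally wrapped in quotes."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(
        r"(?<!\S)[\"']?(" + alternatives + r")[\"']?(?!\S)",
        re.IGNORECASE,
    )

def find_hits(text, pattern):
    """Return the flagged words found in `text`."""
    return [m.group(1) for m in pattern.finditer(text)]

pattern = build_pattern(FLAGGED)

# Test-data fragment of the commons-csv kind: no white-space around 'he'.
print(find_hits("..|=he|...", pattern))
# A URL path segment is not bracketed by white-space either.
print(find_hits("https://example.org/repo/tree/master", pattern))
# Ordinary prose and quoted words are still reported.
print(find_hits("then he renamed the 'master' branch", pattern))
```

This deliberately misses words followed by punctuation; that is the
sort of thing to loosen later, one step at a time.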

There is no point producing an initial analysis with hundreds of hits.

Sorry, but IMO the problem here is insufficient testing.

> Matt Sicker
>
> > On Aug 31, 2021, at 17:20, sebb <[email protected]> wrote:
> >
> > On Tue, 31 Aug 2021 at 17:50, Rich Bowen <[email protected]> wrote:
> >>
> >>
> >>
> >>> On 8/31/21 12:24 PM, sebb wrote:
> >>>
> >>> That seems to me to be an overreaction.
> >>
> >> Yes, I can see that it would seem that way without a larger context. The
> >> number of messages I have received on various lists, and off-list,
> >> calling this effort wrong/bad/evil, have been ... demoralizing, shall we
> >> say?
> >>
> >>> In my case, I have no complaints about the purpose of the analysis.
> >>> It's the excessive false positives and UI of the software that is the
> >>> problem, combined with a poorly worded email.
> >>
> >> I appreciate that you have no complaints about the purpose of the
> >> analysis. Others do, and have made those complaints both very obvious
> >> and very personal.
> >>
> >> While this is often the case with this conversation, the vitriol this
> >> time has been somewhat disturbing. And that's from someone who has had
> >> this conversation with probably 200 projects over the past 18 months.
> >>
> >>> I think what needs to happen is for a detailed investigation of the
> >>> results, especially for projects that have lots of hits, so that the
> >>> scanning can be properly tuned.
> >>> It's pretty obvious at present that the scanning is far too eager to
> >>> report issues (and not just master in URLs).
> >>
> >> I disagree.
> >
> > Have you actually looked at any of the scans?
> > In the case of commons-csv, there were over 1800 reports of the use of
> > 'he' in code.
> > However these were all parts of a test data file, for example:
> >
> > ..|=he|פוזארוואץ|...
> >
> > I assume that is he for Hebrew; it should not have been flagged.
> >
> >> I think that highlighting all potential problematic
> >> words/phrases is part of the message, whether or not the project in
> >> question feels the need to address all of them. The purpose here is to
> >> make people aware of how the words/phrases in their code and
> >> documentation affect other people.
> >>
> >>> There also needs to be some work on the UI, to make it easier to
> >>> ignore individual files, and to make it easier to actually edit the
> >>> source files.
> >>
> >> It is not the goal of the tool to make editing source files easy or even
> >> possible. It's a code analysis tool. Sure, it could link to the file in
> >> the target repository, which may be what you're asking for. But it's not
> >> intended to be a remediation tool.
> >
> > The easier you make it for projects, the more likely they are to use
> > it and persuade others to do the same.
> >
> > I am only suggesting providing links to the Git repo files.
> > Assuming that the tool has local checkouts of the repo, that should
> > not be hard to do.
> > At present not even the GitHub source repos listed at the head of the
> > page are linked, and they are already URLs.
> >
> >>> There are some other issues, no doubt.
> >>>
> >>> Once the reports are usable without lots of effort by projects, then
> >>> maybe start inviting a few random projects to see if they have any
> >>> feedback on the analysis.
> >>> Fix any issues, and gradually increase the number of projects.
> >>>
> >>> It might be an idea to send a follow-up email to explain why all the
> >>> projects have been removed.
> >>
> >> One of the things we were reprimanded for was sending a cross-project
> >> email about this topic in the first place. As such, I won't be
> >> advocating sending a followup email on the same topic. Someone else is,
> >> of course, welcome to pursue that avenue.
> >>
> >>>
> >>> Though I think it would have been better to keep the projects (apart
> >>> from retired ones), but send an email to say that the analyses are
> >>> currently at the alpha stage, and solicit feedback on improving the
> >>> scanning.
> >>
> >> They're *not* at an alpha stage. Those words *do* appear in the code.
> >> And I'm already using this same tool elsewhere, as part of my day job.
> >> It's a tool. It wasn't the tool that people objected to. It was the
> >> analysis.
> >
> > I only object to the analysis inasmuch as it provides too many false 
> > positives.
> >
> >>>
> >>> That way might result in analyses that projects actually want.
> >>
> >> My take-away was that the projects *don't* want this analysis.
> >
> > That may be true for some; I don't think it is true for all.
> >
> > But it will remain so unless the tool is a lot easier to use.
> >
> >>
> >> --
> >> Rich Bowen - [email protected]
> >> @rbowen
