On Tue, Aug 31, 2021 at 5:20 PM sebb <seb...@gmail.com> wrote:
>
> In the case of commons-csv, there were over 1800 reports of the use of
> 'he' in code.
> However these were all parts of a test data file, for example:
>
> ..|=he|פוזארוואץ|...
Not only is this a language code, it cannot usefully be flagged without some
contextual language mechanics. For example, "He" is one of the 20 most common
surnames in mainland China and common in many other countries. As the
preferred pronoun of many people, "he" is perfectly useful and must not be
substituted. The first definition here cannot be substituted; the second
certainly can be and should be considered:
https://www.merriam-webster.com/dictionary/he
As we have few chemistry-based projects, the chemical symbol isn't a problem
now, but it would be for such projects.

The word "master" is equally problematic, also citing Merriam-Webster:
https://www.merriam-webster.com/dictionary/master
A number of its uses are common in education and professional software
development: educational attainment, expertise or position, a specific job
title in various armed forces, a specific legal job title, and a reference to
original source material. These can't be substituted.

It would be enough to flag the terms 'slave' or 'slavery' as questionable.
There is an engineering phrase using both "master" and "slave", but it's hard
to find other reasonable uses for this word to describe processes, especially
new processes.

Natural language systems are designed to decode the meaning of ambiguous
words. I'm not sure this tool will be at all useful without that sort of
context, so it may be best to drop these excessively ambiguous examples from
the report for the time being, until the right set of tools is applied?

> > It's a tool. It wasn't the tool that people objected to. It was the
> > analysis.
>
> I only object to the analysis inasmuch as it provides too many false
> positives.

And there is legitimate reason to question the tool itself, as pointed out
above. We have an entire raft of software engineers who devote their careers
to sieving context out of text; why not apply NLP to this problem set? We
will clearly not evict Helium from the table of elements, but we can identify
not only "he" as that particular guy, but also "he" as applied to a role or
position. (A rough sketch of what that could look like is at the end of this
message.)

But comparing the tool to the execution, the execution was pretty bad. I
don't mean the messaging to the projects, but the format of the report. It's
not reasonable to have to drill down to see the one-line context of a word
occurrence. This could have been a 30-second exercise for most projects:
simply look at the scan results and see each of the contextual examples. The
language code "he" would have been blatantly obvious, and those results could
have been skimmed through to the bottom of the report. Drill-down features
are great, but they should be reserved for gaining broader perspective, such
as an entire paragraph's worth of context; the single sentence, SGML/XML tag,
or line should be plainly visible.

Ignoring a "word" doesn't help; ignoring the context would be much more
helpful. I used
https://clc.diversity.apache.org/analysis.html?project=tomcat-training.git
to initially evaluate the tooling, but it sounds like there were more
frustrating examples than this one.
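For what it's worth, here is the kind of context check I have in mind, as a
rough sketch rather than anything that exists in the CLC tooling. It uses
spaCy's part-of-speech tagger; the model name, the helper function, and the
input file are assumptions of mine for illustration only.

    # Rough sketch, not part of the CLC scanner: use spaCy's part-of-speech
    # tagger to decide whether a flagged "he" is actually used as a pronoun
    # in running text, rather than as a language code, surname, or symbol.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed

    def used_as_pronoun(line: str, word: str = "he") -> bool:
        """Return True only if `word` occurs in `line` tagged as a pronoun."""
        doc = nlp(line)
        return any(tok.lower_ == word and tok.pos_ == "PRON" for tok in doc)

    # A scanner could run a check like this over each candidate line and
    # rank or skip occurrences that never show up as pronouns, instead of
    # reporting all 1800 of them at the same severity.
    with open("scan_candidates.txt", encoding="utf-8") as fh:  # hypothetical input
        for line in fh:
            if used_as_pronoun(line):
                print(line.rstrip())

A statistical tagger will still make mistakes on data files and markup, so
this is a ranking aid rather than a verdict; but even a blunt filter along
these lines would likely push the commons-csv test-data rows well down the
list while keeping the prose that genuinely needs a human look.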