Re: CLC, inclusive language, and Apache

Rich Bowen Fri, 03 Sep 2021 04:42:39 -0700

I guess, then, it's a good thing that we're software engineers, and, therefore, don't shy away from problems just because they're hard.

FWIW, no, it's not hard to find alternatives to "Slave" in software. For example, the Linux Kernel folks came up with quite a list of them: https://www.zdnet.com/article/linux-team-approves-new-terminology-bans-terms-like-blacklist-and-slave/

I have been using this tool (and another very like it) for most of this year, in addition to the other natural language processing tool I possess (my brain) to work with projects to address conscious language choices, as part of my day job. Language processing can indeed be challenging. This kind of software aids in the process, but, unless we want to implement an entire natural language parsing tool (which, I suppose, is an option) this one augments my wetware one.

It seems, though, that you question the very notion of striving to avoid historically problematic words/phrases in our projects. And, of course, that's a position that you're welcome to take.

But your criticism of the tool seems to rather go against the long-standing Apache tradition of *improving* software rather than simply writing it off as "bad". After all these years, Stefano still says it best: s.apache.org/hZ


On 9/2/21 9:34 PM, William A Rowe Jr wrote:

On Tue, Aug 31, 2021 at 5:20 PM sebb <seb...@gmail.com> wrote:


In the case of commons-csv, there were over 1800 reports of the use of
'he' in code.
However these were all parts of a test data file, for example:

..|=he|פוזארוואץ|...


Not only is this a language code, it cannot usefully be flagged without some
contextual language mechanics. For example, "He" is one of the 20 most common
surnames in mainland China and common in many other countries. As the
preferred pronoun of many people, "he" is perfectly useful and must not be
substituted. The first definition here cannot be substituted, the second
certainly can be and should be considered;
https://www.merriam-webster.com/dictionary/he
As we have few chemistry-based projects, the chemical symbol isn't a problem
now, but would be for such projects.

The word "master" is equally problematic, also citing m-w from
https://www.merriam-webster.com/dictionary/master
A number of these uses are common among education and professional software
development related to educational attainment, expertise or position, a specific
job title as in various armed forces, a specific legal job title, and as a
reference of origin source material. These can't be substituted.

It would be enough to flag the term 'slave' or 'slavery' as questionable.
There is an engineering phrase using both, but it's hard to find other
reasonable uses for this word to describe processes, especially new processes.

Natural language systems are designed to code meanings out of ambiguous
words. I'm not sure this tool will be at all useful without that sort of
context, so it's maybe best to drop these excessively ambiguous examples from
the report for the time being, until the right set of tools is applied?

It's a tool. It wasn't the tool that people objected to. It was the
analysis.


I only object to the analysis inasmuch as it provides too many false positives.


And there is legitimate reason to question the tool itself, as pointed
out above.
We have an entire raft of software engineers who devote their careers to
sieving context out of text, why not apply NLP to this problem set? We will
clearly not evict Helium from the table of elements but can identify not only
"he" as that particular guy, but "he" as applied to a role or position.

But comparing the tool to the execution, it was pretty bad. I don't mean the
messaging to the projects, but the format of the report. It's not reasonable to
have to drill down to see the one-line context of a word occurrence. This could
have been a 30 second exercise by most projects to simply look at their scan
results and see each of the contextual examples. The language code "he" would
have been blatantly obvious and those results skimmed through to the bottom
of the report. Drill down features are great, but they should be reserved for
gaining broader perspective, an entire paragraph worth of context, but the
single sentence or sgml/xml tag or line should be plainly visible.

Ignoring a "word" doesn't help. Ignoring the context will be much more helpful.
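
To make the report-format point concrete, here is a minimal sketch (not the
CLC tool's actual code) of a scan that lists every hit next to its full
one-line context. The term list, the function names and the sample data are
all assumptions made up for illustration.

```python
import re

# Hypothetical term list; the real CLC configuration is not shown in this thread.
FLAGGED_TERMS = ("he", "slave")


def scan_lines(lines, terms=FLAGGED_TERMS):
    """Yield (line_number, term, context_line) for each whole-word match."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    for number, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            yield number, match.group(1).lower(), line.strip()


def format_report(hits):
    """Render one row per hit, keeping the whole source line visible."""
    return "\n".join(
        f"{number:>5}  {term:<8} {context}" for number, term, context in hits
    )


# Made-up sample input: a delimited test-data row and a prose line.
sample = ["..|=he|x|...", "the slave node replicates"]
print(format_report(scan_lines(sample)))
```

With the context printed inline, a row of delimited test data is recognisable
as a language code at a glance, with no drill-down needed. Deciding which hits
to suppress automatically would still need the contextual analysis discussed
above; this sketch only addresses how the results are displayed.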

I used
https://clc.diversity.apache.org/analysis.html?project=tomcat-training.git
for initially evaluating the tooling but it sounds like there were more
frustrating examples than this.


--
Rich Bowen
rbo...@rcbowen.com
