On Tue, Aug 31, 2021 at 5:20 PM sebb <seb...@gmail.com> wrote:
>
> In the case of commons-csv, there were over 1800 reports of the use of
> 'he' in code.
> However these were all parts of a test data file, for example:
>
> ..|=he|פוזארוואץ|...
Not only is this a language code, it cannot usefully be flagged without some
contextual language mechanics. For example, "He" is one of the 20 most common
surnames in mainland China and common in many other countries. As the
preferred pronoun of many people, "he" is perfectly useful and must not be
substituted. The first definition here cannot be substituted; the second
certainly can be and should be considered:
https://www.merriam-webster.com/dictionary/he
As we have few chemistry-based projects, the chemical symbol isn't a problem
now, but it would be for such projects.

The word "master" is equally problematic, also citing Merriam-Webster:
https://www.merriam-webster.com/dictionary/master
A number of its uses are common in education and professional software
development: educational attainment, expertise or position, a specific job
title in various armed forces, a specific legal job title, and a reference to
original source material. These can't be substituted.

It would be enough to flag the terms 'slave' or 'slavery' as questionable.
There is an engineering phrase using both "master" and "slave", but it's hard
to find other reasonable uses for this word to describe processes, especially
new processes.

Natural language systems are designed to decode the meaning of ambiguous
words. I'm not sure this tool will be at all useful without that sort of
context, so it may be best to drop these excessively ambiguous examples from
the report for the time being, until the right set of tools is applied?

> > It's a tool. It wasn't the tool that people objected to. It was the
> > analysis.
>
> I only object to the analysis inasmuch as it provides too many false
> positives.

And there is legitimate reason to question the tool itself, as pointed out
above. We have an entire raft of software engineers who devote their careers
to sieving context out of text; why not apply NLP to this problem set? We
will clearly not evict Helium from the table of elements, but we can identify
not only "he" as that particular guy, but also "he" as applied to a role or
position. (A rough sketch of what that could look like is at the end of this
message.)

But comparing the tool to the execution, the execution was pretty bad. I
don't mean the messaging to the projects, but the format of the report. It's
not reasonable to have to drill down to see the one-line context of a word
occurrence. This could have been a 30-second exercise for most projects:
simply look at the scan results and see each of the contextual examples. The
language code "he" would have been blatantly obvious, and those results could
have been skimmed through to the bottom of the report. Drill-down features
are great, but they should be reserved for gaining broader perspective, such
as an entire paragraph's worth of context; the single sentence, SGML/XML tag,
or line should be plainly visible.

Ignoring a "word" doesn't help; ignoring the context would be much more
helpful. I used
https://clc.diversity.apache.org/analysis.html?project=tomcat-training.git
to initially evaluate the tooling, but it sounds like there were more
frustrating examples than this one.
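For what it's worth, here is the kind of context check I have in mind, as a
rough sketch rather than anything that exists in the CLC tooling. It uses
spaCy's part-of-speech tagger; the model name, the helper function, and the
input file are assumptions of mine for illustration only.

    # Rough sketch, not part of the CLC scanner: use spaCy's part-of-speech
    # tagger to decide whether a flagged "he" is actually used as a pronoun
    # in running text, rather than as a language code, surname, or symbol.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed

    def used_as_pronoun(line: str, word: str = "he") -> bool:
        """Return True only if `word` occurs in `line` tagged as a pronoun."""
        doc = nlp(line)
        return any(tok.lower_ == word and tok.pos_ == "PRON" for tok in doc)

    # A scanner could run a check like this over each candidate line and
    # rank or skip occurrences that never show up as pronouns, instead of
    # reporting all 1800 of them at the same severity.
    with open("scan_candidates.txt", encoding="utf-8") as fh:  # hypothetical input
        for line in fh:
            if used_as_pronoun(line):
                print(line.rstrip())

A statistical tagger will still make mistakes on data files and markup, so
this is a ranking aid rather than a verdict; but even a blunt filter along
these lines would likely push the commons-csv test-data rows well down the
list while keeping the prose that genuinely needs a human look.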