Hi all,

As I mentioned on the other thread where I asked about D syntax, I'm a social scientist about to launch some studies of the effects of PL syntax on learnability, motivation to pursue programming, and differential gender effects on these factors. This is a long post – some of you wanted to know more about my research goals and rationale, and I also said I would post separately on the gender issue, so here we go...

As you know, women are starkly underrepresented in software engineering roles. I'm interested in zooming back to the decisions people are making when they're 16 or 19 re: programming as a career. I'm interested in people's *first encounters* with programming, in high school or college, how men and women might differentially assess programming as a career option, and why.

Let me note a few things: Someone on the other thread thought that my hypothesis was that women don't become programmers because of the semicolons and curly braces in PL syntax. That's not one of my hypotheses. I do think PL syntax is a large problem, and I have some hypotheses about how it disproportionately deters qualified women, but the issues I see go much deeper than what I've called the "punctuation noise" of semicolons and curly braces. (I definitely don't have any hypotheses about female perceptions of the aesthetics of curly braces, which some posters had inferred – none of this is about female aesthetic preferences.)

Also, I don't think D is particularly problematic – it has cleaner and clearer syntax than its contemporaries (well, we'll need careful research to know if it truly is clearer to a targeted population). I plan to use D as a presumptive *clearer syntax* condition in some studies – we'll see how it goes. Lastly, I'm not approaching the gender issue from an ideological or PC Principal perspective. My work will focus mostly on cognitive science and pedagogical factors – as you'll see below, I'm interested in diversity issues from lots of angles, but I don't subscribe to the diversity ideology that is fashionable in American academia.

One D-specific question I do have: Have any women ever posted here? I scoured a bunch of threads here recently and couldn't find a female poster. By this I mean a poster whose supplied name was female, where a proper name was supplied (some people just have usernames). Of course we don't really know who is posting, and there could be some George Eliot situations, but the presence/absence of self-identified women is useful enough. Women are underrepresented in programming, but the skew in online programming communities is even more extreme – we're seeing near-zero percent in lots of boards. This is not a D-specific problem. Does anyone know of occasions where women posted here? Links?

Getting back to the research, recent studies have argued that one reason women are underrepresented in certain STEM fields is that smart women have more options than smart men. So think of the right tail of the bell curve, the men and women in that region on the relevant aptitudes for STEM fields. There's some evidence that smart women have a broader set of skills – *on average* – than equivalently smart men, perhaps including better social skills (or more interest in social interaction). This probably fits with stereotypes and intuitions a lot of people already held (lots of stereotypes are accurate, as probability distributions and so forth).

I'm interested in monocultures and diversity issues in a number of domains. I've done some recent work on the lack of philosophical and political diversity in social science, particularly in social psychology, and how this has undermined the quality and validity of our research (here's a recent paper by me and my colleagues in Behavioral and Brain Sciences: http://dx.doi.org/10.1017/S0140525X14000430). My interest in the lack of gender diversity in programming is an entirely different research area, but there isn't much rigorous social science and cognitive psychology research on this topic, which surprised me. I think it's an important and interesting issue. I also think a lot of the diversity efforts that are salient in tech right now are acting far too late in the cycle, sort of just waiting for women and minorities to show up. The skew starts long before people graduate with a CS degree, and I think Google, Microsoft, Apple, Facebook, et al. should think deeply about how programming language design might be contributing to these effects (especially before they roll out any more C-like programming languages).

Informally, I think what's happening in many cases is that when smart women are exposed to programming, it looks ridiculous and they think something like "Screw this – I'm going to med school", or any of a thousand permutations of that sentiment.

Mainstream PL syntax is extremely unintuitive and poorly designed by known pedagogical, epistemological, and communicative science standards. The vast majority of people who are introduced to programming do not pursue it (likely true of many fields, but programming may see a smaller grab than most – this point requires a lot more context). I'm open to the possibility that the need to master the bizarre syntax of incumbent programming languages might serve as a useful filter for qualities valuable in a programmer, but I'm not sure how good or precise the filter is.

Let me give you a sense of the sorts of issues I'm thinking of. Here is a C sample from ProgrammingSimplified.com. It finds the frequency of characters in a string:

int main()
{
   char string[100];
   int c = 0, count[26] = {0};

   printf("Enter a string\n");
   gets(string);

   while (string[c] != '\0')
   {
      /** Considering characters from 'a' to 'z' only
          and ignoring others */

      if (string[c] >= 'a' && string[c] <= 'z')
         count[string[c]-'a']++;

      c++;
   }

   for (c = 0; c < 26; c++)
   {
      /** Printing only those characters
          whose count is at least 1 */

      if (count[c] != 0)
         printf("%c occurs %d times in the entered string.\n", c+'a', count[c]);
   }

   return 0;
}

There's a lot going on here from a learning, cognitive science and linguistic encoding standpoint.

1. There's no clear distinction between types and names. It's just plain text run-on phrases like "char string". string is an unfortunate name here, and reminds us that this would be a type in many modern languages, but my point here is that there's nothing to visually distinguish types from names. I would make types parenthetical or use a hashtag, so: MyString (char) or MyString #char (and definitely with types at the end of the declaration, with names and values up front and uninterrupted by type names – I'll be testing my hunches here).

2. There's some stuff about an integer c that equals 0, then something called count – it's not clear if this is a type or a name, since it's all by itself and doesn't follow the pattern we saw with int main and char string. It also seems to equal zero. Things that equal zero are strange in this context, and we often see bizarre x = 0 statements in programming when we don't mean it to actually equal zero, or not for long, but PL syntax usually doesn't include an explicit concept of a *starting value*, even though that's what it often is. We see this further down in the for loop.

3. The word *print* is being used to mean display on the screen. That's odd. Actually, the non-word printf is being used. We'd probably want to just say: display "Enter a string"

4. We switch the person or voice from an imperative "do this" as in printf, to some sort of narrator third-person voice with "gets". Who are we talking to? Who are we talking about? Who is getting? The alignment is the same as printf, and there's not an apparent actor or procedure that we would be referring to. (Relatedly, the third-person puts command that is so common in Ruby always makes me think of Silence of the Lambs – "It puts the lotion on its skin"... Or more recently, the third-person style of the Faceless Men, "a girl has no name", etc.)

5. Punctuation characters that already have strong semantics in English are used in ways that are inconsistent with and unrelated to those semantics. For example, an exclamation mark is jarring next to an equals sign, and it's not clear why such syntax is desirable. The same goes for percent signs used to insert variables rather than to express a percentage. (I predict that the curly brace style of variable insertion in some HTML templating languages will be more intuitive for learners – they isolate the insertion, and don't have any conflicting semantics in normal English.)

I realize that some of this sprouted from the need to overload English punctuation in the ASCII-constrained computing world of the 1970s. The historical rationales for PL syntax decisions don't bear much on my research questions on learnability and the cognitive models people form when programming.

6. There are a bunch of semicolons and curly braces, and it's not clear why they're needed. Compilation will fail or the program will be broken if any of these characters are missing.

7. There are many other things going on here, lots of observations one could make from pedagogical, logical representation, and engineering standpoints.
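
For anyone who wants to compile the sample: as quoted it omits #include <stdio.h>, and gets was removed from the C standard in C11, so a current toolchain may reject it. Below is a minimal sketch of the same logic, not the site's code. The names count_letters, run_frequency_demo, and LETTERS are my own assumptions for illustration, and fgets is swapped in for gets. Like the original, it assumes the letters 'a' to 'z' are contiguous in the character set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LETTERS 26   /* hypothetical name, not in the quoted sample */

/* Count how often each lowercase letter 'a'..'z' occurs in text.
   Every other character is ignored, as in the quoted sample. */
void count_letters(const char *text, int count[LETTERS])
{
   assert(text != NULL && count != NULL);
   memset(count, 0, LETTERS * sizeof count[0]);   /* starting value: all zeros */

   for (size_t i = 0; text[i] != '\0'; i++)
   {
      if (text[i] >= 'a' && text[i] <= 'z')
         count[text[i] - 'a']++;
   }
}

/* Read one line from stdin with fgets (bounded, unlike gets),
   then print the letters that occur at least once.
   Returns 0 on success, 1 if no input could be read. */
int run_frequency_demo(void)
{
   char line[100];
   int count[LETTERS];

   printf("Enter a string\n");
   if (fgets(line, sizeof line, stdin) == NULL)
      return 1;

   count_letters(line, count);

   for (int c = 0; c < LETTERS; c++)
   {
      if (count[c] != 0)
         printf("%c occurs %d times in the entered string.\n", c + 'a', count[c]);
   }
   return 0;
}
```

A main that simply returns run_frequency_demo() gives the same interaction as the quoted program. Splitting the counting into its own function is only there so it can be checked without typing input. None of the syntax issues listed above go away.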


Now, there are some reasonable hypotheses having to do with programming/tech culture and its effects on gender diversity. I think some of those can intertwine with PL design issues. I also think there might be an issue with the quality and compellingness of today's computing platforms, and the perceived power of computers to do amazing and interesting things. I don't think the platforms people are introduced to in CS education are very good at generating excitement about what computers can do. It would be interesting to gauge what sorts of things people think they might be able to create, what sorts of problems they think they could solve, or new interfaces they could implement, after their introduction to programming. What horizons do they see? For example, there used to be a lot of excitement about what computers could do for education. Those visions have not materialized, and it's not clear that computing is doing anything non-trivial in education for reasoning ability, unlocking math aptitude, writing creativity, etc. It might actually be a net harm, with its effects on attention spans and language development, though this will be very complicated to assess.

Mobile has reinvigorated some idealism and creativity about computing. But the platforms people are introduced to or forced to use when learning programming are not mobile platforms, since you can't build complex applications on the devices themselves. Unix and Linux are extremely popular in CS, but are terrible examples for blue sky thinking about computing. Forcing people to learn Vim or Emacs, grep, and poorly designed command line interfaces that dump a bunch of unformatted text at you is a disastrous decision from a pedagogical standpoint. (See the BlueJ project for an effort to do something about this.) These tools do nothing to illustrate what new and exciting things you could build with computers, and they seem to mold students into a rigid, conformist *nix, git, and markdown monoculture where computing is reduced to bizarre manipulations of ASCII text on a black 1980s DOS-like screen, and to constantly fiddling with and repairing one's operating system just to be able to continue to work on this DOS-like screen (Unix/Linux requires a lot of maintenance and troubleshooting overhead, especially for beginners – if they also have to do this while learning programming, then programming itself could be associated with a life of neverending, maddening hassles and frustrations). The debugging experience on Unix/Linux will be painful. From a pedagogical standpoint, this situation looks like a doomsday scenario, the worst CS education approach we could devise.

The nuisance/hassle overhead of programming is probably worth a few studies in conjunction with my studies on syntax, and I'd guess the issues are related – the chance of success in programming, in getting a simple program to just work, is pretty low. It's not clear that it *needs* to be so low, and I want to isolate any platform/toolchain factors from any PL syntax factors. (The factors may not exist – I could be wrong across the board.)

That's all I've got for now. This isn't as well-organized as I'd like, but I wanted to get something out now or I'd likely let it slip for weeks.
