Concerning "The Camel has Two Humps",
on 24 Jun 2007, at 8:02 pm, Jens Bennedsen wrote:
Michael Caspersen and I have replicated the study with the result
"no correlation" - see
http://db.grinnell.edu/sigcse/iticse2007/Program/viewAcceptedProposal.asp?sessionType=paper&sessionNumber=51
It would be very pleasant if Dehnadi and Bornat had a result;
it would not be surprising if they didn't. I have no opinion
about whether they are right or wrong, but it is clear that
their work should be replicated.
It is also clear that Caspersen, Bennedsen, and Larsen *DIDN'T*
replicate Dehnadi and Bornat's study, and their paper has nothing
useful to tell us about anything much. Frankly, I was deeply
dismayed that ITiCSE 2007 saw fit to accept "Mental models and
Programming Aptitude", and that's the politest way I've been able
to express myself about it.
D&B claim:    measurement M1 applied to population P1 predicts
              result R1 after treatment T1 well.
C,B,&L claim: measurement M1 applied to population P2 does not
              predict result R2 after treatment T2;
              measurement M2 applied to population P2 does not
              predict result R3 after treatment T2.
* The populations are dramatically and materially different.
P1 is 61 students with NO PRIOR PROGRAMMING EXPERIENCE.
P2 is 55 students with no prior programming experience
AND 87 students with prior programming experience.
The C,B,&L paper fails to separate out the 55 possibly relevant
students from the 87 irrelevant students (irrelevant to the goal
of replication, that is). For example, table 1 presents only
pooled pass/fail results. If I am reading the paper correctly,
figure 2 does present results for all and only the relevant students,
but the predictor variable is not the predictor in D&B and the
response variable is not the response variable in D&B either, so
figure 2 doesn't really count as replication either.
From the numbers in the paper, it is IMPOSSIBLE to tell whether
D&B's claim is supported for the relevant population (the students
with no prior programming experience) or not.
* The treatments are dramatically and materially different.
T1 is a 12-week course designed for people with no prior programming
experience, which is such that there is about a 50% failure rate.
The Camel paper claims that 30%-60% is typical. It appears to be
focused on fundamental concepts of programming.
T2 is a 7-week course in which a majority (61%) of students have
prior experience, which is such that there is about a 4% failure
rate. Let me repeat that: FOUR PERCENT. The aim is for students
to know about "the role of conceptual modelling". The highest
the failure rate for the relevant students could possibly be is
6/55 or about 11%.
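For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The student counts and the 4% figure are the ones quoted above; the count of 6 failures is my inference from "about 4%" of 142 students, not a number reported in the paper, and the variable names are mine.

```python
# Figures quoted in this post; the failure count is inferred, not reported.
novices = 55                 # no prior programming experience
experienced = 87             # prior programming experience
total = novices + experienced

overall_failure_rate = 0.04  # "about 4%"

# Assumption: about 4% of all students failed, which rounds to 6 people.
failures = round(total * overall_failure_rate)

# Share of the class with prior experience (the "61%" above).
experienced_share = experienced / total

# Extreme cases for the novices, since the paper pools the results:
# every failure was a novice, or none of them was.
max_novice_rate = failures / novices
min_novice_rate = 0 / novices

print(f"total students:        {total}")
print(f"inferred failures:     {failures}")
print(f"experienced share:     {experienced_share:.0%}")
print(f"novice failure bounds: {min_novice_rate:.0%} to {max_novice_rate:.0%}")
```

The pooled figures only pin the novice failure rate down to somewhere between those two bounds, and even the upper bound is nowhere near 50%.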
The difference in failure rates is overwhelming. Clearly, SOMETHING
drastically different is happening here. If only the students with
prior experience were separated from the ones without in the reporting,
we might have some idea what. At least the following possibilities
exist:
  - The Aarhus course is not teaching the same thing as the
    Middlesex one, in which case it is unsurprising and uninformative
    that the ability to predict whether students are good at the
    Middlesex task should not transfer to the Aarhus task.
  - The Aarhus teachers are some of the very best CS teachers in the
    world, far far better than the people getting "30%-60%" rates.
    This may well be true.
  - The Aarhus students are some of the very best CS students in the
    world. This may well be true too. As we saw above, they are
    certainly different from the Middlesex ones, because for a clear
    majority of them this is NOT their first course. From the D&B
    paper, it seems likely that the Middlesex/Barnet students were,
    um, not the pick of the crop. That may be a factor too.
  - The Aarhus examination is much easier than the Middlesex
    examination. At this University, exam papers go into the library
    where anyone may inspect them, so I expect that both D&B and
    C,B,&L can and should make their examinations available for
    researchers to compare.
At any rate, never mind about measurement M: if Aarhus have a way to
teach CS1 that results in a 4% failure rate, I don't *CARE* about
anything else, I want to know how to do *THAT*. That is a far FAR
more important result than replicating D&B or refuting them.
There's another difference. D&B reported results based on 61 subjects
of whom 8% handed in a blank sheet. (A further 30 subjects refused to
take part on grounds not related to the study.) But C,B,&L report a
50% non-response rate (section 4.3). That is a huge non-response rate
for a study like this. We are not told how many in the non-response
group passed or how many failed; we could have been and we should have
been.
* Now we come to the statistics.
With such a small failure rate (4%)
it would have be