Concerning "The Camel has Two Humps",
on 24 Jun 2007, at 8:02 pm, Jens Bennedsen wrote:
Michael Caspersen and I have replicated the study with the result "no correlation" - see http://db.grinnell.edu/sigcse/iticse2007/Program/viewAcceptedProposal.asp?sessionType=paper&sessionNumber=51

It would be very pleasant if Dehnadi and Bornat had a result;
it would not be surprising if they didn't.  I have no opinion
about whether they are right or wrong, but it is clear that
their work should be replicated.

It is also clear that Caspersen, Bennedsen, and Larsen *DIDN'T*
replicate Dehnadi and Bornat's study, and their paper has nothing
useful to tell us about anything much.  Frankly, I was deeply
dismayed that ITiCSE 2007 saw fit to accept "Mental models and
Programming Aptitude", and that's the politest way I've been able
to express myself about it.

D&B claim   Measurement M1 applied to population P1 predicts
                result R1 after treatment T1 well.

C,B,&L claim        Measurement M1 applied to population P2 does not
                predict result R2 after treatment T2;
                measurement M2 applied to population P2 does not
                predict result R3 after treatment T2.

* The populations are dramatically and materially different.

        P1 is 61 students with NO PRIOR PROGRAMMING EXPERIENCE.

        P2 is 55 students with no prior programming experience
          AND 87 students with prior programming experience.

The C,B,&L paper fails to separate out the 55 possibly relevant
students from the 87 irrelevant students (irrelevant to the goal
of replication, that is).  For example, table 1 presents only
pooled pass/fail results.  If I am reading the paper correctly,
figure 2 does present results for all and only the relevant students,
but the predictor variable is not the predictor in D&B and the
response variable is not the response variable in D&B either, so
figure 2 doesn't really count as replication.

From the numbers in the paper, it is IMPOSSIBLE to tell whether
D&B's claim is supported for the relevant population (the students
with no prior programming experience) or not.

* The treatments are dramatically and materially different.

T1 is a 12-week course designed for people with no prior programming
experience, which is such that there is about a 50% failure rate.
The Camel paper claims that 30%-60% is typical.  It appears to be
focused on fundamental concepts of programming.

T2 is a 7-week course in which a majority (61%) of students have
prior experience, which is such that there is about a 4% failure
rate.  Let me repeat that:  FOUR PERCENT.  The aim is for students
to know about "the role of conceptual modelling".  A 4% failure rate
on 142 students means roughly 6 failures, so even if every one of
them were a student with no prior experience, the failure rate for
the relevant students could be at most 6/55, or about 11%.

The difference in failure rates is overwhelming.  Clearly, SOMETHING
drastically different is happening here.  If only the students with
prior experience were separated from the ones without in the reporting,
we might have some idea what.  At least the following possibilities
exist:
     The Aarhus course is not teaching the same thing as the
     Middlesex one, in which case it is unsurprising and uninformative
     that the ability to predict whether students are good at the
     Middlesex task should not transfer to the Aarhus task.

     The Aarhus teachers are some of the very best CS teachers in the
     world, far far better than the people getting "30%-60%" rates.
     This may well be true.

     The Aarhus students are some of the very best CS students in the
     world.  This may well be true too.  As we saw above, they are
     certainly different from the Middlesex ones, because for a clear
     majority of them this is NOT their first course.  From the D&B
     paper, it seems likely that the Middlesex/Barnet students were,
     um, not the pick of the crop.  That may be a factor too.

     The Aarhus examination is much easier than the Middlesex
     examination.  At this University, exam papers go into the library
     where anyone may inspect them, so I expect that both D&B and
     C,B,&L can and should make their examinations available for
     researchers to compare.

At any rate, never mind about measurement M, if Aarhus have a way to
teach CS1 that results in a 4% failure rate, I don't *CARE* about
anything else, I want to know how to do *THAT*.  That is a far FAR
more important result than replicating D&B or refuting them.

There's another difference.  D&B reported results based on 61 subjects
of whom 8% handed in a blank sheet.  (A further 30 subjects refused to
take part on grounds not related to the study.)  But C,B,&L report a
50% non-response rate (section 4.3).  That is a huge non-response rate
for a study like this.  We are not told how many in the non-response
group passed or how many failed; we could have been and we should have
been.

* Now we come to the statistics.

With such a small failure rate (4%)
it would have been extraordinarily difficult to demonstrate any
ability of M1 to predict R2.  In fact, once the small failure rate
had been discovered, there wasn't really any point in proceeding
further; they had failed to replicate D&B's setup closely enough
and there's an end to it.
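To put a rough number on that difficulty, here is a back-of-the-envelope
simulation.  This is my own sketch with invented parameters, not anything
from either paper: I assume a hypothetical screening test that flags 30%
of students and catches each eventual failure with probability 0.8, which
is a generously strong predictor.  With roughly 142 students and 6
failures, a one-sided Fisher exact test still misses significance in a
large share of runs.

```python
# Back-of-the-envelope power sketch (invented predictor, not either
# paper's data): with only ~6 failures among ~142 students, even a
# strong predictor often fails to reach p < 0.05.
import math
import random

def fisher_exact_p(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]],
    testing enrichment of the first column in the first row."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += (math.comb(col1, x) * math.comb(n - col1, row1 - x)
              / math.comb(n, row1))
    return p

random.seed(1)
n, failures = 142, 6        # roughly the C,B,&L class size and failure count
trials, hits = 2000, 0
for _ in range(trials):
    # Hypothetical predictor: flags each failing student with
    # probability 0.8 and each passing student with probability 0.3.
    a = sum(random.random() < 0.8 for _ in range(failures))       # flagged, fail
    b = sum(random.random() < 0.3 for _ in range(n - failures))   # flagged, pass
    c, d = failures - a, (n - failures) - b                        # not flagged
    if fisher_exact_p(a, b, c, d) < 0.05:
        hits += 1
print(f"estimated power: {hits / trials:.2f}")
```

Whatever the exact figure, the point stands: with so few failures the
study could hardly have detected a real predictive effect even if one
were there.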

Section 4.4 describes a discrete ordinal variable C with 6 levels.
Section 4.4 describes a discrete ordinal variable G with 5 levels.
These are not the variables that D&B used; the Camel paper makes
no claim about C and G as such.
Section 5.2 reports on a Pearson correlation (which is appropriate
for continuous variables with a Gaussian distribution) between C
and G.  From a statistical point of view, this is more than a little
dubious, and I would not expect the resulting number to mean anything.
I would at least expect the 6x5 table to be displayed so that readers
could compute meaningful statistics for themselves.
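For what it is worth, here is the kind of statistic readers could compute
if the 6x5 table were displayed: Spearman's rho, which respects the
ordinal character of both variables where Pearson's r does not.  The
table below is invented for illustration; it is not the paper's data.

```python
# Spearman's rho computed directly from a contingency table of two
# ordinal variables.  The example table is made up, NOT the C,B,&L data.
def midranks(values):
    """Average (mid) ranks, 1-based, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_from_table(table):
    """Spearman's rho for a cross-table: rows and columns are the ordered
    levels of the two variables, cells are counts."""
    xs, ys = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            xs.extend([i] * count)
            ys.extend([j] * count)
    rx, ry = midranks(xs), midranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

table = [  # invented 6x5 counts: rows = C level, columns = G level
    [4, 2, 1, 0, 0],
    [2, 3, 2, 1, 0],
    [1, 2, 3, 2, 1],
    [0, 1, 2, 3, 2],
    [0, 0, 1, 3, 3],
    [0, 0, 0, 2, 4],
]
print(f"Spearman rho = {spearman_from_table(table):.2f}")
```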

Section 5.2 describes a count variable C with values 0..12.
Section 5.2 describes a discretised time variable G with range 10..40
minutes.  It is rather confusing that these variables have the same
names as the ones in section 4.4, because section 4.4's G and section
5.2's G differ not just in number of levels but in what kind of
measurement is involved.  Figure 2 plots these variables.  It appears
that G is right-censored; 4 students hit a deadline without completing
the task, and so presumably failed.  If this interpretation is right,
then the failure rate for no-prior-programming-experience students
was 4/55 = about 7%, radically different from the Middlesex/Barnet 50%.
It's *almost* sensible to compute a correlation here, although I for
one would not be happy to do so without taking the right-censoring into
account.
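To illustrate why the censoring matters, here is a toy simulation of my
own, with made-up numbers rather than the paper's data: clipping
completion times at a deadline flattens the slow end of the scatter and
attenuates the Pearson coefficient relative to the uncensored truth.

```python
# Toy illustration (invented data, not the paper's): right-censoring
# completion times at a deadline shrinks the apparent Pearson correlation.
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
n = 500
scores = [random.uniform(0, 12) for _ in range(n)]            # pretend score C
times = [45 - 2 * s + random.gauss(0, 6) for s in scores]     # true minutes
observed = [min(t, 40) for t in times]                        # 40-minute deadline
print(f"true r = {pearson(scores, times):.2f}, "
      f"censored r = {pearson(scores, observed):.2f}")
```

In this toy setup the censored coefficient is closer to zero than the
true one, which is exactly the concern about the figure 2 correlation.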

In any case, figure 2 and the associated correlation coefficient have
no real relevance to an assessment of D&B's claims, because D&B don't
*make* any explicit or implicit claims about the time it takes someone
to complete the exam.

* Questioning the validity of questioning the validity

Section 6.2 of C,B,&L says that they interviewed "the 14 students who
were inconsistent but did pass the final exam."  This is a bit
worrying, because table 1 said there were 16 such students, not 14.
"Our harsh conclusion" is wholly unwarranted, because C,B,&L did not
interview the students who were *consistent* and passed; they merely
assert without any evidence at all that those students too were
guessing once up front and were merely lucky in their guess.  But it
is possible that the 'consistent' students acted like someone solving
a crossword puzzle:  you might think you know a word, but you don't
actually write it down until you have checked it against several of
the clues that cross it.  Maybe the 'consistent' students were looking
ahead to check their guess before writing any answers down.  Of
course it is possible that they *were* just lucky, but because they
weren't asked, we shall never know.

It would be useful if D&B could repeat their study and conduct
post-test interviews with students in all groups to find out what
strategy they were following.  It might be, for example, that the
really predictive thing is "look-ahead" -vs- "go back and fix up" -vs-
"dive in and never admit a fault".

All in all, I conclude that D&B's claim has yet to be seriously
challenged.


----------------------------------------------------------------------
PPIG Discuss List (discuss@ppig.org)
Discuss admin: http://limitlessmail.net/mailman/listinfo/discuss
Announce admin: http://limitlessmail.net/mailman/listinfo/announce
PPIG Discuss archive: http://www.mail-archive.com/discuss%40ppig.org/
