I'm looking for a corpus/collection of Java source code. The corpus
This is one of the better ones:
should comprise multiple projects that come with JUnit test cases that
pass and have good test coverage.
This is the flying pig part of your request.
Wouldn't it be possible in theory?
I want to test a new programming construct that is supposed to shorten
programs without making them harder to understand. In the first instance
How do you plan to measure understanding?
That requires some info on the programming construct: I'm adding
indirect anaphora to an extension of Java. Anaphora is a backward
relation to a referent previously mentioned in the text, e.g. "He" in
"James Gosling invented Java. He does not work for Sun anymore."
Indirect anaphora is a backward relation to a referent that has not yet
been mentioned in the text but is related to a previously mentioned
referent. The relation can be a semantic or a conceptual one. In "An
if-then-statement is executed by first evaluating the Expression.", "the
Expression" is an indirect anaphor that refers to the expression that is
part of an if-then-statement. The semantic information that
if-then-statements contain expressions is used to resolve the indirect
anaphor.
I used an account of indirect anaphora resolution from cognitive
linguistics as a kind of blueprint for implementing indirect anaphora
in an extension of Java. The underlying assumption is that the so-called
text world model used in the cognitive account to resolve an indirect
anaphor is equivalent to an AST constructed by a Java compiler. Also,
conceptual schemata are assumed to be similar to class declarations, e.g.
with respect to the part-whole relations that both specify. Since
cognitive linguistics describes text understanding as the construction
of a text world model and I treat the AST as if it were a text world
model, one way
to measure understanding would then be to measure how many
nodes/relations the compiler creates in the AST.
That is, if a compiler is constructed according to a cognitive theory of
text understanding, and both the implementation and the theory match
human performance, then source code that the compiler processes without
error will also be understood by a programmer.
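To make the "count the AST nodes" idea concrete, here is a rough sketch that parses a source string with the standard JDK compiler tree API and counts every tree node visited. It works on plain Java, not on the extension with indirect anaphora, and it only parses (no attribution), so it says nothing about relations added by later compiler phases. The names AstNodeCounter, StringSource and countNodes are made up for this illustration, and it assumes it runs on a JDK (not a bare JRE) so that the system compiler is available.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.Tree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical helper: counts the tree nodes javac creates when parsing a source string.
class AstNodeCounter {

    // In-memory source file so that no files need to be written to disk.
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // Every non-null tree passed to scan() counts as one node.
    static final class Counter extends TreeScanner<Void, Void> {
        int nodes = 0;

        @Override
        public Void scan(Tree tree, Void unused) {
            if (tree != null) {
                nodes++;
            }
            return super.scan(tree, unused);
        }
    }

    static int countNodes(String className, String source) {
        // Assumption: a JDK is present; on a bare JRE this returns null.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available");
        }
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null,
                List.of(new StringSource(className, source)));
        Counter counter = new Counter();
        try {
            // parse() only builds the syntax trees; no type attribution is done.
            for (CompilationUnitTree unit : task.parse()) {
                counter.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.nodes;
    }

    public static void main(String[] args) {
        String source = "class A { int f(int x) { return x + 1; } }";
        System.out.println("AST nodes: " + countNodes("A", source));
    }
}
```

Comparing the count for a program written with and without the new construct would give one number per variant. Whether a raw node count is a fair proxy for the text world model is exactly the assumption under discussion.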
To find out whether the compiler implementation matches both the
theory and the way humans understand text/source code, a controlled
experiment could be used. IDEs provide functions like "go to
declaration" to allow a programmer to get more info on a program
element. One could count how often a programmer uses such functions for
indirect anaphors, i.e. how often a programmer asks the IDE to present
the referent of an indirect anaphor because he is not able to resolve it
himself. The more often a programmer asks for the resolution of a
referent, the lower his understanding of indirect anaphors in source code.
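The lookup-count measure could be tallied along the following lines. This is only a sketch of the bookkeeping: how the IDE events are captured is left open, and the names AnaphorLookupLog, Lookup and lookupRates are invented for this example. It assumes every participant works on the same task with a known number of indirect anaphors.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping for the experiment: one entry per "go to declaration" use.
class AnaphorLookupLog {

    static final class Lookup {
        final String participant;
        final boolean onIndirectAnaphor; // true if the lookup target was an indirect anaphor

        Lookup(String participant, boolean onIndirectAnaphor) {
            this.participant = participant;
            this.onIndirectAnaphor = onIndirectAnaphor;
        }
    }

    // Returns, per participant, the number of lookups on indirect anaphors
    // divided by the number of indirect anaphors in the task.
    // A higher rate would indicate lower understanding under the stated hypothesis.
    static Map<String, Double> lookupRates(List<Lookup> lookups, int anaphorsInTask) {
        if (anaphorsInTask <= 0) {
            throw new IllegalArgumentException("anaphorsInTask must be positive");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Lookup lookup : lookups) {
            // Participants who only looked up other elements still appear, with count 0.
            counts.merge(lookup.participant, lookup.onIndirectAnaphor ? 1 : 0, Integer::sum);
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            rates.put(entry.getKey(), entry.getValue() / (double) anaphorsInTask);
        }
        return rates;
    }
}
```

Participants who never use the lookup function at all do not show up in the log, so they would have to be added with a rate of zero from the participant list.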