I'm looking for a corpus/collection of Java source code. The corpus

This is one of the better ones:

Thanks, Derek!

should comprise multiple projects that come with JUnit test cases that
pass and have good test coverage.

This is the flying pig part of your request.

Wouldn't it be possible in theory?

I want to test a new programming construct that is supposed to shorten
programs without making them harder to understand. In the first instance

How do you plan to measure understanding?

That requires some info on the programming construct: I'm adding indirect anaphora to an extension of Java. Anaphora is a backward relation to a referent previously mentioned in the text, e.g. "He" in "James Gosling invented Java. He does not work for Sun anymore." Indirect anaphora is a backward relation to a referent that has not yet been mentioned in the text but is related to a previously mentioned referent. The relation can be a semantic or a conceptual one. In "An if-then-statement is executed by first evaluating the Expression.", "the Expression" is an indirect anaphor that refers to the expression that is part of an if-then-statement. The semantic information that if-then-statements contain expressions is used to resolve the indirect anaphor.
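To make the resolution step concrete, here is a minimal sketch in plain Java. It is not the syntax or implementation of the language extension; the class name, the PARTS table and the "referent.part" result format are all assumptions made up for illustration. It only shows the idea from the example above: "the Expression" is resolved by looking for the most recently mentioned referent that is known to contain an expression.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class IndirectAnaphorSketch {
    // Hypothetical part-whole knowledge: each concept lists the kinds of parts it contains.
    static final Map<String, List<String>> PARTS = Map.of(
        "IfThenStatement", List.of("Expression", "Statement"),
        "MethodDeclaration", List.of("Name", "Block"));

    // Referents mentioned so far, most recent first.
    private final Deque<String> mentioned = new ArrayDeque<>();

    public void mention(String concept) {
        mentioned.push(concept);
    }

    // Resolve "the <part>" to the most recently mentioned referent that has such a part.
    public Optional<String> resolve(String part) {
        for (String referent : mentioned) {
            if (PARTS.getOrDefault(referent, List.of()).contains(part)) {
                return Optional.of(referent + "." + part);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        IndirectAnaphorSketch text = new IndirectAnaphorSketch();
        text.mention("IfThenStatement");
        // "the Expression" has no direct antecedent, but an if-then statement has one as a part.
        System.out.println(text.resolve("Expression").orElse("unresolved"));
    }
}
```

Preferring the most recent matching referent is just one possible policy; a real resolver would also have to deal with ambiguity and scoping.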

I used an account of indirect anaphora resolution from cognitive linguistics as a kind of blueprint for implementing indirect anaphora in an extension of Java. The underlying assumption is that the so-called text world model used in the cognitive account to resolve an indirect anaphor is equivalent to an AST constructed by a Java compiler. Also, conceptual schemata are assumed to be similar to class declarations, e.g. with respect to the part-whole relations that both specify. Since cognitive linguistics describes text understanding as the construction of a text world model, and I treat the AST as if it were a text world model, one way to measure understanding would be to count how many nodes/relations the compiler creates in the AST.
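As a rough sketch of what counting nodes/relations could look like, the following uses a toy tree instead of a real compiler's AST. The Node class and the node kinds are assumptions for illustration only; with a real compiler one would walk its own tree types instead.

```java
import java.util.List;

public class AstSizeSketch {
    // Toy AST node; a real compiler's node types would replace this.
    static final class Node {
        final String kind;
        final List<Node> children;

        Node(String kind, Node... children) {
            this.kind = kind;
            this.children = List.of(children);
        }
    }

    // Number of nodes in the subtree rooted at n, including n itself.
    static int countNodes(Node n) {
        int total = 1;
        for (Node child : n.children) {
            total += countNodes(child);
        }
        return total;
    }

    // Number of parent-child relations; in a tree this is always nodes - 1.
    static int countRelations(Node n) {
        return countNodes(n) - 1;
    }

    public static void main(String[] args) {
        Node ifThen = new Node("IfThenStatement",
            new Node("Expression", new Node("Identifier")),
            new Node("Statement"));
        System.out.println(countNodes(ifThen) + " nodes, "
            + countRelations(ifThen) + " relations");
    }
}
```

Only parent-child edges are counted here. Relations added by anaphora resolution (e.g. a link from an anaphor to its referent) would have to be counted separately if they are not stored as child nodes.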

I.e. if a compiler is constructed according to a cognitive theory of text understanding, and both the implementation and the theory match human performance, then source code that the compiler processes without error will also be understood by a programmer.

To figure out whether the implementation of the compiler matches the theory as well as how humans understand text/source code, a controlled experiment could be used. IDEs provide functions like "go to declaration" to allow a programmer to get more info on a program element. One could count how often a programmer uses such functions for indirect anaphors, i.e. how often a programmer asks the IDE to present the referent of an indirect anaphor because he is not able to resolve it himself. The more often a programmer asks for the resolution of a referent, the lower his understanding of indirect anaphors in source code.
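The counting itself could be as simple as the sketch below. How the events would actually be captured from an IDE is left open; the class and method names are hypothetical and do not correspond to any IDE API. The measure is the number of "go to declaration" requests on indirect anaphors divided by the number of indirect anaphors the programmer was shown.

```java
public class LookupRateSketch {
    private int anaphorsShown;
    private int lookups;

    // Call once for every indirect anaphor in the code presented to the participant.
    public void anaphorShown() {
        anaphorsShown++;
    }

    // Call whenever the participant asks the IDE for the referent of an indirect anaphor.
    public void lookupRequested() {
        lookups++;
    }

    // Lookups per anaphor shown; 0.0 if no anaphor was shown yet.
    public double lookupRate() {
        return anaphorsShown == 0 ? 0.0 : (double) lookups / anaphorsShown;
    }

    public static void main(String[] args) {
        LookupRateSketch session = new LookupRateSketch();
        for (int i = 0; i < 4; i++) {
            session.anaphorShown();
        }
        session.lookupRequested();
        System.out.println("lookup rate: " + session.lookupRate());
    }
}
```

A higher rate would then be read as lower understanding, under the assumption that a lookup really means the programmer could not resolve the anaphor unaided.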

