Next year one of my R programming assignments will read like this:

"Here is some R code written by a multi-million dollar AI system to
compute [something]. It doesn't work. Fix the bugs, then ask the AI to
write a letter to its creators apologising for how rubbish it is at
coding. Collect one million dollars."


On Sun, Dec 18, 2022 at 7:01 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
>
> Technically not a help question. But crucial to be aware of, especially for 
> those of us in academia, or otherwise teaching R. I am not aware of a 
> suitable alternate forum. If this does not interest you, please simply ignore 
> - I already know that this may be somewhat OT.
>
> Thanks.
> ------------------------------------------------------
>
> You very likely have heard of ChatGPT, the conversation interface on top of 
> the GPT-3 large language model, and that it can generate code. I thought it 
> didn't do R - I was wrong. Here is a little experiment:
>
> MY PROMPT:
> ==========
> Please write me some R code that will read a tsv file, remove all rows that 
> have duplicates in the column "names", and store the median values of the 
> remaining rows in a variable called "Medians". Of course, that calculation 
> should only be done with columns that are numeric. Print the last variable in 
> order from largest to smallest.
>
>
> AI RESPONSE:
> ============
>
> Here is some R code that should accomplish what you've described:
>
> (Note: the code is formatted in a little terminal window, complete with a 
> button to "copy code" at the top.)
>
> ----
> # Read the tsv file
> data <- read.table("file.tsv", header = TRUE, sep = "\t")
>
> # Remove rows with duplicates in the "names" column
> data <- data[!duplicated(data$names), ]
>
> # Select only the numeric columns
> num_columns <- sapply(data, is.numeric)
> numeric_data <- data[, num_columns]
>
> # Calculate the medians for each column
> Medians <- apply(numeric_data, 1, median)
>
> # Sort Medians in descending order
> Medians <- sort(Medians, decreasing = TRUE)
>
> # Print Medians
> print(Medians)
> ----
>
> This code assumes that the tsv file is called "file.tsv" and is stored in the 
> current working directory. It also assumes that the first row of the file 
> contains the column names.
>
>
>
> MY PROMPT:
> ==========
> Can you provide me with a bit of sample data that I can use to test this code?
>
> AI RESPONSE:
> ============
> Certainly! Here is some sample data that you can use to test the code I 
> provided:
>
> ----
> name    age     height  weight
> Alice   23      5.6     120
> Bob     32      6.2     180
> Charlie 29      5.9     160
> Alice   23      5.6     120
> David   25      5.8     140
> Eve     27      5.5     110
> ----
>
> You can save this data as a tsv file and use it to test the code.
>
>
> ------------------------------------------------------
>
> Notes:
> 1) The code, as written, does not work with the test data. The reason is that 
> the test data specifies a column "name", but the code assumes "names" (which 
> I specified in the instructions). Once this is fixed in the test data, the 
> code executes.
>
> 2) The instructions are not accurately translated: I had asked for the median 
> of each numeric column, but apply() with margin 1 returns one median per row. 
> Changing the margin in the apply() statement from 1 to 2 does the correct 
> thing.
>
> 3) Note that the test data contains both numeric and non-numeric columns. 
> Also, the name "Alice" is duplicated in the test data, which is subtle, and 
> the right thing to do.
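
Notes 1 and 2 can be checked without a file on disk. The following is a minimal sketch, assuming the sample data above is typed in as a data frame with the first column already renamed to "names"; row_medians and col_medians are illustrative names that do not appear in the AI's code.

```r
# Sample data from the thread, with the first column renamed to "names"
# (assumption: built inline instead of being read from file.tsv).
data <- data.frame(
  names  = c("Alice", "Bob", "Charlie", "Alice", "David", "Eve"),
  age    = c(23, 32, 29, 23, 25, 27),
  height = c(5.6, 6.2, 5.9, 5.6, 5.8, 5.5),
  weight = c(120, 180, 160, 120, 140, 110)
)

# Keep the first occurrence of each name, as in the AI's code
data <- data[!duplicated(data$names), ]

# Keep only the numeric columns; drop = FALSE guards the one-column case
numeric_data <- data[, sapply(data, is.numeric), drop = FALSE]

# MARGIN = 1 applies median() across each row: one value per person
row_medians <- apply(numeric_data, 1, median)

# MARGIN = 2 applies median() down each column: one value per variable
col_medians <- apply(numeric_data, 2, median)

Medians <- sort(col_medians, decreasing = TRUE)
print(Medians)  # weight 140, age 27, height 5.8
```

With margin 1 the result has five entries (one per remaining row), which is the quickest visible sign that the wrong margin was used.
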
>
>
> ------------------------------------------------------
>
> A similar experiment, run on the beta-site of OpenAI's Codex code-completion 
> tool gives the following code (complete with the comments.):
>
> ----
> # Read in the data
> data <- read.table("mydata.tsv", sep = "\t", header = TRUE)
>
> # Remove all rows that have duplicates in the column "names"
> unique_names <- unique(data$names)
> data_unique <- data[data$names %in% unique_names,]
>
> # Calculate medians
> Medians <- sapply(data_unique[, sapply(data_unique, is.numeric)], median, 
> na.rm = TRUE)
>
> # Print last variable in order form largest to smallest
> rev(sort(Medians[length(Medians)]))
> ----
>
> Note that the strategy is quite different (e.g. using %in%, not duplicated()), 
> and the interpretation of "last variable" is technically correct but not what 
> I had in mind (ChatGPT got that right though).
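
One thing worth checking by hand is what the %in% filter actually selects. A small sketch, assuming the "names" column from the sample data; keep_in, keep_first and keep_strict are hypothetical names used only for this illustration. The third mask covers the stricter reading of "remove all rows that have duplicates", which neither tool was necessarily asked for.

```r
names_col <- c("Alice", "Bob", "Charlie", "Alice", "David", "Eve")

# Every element of a vector is contained in its own set of unique values,
# so this mask is TRUE everywhere and no row is removed.
keep_in <- names_col %in% unique(names_col)

# duplicated() flags the second and later occurrences;
# negating it keeps the first row for each name.
keep_first <- !duplicated(names_col)

# Stricter reading: drop every row whose name occurs more than once.
keep_strict <- !(duplicated(names_col) | duplicated(names_col, fromLast = TRUE))

sum(keep_in)      # 6 rows kept
sum(keep_first)   # 5 rows kept
sum(keep_strict)  # 4 rows kept
```

So on this data the %in% version leaves both "Alice" rows in place, and the medians are then computed over six rows rather than five.
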
>
>
> Changing my prompts slightly resulted in it going for a dplyr solution instead, 
> complete with %>% idioms etc ... again, syntactically correct but not giving 
> me the fully correct results.
>
> ------------------------------------------------------
>
> Bottom line: The AI's ability to translate natural language instructions into 
> code is astounding. Errors the AI makes are subtle and probably not easy to 
> fix if you don't already know what you are doing. But the way that the output 
> can be "confidently incorrect" yet plausible makes the errors nearly 
> impossible to detect unless you actually run the code (you may have noticed 
> that when you read the code).
>
> Will our students use it? Absolutely.
>
> Will they successfully cheat with it? That depends on the assignment. We 
> probably need to _encourage_ them to use it rather than sanction - but 
> require them to attribute the AI, document prompts, and identify their own, 
> additional contributions.
>
> Will it help them learn? When you are aware of the issues, it may be quite 
> useful. It may be especially useful to teach them to specify their code 
> carefully and completely, and to ask questions in the right way. Test cases 
> are crucial.
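
As an illustration of the kind of test case meant here, a rough sketch: sorted_column_medians() is a hypothetical helper (not from either AI response) that wraps the steps of the prompt, and test_df is made-up data chosen so that a missed duplicate or a wrong apply() margin changes the answer.

```r
# Hypothetical helper wrapping the steps from the prompt
sorted_column_medians <- function(df, key = "names") {
  # Fail loudly on a "name" vs "names" mismatch instead of silently continuing
  stopifnot(key %in% names(df))
  df  <- df[!duplicated(df[[key]]), , drop = FALSE]
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  sort(sapply(num, median), decreasing = TRUE)
}

# Made-up test data: the duplicated "a" row carries extreme values, so
# forgetting to remove it shifts both medians (x: 2 -> 2.5, y: 20 -> 40).
test_df <- data.frame(
  names = c("a", "b", "a", "c"),
  x     = c(1, 2, 100, 3),
  y     = c(10, 20, 1000, 60)
)

result <- sorted_column_medians(test_df)
stopifnot(
  identical(names(result), c("y", "x")),
  isTRUE(all.equal(unname(result), c(20, 2)))
)

# A wrong key column should raise an error rather than return a result
bad_key <- tryCatch(
  sorted_column_medians(test_df, key = "name"),
  error = function(e) "error"
)
stopifnot(identical(bad_key, "error"))
```

The expected values were worked out by hand before running anything, which is the part a student would have to do themselves.
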
>
> How will it affect what we do as instructors? I don't know. Really.
>
> And the future? I am not pleased to extrapolate to a job market in which they 
> compete with knowledge workers who work 24/7 without benefits, vacation pay, 
> or even a salary. They'll need to rethink the value of their investment in an 
> academic education. We'll need to rethink what we do to provide value above 
> and beyond what AIs can do. (Nb. all of the arguments I hear about why 
> humans will always be better etc. are easily debunked, but that's even more 
> OT :-)
>
> --------------------------------------------------------
>
> If you have thoughts to share on how your institution is thinking about 
> academic integrity in this situation, or creative ideas on how to integrate 
> this into teaching, I'd love to hear from you.
>
>
> All the best!
> Boris
>
>
> --
> Boris Steipe MD, PhD
> University of Toronto
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
