https://thegradient.pub/independently-reproducible-machine-learning/

Peer review has been an integral part of scientific research for [more than 300 
years](https://blogs.scientificamerican.com/information-culture/the-birth-of-modern-peer-review/).
 But even before peer review was introduced, reproducibility was a primary 
component of the scientific method. One of the first reproducible experiments 
was presented by Jabir Ibn Haiyan in 800 CE. In the past few decades, many 
domains have encountered high profile cases of non-reproducible results. The 
[American Psychological Association has struggled with authors failing to make 
data available](https://psycnet.apa.org/doi/10.1037/0003-066X.61.7.726). A 2011 
study found that only [6% of medical studies could be fully 
reproduced](https://doi.org/10.1038%2Fnrd3439-c1). In 2016, a survey of 
researchers from many disciplines found that most had [failed to reproduce one 
of their previous papers](https://doi.org/10.1038%2F533452a). Now, we hear 
warnings that Artificial Intelligence (AI) and Machine Learning (ML) [face 
their own reproducibility 
crises](https://science.sciencemag.org/content/359/6377/725).

This leads us to ask: is it true? It would seem hard to believe, as ML 
permeates every smart device and intervenes ever more in our daily lives. From
helpful hints on how to [act like a polite human over 
email](https://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html),
 to Elon Musk’s 
[promise](https://www.wired.com/story/elon-musk-tesla-full-self-driving-2019-2020-promise/)
 of self-driving cars next year, it seems like machine learning is indeed 
reproducible.

How reproducible is the latest ML research, and can we begin to quantify what 
impacts its reproducibility? This question served as motivation for my [NeurIPS 
2019 paper](https://arxiv.org/abs/1909.06674). Based on a combination of 
masochism and stubbornness, over the past eight years I have attempted to 
implement various ML algorithms from scratch. This has resulted in a ML library 
called [JSAT](https://github.com/EdwardRaff/JSAT). My investigation in 
reproducible ML has also relied on personal notes and records hosted on 
[Mendeley](https://www.mendeley.com/) and Github. With these data, and clearly 
no instinct for preserving my own sanity, I set out to quantify and verify 
reproducibility! As I soon learned, I would be engaging in 
[meta-science](https://en.wikipedia.org/wiki/Metascience), the study of science 
itself.

What is Reproducible Machine Learning?

One does not simply follow the description in the paper. 
https://abstrusegoose.com/588

Before we dive in, it is important to define what we mean by reproducible. 
Ideally, full reproducibility means that simply reading a scientific paper 
should give you all the information you need to 1) set up the same experiments, 
2) follow the same approach, and then 3) obtain similar results.

If we can get all the way to step 3 based solely on information present in the 
paper, we might call that independent reproducibility. In this case, our
result is reproducible because we are able to get the same result, and 
independent because we have done so in an effort completely independent of the 
original publication.

But as our friend from the comic above might tell us, simply following the 
content of the paper isn’t always sufficient. If we can’t get to step 3 by 
using only the information in the paper (or from cited prior work), we would 
determine that the paper is not independently reproducible.

Some may wonder, why make this distinction between reproducibility and 
independent reproducibility? Almost all of AI and ML research is based on 
computer code. We don’t require the burden of expensive and labor-intensive 
chemical synthesis, waiting for bacteria in a petri dish to mature, or pesky 
human trials. It should be easy to simply get code from the authors, run that 
on the same data, and get the same results!

If you have never had to read a researcher's code before... you are doing 
pretty OK in life. Good job. 
http://phdcomics.com/comics/archive.php?comicid=1689

Our aversion to using or asking for the authors’ code is more than a fear of
working with undocumented research-grade code. [Chris 
Drummond](https://www.researchgate.net/profile/Chris_Drummond) has [described 
the approach](http://cogprints.org/7691/7/ICMLws09.pdf) of using an author’s 
code as replicability, and made a very salient argument that replication is 
desirable, but not sufficient for good science. A paper is supposed to be the 
scientific distillation of the work, representing what we have learned and now 
understand to enable these new results. If we can’t reproduce the results of a 
paper without the authors’ code, it may suggest that the paper itself didn’t
successfully capture the important scientific contributions. This is before we 
consider the possibility that there may be bugs in the code that actually 
benefit the results, or any number of other possible discrepancies between code 
and paper.

Another [great 
example](http://proceedings.mlr.press/v97/bouthillier19a/bouthillier19a.pdf) 
from ICML this past year showed that even if we can replicate the results of a 
paper, slightly altering the experimental setup could yield dramatically different results. For these reasons, we don’t want to consider the authors’
code, as this could be a source of bias. We want to focus on the question of 
reproducibility, without wading into the murky waters of replication.

What Makes a ML Paper Reproducible?

| Feature | Important? | My Reaction |
| --- | --- | --- |
| Hyperparameters | ✅ | 👍 |
| Easy to Read | ✅ | 👍 |
| Equations per Page | ✅ | 🤔 |
| Empirical vs Rigor | ✅ | 🤨 |
| Pseudo Code | ✅ | 🤯 |
| Replies to Questions | ✅ | 🤷 |
| Include Toy Problems | ❌ | 😭 |
| Year Published | ❌ | 😌 |
| Open Source Code | ❌ | 😱 |

Some of the features that were/were not related to reproducibility that I found the most interesting.

I reviewed every paper I have attempted to implement up to 2017, and filtered 
out papers based on two criteria: if the attempt would be biased by having 
looked at released source code, or if there was a personal relationship with 
the authors. For each paper, I recorded as much information as I could to 
create a quantifiable set of features. Some were completely objective (how many authors were on the paper), while others were highly subjective (does the paper look intimidating?). The goal of this analysis was to get as much information as
possible about things that might impact a paper’s reproducibility. This left me 
with 255 attempted papers, and 162 successful reproductions. Each paper was 
distilled to a set of 26 features, and statistical testing was done to 
determine which were significant. In the table above I've put what I think are the most interesting and important results, along with my initial
reactions.
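
For those curious about the mechanics, the analysis boils down to comparing each feature across the reproduced and non-reproduced groups and asking whether the difference is significant. The sketch below is only a rough illustration of that kind of per-feature test: the feature values are synthetic placeholders, not the study's data, and this is not necessarily the exact test used in the paper.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical "equations per page" values; 162 successes and 93 failures
# mirror the overall counts, but the feature values themselves are made up.
equations_reproduced = rng.poisson(lam=3.0, size=162)
equations_not_reproduced = rng.poisson(lam=5.0, size=93)

# Non-parametric test: do the two groups differ in this feature's distribution?
stat, p_value = mannwhitneyu(
    equations_reproduced, equations_not_reproduced, alternative="two-sided"
)
print(f"U = {stat:.1f}, p = {p_value:.4g}")
```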

Some of the results were unsurprising. For example, the number of authors
shouldn’t have any particular importance to a paper’s reproducibility, and it 
did not have a significant relationship. Hyperparameters are the knobs we can 
adjust to change an algorithm’s behavior, but they are not learned by the algorithm
itself. Instead, we humans must set their values (or devise a clever way to 
pick them). Whether or not a paper detailed the hyperparameters used was found 
to be significant, and we can intuit why. If you don’t tell the reader what the 
settings were, the reader has to guess. That takes work and time, and is error
prone! So, some of our results have given credence to the ideas the community 
has already been pursuing in order to make papers more reproducible. What is 
important is that we can now quantify why these are good things to be pursuing. 
Other findings follow basic logic, such as the finding that papers that are 
easier to read are easier to reproduce, likely because they are easier to 
understand.
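
To make the hyperparameter point concrete, here is a minimal sketch of what "telling the reader the settings" looks like in practice. The model, dataset, and values are hypothetical choices of mine, not taken from any paper in the study; the point is simply that every knob is written down.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Every hyperparameter is stated explicitly (the values are hypothetical),
# so someone else can re-run this without guessing.
model = SVC(
    kernel="rbf",    # kernel choice
    C=10.0,          # regularization strength
    gamma=0.01,      # RBF kernel bandwidth
    random_state=0,  # seed for any internal randomness
)

scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```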

I implore you to read the paper for a deeper discussion, but there are a few 
additional results that I think are particularly interesting; either because 
they challenge our assumptions about what we “know” a good paper is or lead to 
some surprising conclusions. All of these results carry more nuance than I can unpack in this article, but they are worth mentioning, if for nothing else than to
stimulate a deeper conversation, and hopefully spur further research to answer 
these questions.

Finding 1: Having fewer equations per page makes a paper more reproducible.

Math is like catnip for reviewers. They just can't help themselves. 
https://xkcd.com/982/

This appears to be because the most readable papers use the fewest
equations. We often see papers that have many equations and derivations listed, 
for any number of reasons. It appears that a careful and judicious use of 
equations makes things easier to read, primarily because you can use math 
selectively to communicate more effectively. This result clashes with the 
incentive structure of getting a paper published. On more than one occasion, 
reviewers have asked me to include more math in a paper. It may be that the 
math itself makes a paper seem more scientific or grounded in objectivity. While
more specification may seem to be better, it is not synonymous with 
reproducibility. This is a cultural issue we need to address as a community.

Finding 2: Empirical papers may be more reproducible than theory-oriented 
papers.

There is considerable debate about where, and how much, rigor needs to be the norm in the community. Much current work proceeds under the guise that our focus, as a community, should be on getting the best results for a given benchmark. Yet in optimizing for benchmarks, we risk losing our understanding of what is actually happening and why these methods work. The inclusion of theoretical work and formal proofs does not cover all aspects of
what might be meant by the term rigor. Given the common belief that elaborate 
mathematical proofs ensure a better understanding of a given method, it is 
interesting to see that greater mathematical specification isn’t necessarily 
making research easier to reproduce. The important point here is that papers 
containing a mix of theory and empirical emphasis have the same overall 
reproduction rates as purely empirical papers. An empirical bent can be helpful 
from the reproducibility perspective, but [could also hamper 
progress](https://openreview.net/pdf?id=rJWF0Fywf) by creating perverse 
incentives and unintended side effects.

Finding 3: Sharing code is not a panacea

We have already touched upon the idea that reproduction via released code is 
not the same thing as reproduction done independently. Is this a distinction without a difference? It is not! My results indicate that the open sourcing of
code is at best a weak indicator of reproducibility. As conferences begin to 
more strongly encourage code submission and examination as part of the review 
process, I believe this is a crucial point. As a community, we need to 
understand what our goals are with such efforts and what we are actually 
accomplishing. Careful thought and consideration should go into this 
distinction if we ever make code submission mandatory, and into the guidance we give reviewers for evaluating such code.

I find this result particularly noteworthy in terms of other people's 
reactions. While I was presenting at NeurIPS, many people commented on it. Half of them were certain that releasing code would be correlated with reproducibility, and the other half felt it obvious that no relationship would emerge. This strong contrast between deeply held opinions is a
perfect example of why I wanted to do this study. We don't really know until we 
sit down and measure it!

Finding 4: Papers with detailed pseudo code are just as reproducible as papers with no pseudo code.

Step-Code: Concise, but requires context from other parts of the paper to decipher.
Standard-Code: Relatively detailed, can be almost self-contained. Usually mathematical notation.
Code-Like: Almost always self-contained, easy to convert to code.

This finding challenged my assumptions of what constituted a good paper, but 
made more sense as I thought about the results. Somewhere in the paper, the 
process must be described. A computer scientist by training, I always preferred 
a type of description called pseudo code. But pseudo code can take many 
different forms. I categorized the papers into four groups: None, Step-Code, Standard-Code, and Code-Like. I have some representative samples of these above, from some widely reproduced papers that may or may not have been in
this study!

I was shocked when Standard-Code and Code-Like had roughly equal reproduction rates, and floored to discover that having none at all was just as good! Evidently, cogent writing can be just as effective at communicating a process. What was not as effective was so-called Step-Code, where the steps are given as a bulleted list, with each step referring to another section of the paper. Step-Code actually
makes reading and understanding the paper harder, as the reader must now jump 
back and forth between different sections, rather than following a single 
sequential flow.
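
To illustrate what I mean by Code-Like, a description at this level of detail can be typed in nearly verbatim. The example below is my own and is not taken from any paper in the study; the function and step size are purely illustrative.

```python
import numpy as np

def gradient_descent(grad, x0, step_size=0.1, iters=100):
    """Minimize a function, given its gradient, starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step_size * grad(x)  # step against the gradient
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimum is at 0.
print(gradient_descent(grad=lambda x: 2.0 * x, x0=[3.0, -4.0]))
```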

Finding 5: Creating simplified example problems does not appear to help with reproducibility.

This was another surprising result that I am still coming to grips with. I’ve 
always valued writers who can take a complex idea and boil it down to a simpler 
and more digestible form. I have likewise appreciated papers that create 
so-called toy problems: problems which exemplify some property in a way that is easily visualized and turned into experiments. Subjectively, I always
found simplified examples useful for understanding what a paper is trying to 
accomplish. Reproducing the toy problem was a useful tool in creating a smaller 
test case I could use for debugging. From an objective standpoint, simplified 
examples appear to provide no benefit for making a paper more reproducible. In 
fact, they do not even make papers more readable! I still struggle to 
understand and explain this result. This is exactly why it is important for us 
as a community to quantify these questions. If we do not do the work of 
quantification, we will never know whether our work is tackling the issues most
relevant to the research problem at hand.
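
For concreteness, this is the sort of toy problem I have in mind. The example below is a hypothetical one of my own, not drawn from the papers studied: a tiny two-dimensional dataset that makes a single property, non-linear separability, easy to visualize and test.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two interleaving half-moons: no straight line separates the classes,
# which makes the value of a non-linear method easy to see and to plot.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear = LogisticRegression().fit(X, y)
nonlinear = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear accuracy:    ", linear.score(X, y))    # noticeably below 1.0
print("non-linear accuracy:", nonlinear.score(X, y))  # near 1.0
```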

Finding 6: Please, check your email

The last result I want to discuss is that replying to questions has a huge 
impact on a paper’s reproducibility. This result was expected, as papers rarely contain a perfect description of their methods. I emailed 50
different authors with questions regarding how to reproduce their results. In 
the 24 cases where I never got a reply, I was able to reproduce their results 
only once (a 4% success rate). For the remaining 26 cases in which the author 
did respond, I was able to successfully reproduce 22 of the papers (an 85% 
success rate). I think this result is most interesting for what it implies 
about the publication process itself. What if we allowed published papers to be 
updated over time, without it becoming some kind of “new” publication? This 
way, authors could incorporate common feedback and questions into the original 
paper. This is already possible when papers are [posted on the 
arXiv](https://arxiv.org/); this should be the case for conference venue 
publications as well. These are things that could potentially advance science 
by increasing reproducibility, but only if we allow them to happen.
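
For the curious, the reply-rate numbers above form a small 2x2 table, and the gap is large enough to survive any reasonable test. The sketch below just re-derives the two rates and runs a Fisher exact test; the choice of test is mine, not necessarily what appears in the paper.

```python
from scipy.stats import fisher_exact

# 2x2 table from the counts above:
#                  reproduced   not reproduced
# author replied       22              4
# no reply              1             23
table = [[22, 4], [1, 23]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")

print(f"reply success rate:    {22 / 26:.0%}")  # ~85%
print(f"no-reply success rate: {1 / 24:.0%}")   # ~4%
print(f"Fisher exact p-value:  {p_value:.2g}")
```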

What Have We Learned?

Experts call this "hyperparameter tuning". https://xkcd.com/1838/

This work was inspired by the headline, “Artificial intelligence faces 
reproducibility crisis”. Is this headline hype or does it point to a systematic 
problem in the field? After completing this effort, my inclination is that 
there is room for improvement, but that we in the AI/ML field are doing a 
better job than most disciplines. A 62% success rate is higher than the rates found in many meta-analyses from other sciences, and I suspect my 62% number is lower than
reality. Others who are more familiar with research areas outside of my areas 
of expertise might be able to succeed where I have failed. Therefore, I 
consider the 62% estimate to be a lower bound.

One thing I want to make very clear: none of these results should be taken as a 
definitive statement on what is and what is not reproducible. There are a huge 
number of potential biases that may impact these results. Most obvious is that 
these 255 attempts at reproduction were all done by a single person. There are 
no community standards for internal consistency between meta-analysts. What I 
find easy to reproduce may be difficult for others, and vice-versa. For 
example, I couldn’t reproduce any of the Bayesian or fairness-based papers I 
attempted, but I don’t believe that these fields are irreproducible. My 
personal biases, in terms of background, education, resources, interests, and 
more, are all inseparable from the results obtained.

That said, I think this work provides strong evidence for a number of our 
communities’ current challenges while validating many reproducibility efforts 
currently under way in the community. The biggest takeaway is that we cannot
take all of our assumptions about so-called reproducible ML at face value. 
These assumptions need to be tested, and I hope more than anything that this 
work will inspire others to begin quantifying and collecting this data for 
themselves. As a community, we are in a unique position to perform
meta-science on ourselves. The cost of replication is so much lower for us than 
for any other field of science. What we learn here could have impacts that 
extend beyond AI & ML to other subfields of Computer Science.

More than anything, I think this work reinforces how difficult it is to evaluate the reproducibility of research. Considering each feature in isolation is a
fairly simple way to approach this analysis. This analysis has already 
delivered a number of potential insights, unexpected results, and complexities. 
However, it does not begin to consider correlations among papers based on shared authors, representing the data as a graph, or even just looking at
non-linear interactions of the current features! This is why I’ve attempted to 
make [much of the data publicly 
available](https://github.com/EdwardRaff/Quantifying-Independently-Reproducible-ML)
 so that others can perform a deeper analysis.
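
As one hypothetical example of the kind of deeper analysis I mean, a tree ensemble can pick up non-linear interactions that per-feature tests miss. The file name and column names below are placeholders, not the repository's actual schema; treat this as a sketch of the approach only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder file and column names; substitute the repository's actual layout.
df = pd.read_csv("reproducibility_features.csv")
X = df.drop(columns=["reproduced"])
y = df["reproduced"]

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Importances blend main effects with interactions, one step beyond
# testing each feature in isolation.
ranked = sorted(zip(X.columns, forest.feature_importances_), key=lambda t: -t[1])
for name, importance in ranked:
    print(f"{name:30s} {importance:.3f}")
```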

Finally, it has been pointed out to me that I may have created the most 
unreproducible ML research ever. In reality, this raises a number of issues regarding how we do the science of meta-science, that is, how we study the way we implement and
evaluate our research. With that, I hope I’ve encouraged you to read my paper 
for further details and discussion. Think about how your own work fits into the 
larger picture of human knowledge and science. As the avalanche of new AI and 
ML research continues to grow, our ability to leverage and learn from all this 
work will be highly dependent on our ability to distill ever more knowledge 
down to a digestible form. At the same time, our process and systems must 
result in reproducible work that does not lead us astray. I have more work I 
would like to do in this space, and I hope you will join me.

[Dr. Edward Raff](https://www.edwardraff.com/) is a Chief Scientist at Booz 
Allen Hamilton, Visiting Professor at the University of Maryland, Baltimore 
County (UMBC), and author of the [JSAT](https://github.com/EdwardRaff/JSAT) 
machine learning library. Dr. Raff leads the machine learning research team at 
Booz Allen, while also supporting clients who have advanced ML needs. He 
received his BS and MS in Computer Science from Purdue University, and his PhD 
from UMBC. You can follow him on [Twitter 
@EdwardRaffML](https://twitter.com/EdwardRaffML).
