https://thegradient.pub/independently-reproducible-machine-learning/

Peer review has been an integral part of scientific research for [more than 300 
years](https://blogs.scientificamerican.com/information-culture/the-birth-of-modern-peer-review/).
 But even before peer review was introduced, reproducibility was a primary 
component of the scientific method. One of the first reproducible experiments 
was presented by Jabir Ibn Haiyan in 800 CE. In the past few decades, many 
domains have encountered high profile cases of non-reproducible results. The 
[American Psychological Association has struggled with authors failing to make 
data available](https://psycnet.apa.org/doi/10.1037/0003-066X.61.7.726). A 2011 
study found that only [6% of medical studies could be fully 
reproduced](https://doi.org/10.1038%2Fnrd3439-c1). In 2016, a survey of 
researchers from many disciplines found that most had [failed to reproduce one 
of their previous papers](https://doi.org/10.1038%2F533452a). Now, we hear 
warnings that Artificial Intelligence (AI) and Machine Learning (ML) [face 
their own reproducibility 
crises](https://science.sciencemag.org/content/359/6377/725).

This leads us to ask: is it true? It would seem hard to believe, as ML 
permeates every smart device and intervenes ever more in our daily lives. From
helpful hints on how to [act like a polite human over 
email](https://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html),
 to Elon Musk’s 
[promise](https://www.wired.com/story/elon-musk-tesla-full-self-driving-2019-2020-promise/)
 of self-driving cars next year, it seems like machine learning is indeed 
reproducible.

How reproducible is the latest ML research, and can we begin to quantify what 
impacts its reproducibility? This question served as motivation for my [NeurIPS 
2019 paper](https://arxiv.org/abs/1909.06674). Based on a combination of 
masochism and stubbornness, over the past eight years I have attempted to 
implement various ML algorithms from scratch. This has resulted in a ML library 
called [JSAT](https://github.com/EdwardRaff/JSAT). My investigation in 
reproducible ML has also relied on personal notes and records hosted on 
[Mendeley](https://www.mendeley.com/) and Github. With these data, and clearly 
no instinct for preserving my own sanity, I set out to quantify and verify 
reproducibility! As I soon learned, I would be engaging in 
[meta-science](https://en.wikipedia.org/wiki/Metascience), the study of science 
itself.

What is Reproducible Machine Learning?

One does not simply follow the description in the paper. 
https://abstrusegoose.com/588

Before we dive in, it is important to define what we mean by reproducible. 
Ideally, full reproducibility means that simply reading a scientific paper 
should give you all the information you need to 1) set up the same experiments, 
2) follow the same approach, and then 3) obtain similar results.

If we can get all the way to step 3 based solely on information present in the 
paper, we might call that independent reproducibility. In this case, our
result is reproducible because we are able to get the same result, and 
independent because we have done so in an effort completely independent of the 
original publication.

But as our friend from the comic above might tell us, simply following the 
content of the paper isn’t always sufficient. If we can’t get to step 3 by 
using only the information in the paper (or from cited prior work), we would 
determine that the paper is not independently reproducible.

Some may wonder, why make this distinction between reproducibility and 
independent reproducibility? Almost all of AI and ML research is based on 
computer code. We don’t require the burden of expensive and labor-intensive 
chemical synthesis, waiting for bacteria in a petri dish to mature, or pesky 
human trials. It should be easy to simply get code from the authors, run that 
on the same data, and get the same results!

If you have never had to read a researcher's code before... you are doing 
pretty OK in life. Good job. 
http://phdcomics.com/comics/archive.php?comicid=1689

Our aversion to using or asking for the authors’ code is more than a fear of
working with undocumented research-grade code. [Chris 
Drummond](https://www.researchgate.net/profile/Chris_Drummond) has [described 
the approach](http://cogprints.org/7691/7/ICMLws09.pdf) of using an author’s 
code as replicability, and made a very salient argument that replication is 
desirable, but not sufficient for good science. A paper is supposed to be the 
scientific distillation of the work, representing what we have learned and now 
understand to enable these new results. If we can’t reproduce the results of a 
paper without the authors’ code, it may suggest that the paper itself didn’t
successfully capture the important scientific contributions. This is before we 
consider the possibility that there may be bugs in the code that actually 
benefit the results, or any number of other possible discrepancies between code 
and paper.

Another [great 
example](http://proceedings.mlr.press/v97/bouthillier19a/bouthillier19a.pdf) 
from ICML this past year showed that even if we can replicate the results of a 
paper, slightly altering the experimental setup could yield dramatically different results. For these reasons, we don’t want to consider the authors’
code, as this could be a source of bias. We want to focus on the question of 
reproducibility, without wading into the murky waters of replication.

What Makes a ML Paper Reproducible?

| Feature | Important? | My Reaction |
| --- | --- | --- |
| Hyperparameters | ✅ | 👍 |
| Easy to Read | ✅ | 👍 |
| Equations per Page | ✅ | 🤔 |
| Empirical vs Rigor | ✅ | 🤨 |
| Pseudo Code | ✅ | 🤯 |
| Replies to Questions | ✅ | 🤷 |
| Include Toy Problems | ❌ | 😭 |
| Year Published | ❌ | 😌 |
| Open Source Code | ❌ | 😱 |

Some of the features that were/were not related to reproducibility that I found the most interesting.

I reviewed every paper I have attempted to implement up to 2017, and filtered 
out papers based on two criteria: if the attempt would be biased by having 
looked at released source code, or if there was a personal relationship with 
the authors. For each paper, I recorded as much information as I could to 
create a quantifiable set of features. Some were completely objective (how many authors were on the paper), while others were highly subjective (does the paper look intimidating?). The goal of this analysis was to get as much information as
possible about things that might impact a paper’s reproducibility. This left me 
with 255 attempted papers, and 162 successful reproductions. Each paper was 
distilled to a set of 26 features, and statistical testing was done to 
determine which were significant. In the table above I've put what I think are the most interesting and important results, along with my initial
reactions.
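
For those curious about the mechanics, the analysis boils down to comparing each feature across the reproduced and non-reproduced groups and asking whether the difference is significant. The sketch below is only a rough illustration of that kind of per-feature test: the feature values are synthetic placeholders, not the study's data, and this is not necessarily the exact test used in the paper.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical "equations per page" values; 162 successes and 93 failures
# mirror the overall counts, but the feature values themselves are made up.
equations_reproduced = rng.poisson(lam=3.0, size=162)
equations_not_reproduced = rng.poisson(lam=5.0, size=93)

# Non-parametric test: do the two groups differ in this feature's distribution?
stat, p_value = mannwhitneyu(
    equations_reproduced, equations_not_reproduced, alternative="two-sided"
)
print(f"U = {stat:.1f}, p = {p_value:.4g}")
```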

Some of the results were unsurprising. For example, the number of authors
shouldn’t have any particular importance to a paper’s reproducibility, and it 
did not have a significant relationship. Hyperparameters are the knobs we can 
adjust to change an algorithm’s behavior, but they are not learned by the algorithm
itself. Instead, we humans must set their values (or devise a clever way to 
pick them). Whether or not a paper detailed the hyperparameters used was found 
to be significant, and we can intuit why. If you don’t tell the reader what the 
settings were, the reader has to guess. That takes work and time, and is error
prone! So, some of our results have given credence to the ideas the community 
has already been pursuing in order to make papers more reproducible. What is 
important is that we can now quantify why these are good things to be pursuing. 
Other findings follow basic logic, such as the finding that papers that are 
easier to read are easier to reproduce, likely because they are easier to 
understand.
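
To make the hyperparameter point concrete, here is a minimal sketch of what "telling the reader the settings" looks like in practice. The model, dataset, and values are hypothetical choices of mine, not taken from any paper in the study; the point is simply that every knob is written down.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Every hyperparameter is stated explicitly (the values are hypothetical),
# so someone else can re-run this without guessing.
model = SVC(
    kernel="rbf",    # kernel choice
    C=10.0,          # regularization strength
    gamma=0.01,      # RBF kernel bandwidth
    random_state=0,  # seed for any internal randomness
)

scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```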

I implore you to read the paper for a deeper discussion, but there are a few 
additional results that I think are particularly interesting; either because 
they challenge our assumptions about what we “know” a good paper is or lead to 
some surprising conclusions. All of these results carry more nuance than I can unpack in this article, but they are worth mentioning, if for nothing else than to
stimulate a deeper conversation, and hopefully spur further research to answer 
these questions.

Finding 1: Having fewer equations per page makes a paper more reproducible.

Math is like catnip for reviewers. They just can't help themselves. 
https://xkcd.com/982/

This appears to be because the most readable papers use the fewest
equations. We often see papers that have many equations and derivations listed, 
for any number of reasons. It appears that a careful and judicious use of 
equations makes things easier to read, primarily because you can use math 
selectively to communicate more effectively. This result clashes with the 
incentive structure of getting a paper published. On more than one occasion, 
reviewers have asked me to include more math in a paper. It may be that the 
math itself makes a paper seem more scientific or grounded in objectivity. While
more specification may seem to be better, it is not synonymous with 
reproducibility. This is a cultural issue we need to address as a community.

Finding 2: Empirical papers may be more reproducible than theory-oriented 
papers.

There is considerable debate about where, and how much, rigor needs to be the norm in the community. Much current work proceeds under the guise that our focus, as a community, should be on getting the best results for a given benchmark. Yet in optimizing for benchmarks, we risk losing our understanding of what is actually happening and why these methods work. The inclusion of theoretical work and formal proofs does not cover all aspects of
what might be meant by the term rigor. Given the common belief that elaborate 
mathematical proofs ensure a better understanding of a given method, it is 
interesting to see that greater mathematical specification isn’t necessarily 
making research easier to reproduce. The important point here is that papers 
containing a mix of theory and empirical emphasis have the same overall 
reproduction rates as purely empirical papers. An empirical bent can be helpful 
from the reproducibility perspective, but [could also hamper 
progress](https://openreview.net/pdf?id=rJWF0Fywf) by creating perverse 
incentives and unintended side effects.

Finding 3: Sharing code is not a panacea

We have already touched upon the idea that reproduction via released code is 
not the same thing as reproduction done independently. Is this a distinction without a difference? It is not! My results indicate that the open sourcing of
code is at best a weak indicator of reproducibility. As conferences begin to 
more strongly encourage code submission and examination as part of the review 
process, I believe this is a crucial point. As a community, we need to 
understand what our goals are with such efforts and what we are actually 
accomplishing. Careful thought and consideration should go into this 
distinction if we ever make code submission mandatory, and into the guidance we give reviewers for evaluating such code.

I find this result particularly noteworthy in terms of other people's 
reactions. While I was presenting at NeurIPS, many people commented on it. Half of them were certain that releasing code would be correlated with reproducibility, and the other half felt it obvious that no relationship would emerge. This strong contrast between deeply held opinions is a
perfect example of why I wanted to do this study. We don't really know until we 
sit down and measure it!

Finding 4: Papers with detailed pseudo code are just as reproducible as papers with no pseudo code.

Step-Code: Concise, but requires context from other parts of the paper to decipher.
Standard-Code: Relatively detailed, can be almost self-contained. Usually mathematical notation.
Code-Like: Almost always self-contained, easy to convert to code.

This finding challenged my assumptions of what constituted a good paper, but 
made more sense as I thought about the results. Somewhere in the paper, the 
process must be described. A computer scientist by training, I always preferred 
a type of description called pseudo code. But pseudo code can take many 
different forms. I categorized the papers into four groups: None, Step-Code, Standard-Code, and Code-Like. I have some representative samples of these above, from some widely reproduced papers that may or may not have been in
this study!

I was shocked when Standard-Code and Code-Like had roughly equal reproduction rates, and floored to discover that having none at all was just as good! Evidently, cogent writing can be just as effective at communicating a process. What was not as effective was so-called Step-Code, where the steps are given as a bulleted list, with each step referring to another section of the paper. Step-Code actually
makes reading and understanding the paper harder, as the reader must now jump 
back and forth between different sections, rather than following a single 
sequential flow.
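
To illustrate what I mean by Code-Like, a description at this level of detail can be typed in nearly verbatim. The example below is my own and is not taken from any paper in the study; the function and step size are purely illustrative.

```python
import numpy as np

def gradient_descent(grad, x0, step_size=0.1, iters=100):
    """Minimize a function, given its gradient, starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step_size * grad(x)  # step against the gradient
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimum is at 0.
print(gradient_descent(grad=lambda x: 2.0 * x, x0=[3.0, -4.0]))
```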

Finding 5: Creating simplified example problems does not appear to help with reproducibility.

This was another surprising result that I am still coming to grips with. I’ve 
always valued writers who can take a complex idea and boil it down to a simpler 
and more digestible form. I have likewise appreciated papers that create 
so-called toy problems: problems which exemplify some property in a way that is easily visualized and turned into experiments. Subjectively, I always
found simplified examples useful for understanding what a paper is trying to 
accomplish. Reproducing the toy problem was a useful tool in creating a smaller 
test case I could use for debugging. From an objective standpoint, simplified 
examples appear to provide no benefit for making a paper more reproducible. In 
fact, they do not even make papers more readable! I still struggle to 
understand and explain this result. This is exactly why it is important for us 
as a community to quantify these questions. If we do not do the work of 
quantification, we will never know whether our work is tackling the issues most
relevant to the research problem at hand.
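
For concreteness, this is the sort of toy problem I have in mind. The example below is a hypothetical one of my own, not drawn from the papers studied: a tiny two-dimensional dataset that makes a single property, non-linear separability, easy to visualize and test.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two interleaving half-moons: no straight line separates the classes,
# which makes the value of a non-linear method easy to see and to plot.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear = LogisticRegression().fit(X, y)
nonlinear = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear accuracy:    ", linear.score(X, y))    # noticeably below 1.0
print("non-linear accuracy:", nonlinear.score(X, y))  # near 1.0
```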

Finding 6: Please, check your email

The last result I want to discuss is that replying to questions has a huge 
impact on a paper’s reproducibility. This result was expected, as papers rarely contain a perfect description of their methods. I emailed 50
different authors with questions regarding how to reproduce their results. In 
the 24 cases where I never got a reply, I was able to reproduce their results 
only once (a 4% success rate). For the remaining 26 cases in which the author 
did respond, I was able to successfully reproduce 22 of the papers (an 85% 
success rate). I think this result is most interesting for what it implies 
about the publication process itself. What if we allowed published papers to be 
updated over time, without it becoming some kind of “new” publication? This 
way, authors could incorporate common feedback and questions into the original 
paper. This is already possible when papers are [posted on the 
arXiv](https://arxiv.org/); this should be the case for conference venue 
publications as well. These are things that could potentially advance science 
by increasing reproducibility, but only if we allow them to happen.
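
For the curious, the reply-rate numbers above form a small 2x2 table, and the gap is large enough to survive any reasonable test. The sketch below just re-derives the two rates and runs a Fisher exact test; the choice of test is mine, not necessarily what appears in the paper.

```python
from scipy.stats import fisher_exact

# 2x2 table from the counts above:
#                  reproduced   not reproduced
# author replied       22              4
# no reply              1             23
table = [[22, 4], [1, 23]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")

print(f"reply success rate:    {22 / 26:.0%}")  # ~85%
print(f"no-reply success rate: {1 / 24:.0%}")   # ~4%
print(f"Fisher exact p-value:  {p_value:.2g}")
```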

What Have We Learned?

Experts call this "hyperparameter tuning". https://xkcd.com/1838/

This work was inspired by the headline, “Artificial intelligence faces 
reproducibility crisis”. Is this headline hype or does it point to a systematic 
problem in the field? After completing this effort, my inclination is that 
there is room for improvement, but that we in the AI/ML field are doing a 
better job than most disciplines. A 62% success rate is higher than the rates found in many meta-analyses from other sciences, and I suspect my 62% number is lower than
reality. Others who are more familiar with research areas outside of my areas 
of expertise might be able to succeed where I have failed. Therefore, I 
consider the 62% estimate to be a lower bound.

One thing I want to make very clear: none of these results should be taken as a 
definitive statement on what is and what is not reproducible. There are a huge 
number of potential biases that may impact these results. Most obvious is that 
these 255 attempts at reproduction were all done by a single person. There are 
no community standards for internal consistency between meta-analysts. What I 
find easy to reproduce may be difficult for others, and vice-versa. For 
example, I couldn’t reproduce any of the Bayesian or fairness-based papers I 
attempted, but I don’t believe that these fields are irreproducible. My 
personal biases, in terms of background, education, resources, interests, and 
more, are all inseparable from the results obtained.

That said, I think this work provides strong evidence for a number of our 
communities’ current challenges while validating many reproducibility efforts 
currently under way in the community. The biggest takeaway is that we cannot
take all of our assumptions about so-called reproducible ML at face value. 
These assumptions need to be tested, and I hope more than anything that this 
work will inspire others to begin quantifying and collecting this data for 
themselves. As a community, we are in a unique position to perform
meta-science on ourselves. The cost of replication is so much lower for us than 
for any other field of science. What we learn here could have impacts that 
extend beyond AI & ML to other subfields of Computer Science.

More than anything, I think this work reinforces how difficult it is to evaluate the reproducibility of research. Considering each feature in isolation is a
fairly simple way to approach this analysis. This analysis has already 
delivered a number of potential insights, unexpected results, and complexities. 
However, it does not begin to consider correlations among papers based on shared authors, representing the data as a graph, or even just looking at
non-linear interactions of the current features! This is why I’ve attempted to 
make [much of the data publicly 
available](https://github.com/EdwardRaff/Quantifying-Independently-Reproducible-ML)
 so that others can perform a deeper analysis.
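
As one hypothetical example of the kind of deeper analysis I mean, a tree ensemble can pick up non-linear interactions that per-feature tests miss. The file name and column names below are placeholders, not the repository's actual schema; treat this as a sketch of the approach only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder file and column names; substitute the repository's actual layout.
df = pd.read_csv("reproducibility_features.csv")
X = df.drop(columns=["reproduced"])
y = df["reproduced"]

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Importances blend main effects with interactions, one step beyond
# testing each feature in isolation.
ranked = sorted(zip(X.columns, forest.feature_importances_), key=lambda t: -t[1])
for name, importance in ranked:
    print(f"{name:30s} {importance:.3f}")
```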

Finally, it has been pointed out to me that I may have created the most 
unreproducible ML research ever. In reality, this raises a number of issues regarding how we do the science of meta-science, that is, how we study the way we implement and
evaluate our research. With that, I hope I’ve encouraged you to read my paper 
for further details and discussion. Think about how your own work fits into the 
larger picture of human knowledge and science. As the avalanche of new AI and 
ML research continues to grow, our ability to leverage and learn from all this 
work will be highly dependent on our ability to distill ever more knowledge 
down to a digestible form. At the same time, our process and systems must 
result in reproducible work that does not lead us astray. I have more work I 
would like to do in this space, and I hope you will join me.

[Dr. Edward Raff](https://www.edwardraff.com/) is a Chief Scientist at Booz 
Allen Hamilton, Visiting Professor at the University of Maryland, Baltimore 
County (UMBC), and author of the [JSAT](https://github.com/EdwardRaff/JSAT) 
machine learning library. Dr. Raff leads the machine learning research team at 
Booz Allen, while also supporting clients who have advanced ML needs. He 
received his BS and MS in Computer Science from Purdue University, and his PhD 
from UMBC. You can follow him on [Twitter 
@EdwardRaffML](https://twitter.com/EdwardRaffML).
