If you go to the Downloads <http://spark.apache.org/downloads.html> page and download Spark 2.2.1, you’ll get a link to an Apache mirror. It wasn’t always this way. As recently as Spark 2.2.0, downloads were served via CloudFront <https://aws.amazon.com/cloudfront/>, which was backed by an S3 bucket named spark-related-packages.
It seems that we’ve stopped using CloudFront, and the S3 bucket behind it has stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m guessing this is part of an effort to use the Apache mirror network, like other Apache projects do.

From a user perspective, the Apache mirror network is several steps down from a modern CDN. Let me summarize why:

1. *Apache mirrors are often slow.* Apache does not impose any performance requirements on its mirrors <https://issues.apache.org/jira/browse/INFRA-10999?focusedCommentId=15717950&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15717950>. The difference between getting a good mirror and a bad one is downloading Spark in under a minute vs. 20 minutes. The problem is bad enough that I’ve considered adding an Apache mirror blacklist <https://github.com/nchammas/flintrock/issues/84#issuecomment-185038678> to Flintrock to avoid getting one of these dud mirrors.

2. *Apache mirrors are inconvenient to use.* When you download something from an Apache mirror, you get a link like this one <https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz>. Instead of being redirected straight to your download, you have to process the results you get back <https://github.com/nchammas/flintrock/blob/67bf84a1b7cfa1c276cf57ecd8a0b27613ad2698/flintrock/scripts/download-hadoop.py#L21-L42> to find your download target (see the P.S. below for a sketch of what that processing looks like). You also have to handle a high download failure rate, since sometimes the mirror you’re given doesn’t actually have the file it claims to have.

3. *Apache mirrors are incomplete.* Apache mirrors only keep around the latest releases, save for a few “archive” mirrors, which are often slow. So if you want to download anything but the latest version of Spark, you are out of luck.

Some of these problems can be mitigated by picking a specific mirror that works well and hardcoding it in your scripts, but that defeats the purpose of dynamically selecting a mirror and makes you a “bad” user of the mirror network.

I raised some of these issues over on INFRA-10999 <https://issues.apache.org/jira/browse/INFRA-10999>. The ticket sat for a year before I heard anything back, and the bottom line was that none of the above problems have a solution on the horizon.

It’s fine. I understand that Apache is a volunteer organization and that the infrastructure team has a lot to manage as it is. I still find it disappointing that an organization of Apache’s stature doesn’t have a better solution for this in collaboration with a third party. Python serves PyPI downloads using Fastly <https://www.fastly.com/> and Homebrew serves packages using Bintray <https://bintray.com/>. They both work really, really well. Why don’t we have something as good for Apache projects?

Anyway, that’s a separate discussion. What I want to say is this:

Dear whoever owns the spark-related-packages S3 bucket <https://s3.amazonaws.com/spark-related-packages/>,

Please keep the bucket up-to-date with the latest Spark releases, alongside the past releases that are already on there. It’s a huge help to the Flintrock <https://github.com/nchammas/flintrock> project, and an equally big help to those of us writing infrastructure automation scripts that deploy Spark in other contexts.

I understand that hosting this stuff is not free, and that I am not paying anything for this service. If it needs to go, so be it.
But I wanted to take this opportunity to lay out the benefits I’ve enjoyed thanks to having this bucket around, and to make sure that if it did die, it didn’t die a quiet death.

Sincerely,
Nick
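
P.S. For anyone else scripting against the mirror network in the meantime, here is a rough sketch of the mirror-resolution dance I mentioned in point 2 above. It assumes closer.lua’s as_json=1 output still includes the ‘preferred’ and ‘path_info’ fields, as it did when I wrote Flintrock’s download script; the helper name resolve_apache_mirror is just for illustration:

    # Rough sketch: ask Apache's closer.lua mirror selector which mirror
    # to use for a given file, then build the full download URL.
    # Assumes the as_json=1 response still contains 'preferred' (the
    # mirror picked for you) and 'path_info' (the file's path on it).
    import json
    from urllib.request import urlopen

    def resolve_apache_mirror(path):
        """Resolve `path` to a concrete download URL via closer.lua."""
        query_url = 'https://www.apache.org/dyn/closer.lua/' + path + '?as_json=1'
        with urlopen(query_url) as response:
            mirrors = json.loads(response.read().decode('utf-8'))
        return mirrors['preferred'] + mirrors['path_info']

    print(resolve_apache_mirror(
        'spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz'))

A real script would also need to retry against one of the alternate mirrors in the response (or the archive) when the preferred mirror turns out not to have the file, which is exactly the kind of bookkeeping a plain CDN or S3 URL makes unnecessary.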