If you go to the Downloads <http://spark.apache.org/downloads.html> page
and download Spark 2.2.1, you’ll get a link to an Apache mirror. It wasn’t
always this way. As recently as Spark 2.2.0, downloads were served via
CloudFront <https://aws.amazon.com/cloudfront/>, which was backed by an S3
bucket named spark-related-packages.

It seems that we’ve stopped using CloudFront, and the S3 bucket behind it
has stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m guessing
this is part of an effort to use the Apache mirror network, like other
Apache projects do.

From a user perspective, the Apache mirror network is several steps down
from using a modern CDN. Let me summarize why:

   1. *Apache mirrors are often slow.* Apache does not impose any
   performance requirements on its mirrors
   <https://issues.apache.org/jira/browse/INFRA-10999?focusedCommentId=15717950&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15717950>.
   The difference between getting a good mirror and a bad one is downloading
   Spark in less than a minute vs. 20 minutes. The problem is so bad that
   I’ve considered adding an Apache mirror blacklist
   <https://github.com/nchammas/flintrock/issues/84#issuecomment-185038678>
   to Flintrock just to avoid these dud mirrors.
   2. *Apache mirrors are inconvenient to use.* When you download something
   from an Apache mirror, you get a link like this one
   <https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz>.
   Instead of being automatically redirected to your download, you have to
   process the response you get back
   <https://github.com/nchammas/flintrock/blob/67bf84a1b7cfa1c276cf57ecd8a0b27613ad2698/flintrock/scripts/download-hadoop.py#L21-L42>
   to find your download target (see the sketch after this list). You also
   need to handle a high download failure rate, since sometimes the mirror
   you get doesn’t have the file it claims to have.
   3. *Apache mirrors are incomplete.* Apache mirrors only keep the latest
   releases, save for a few “archive” mirrors, which are often slow. So if
   you want to download anything but the latest version of Spark, you are
   out of luck.
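
To make point 2 concrete, here is roughly what that dance looks like in
Python. This is a minimal sketch along the lines of Flintrock’s download
script; the as_json parameter and the preferred/path_info fields are what
closer.lua returned last I checked:

    import json
    import urllib.error
    import urllib.request

    # Ask Apache's mirror resolver for a mirror. The as_json parameter
    # makes closer.lua return machine-readable output instead of a web page.
    resolver_url = (
        'https://www.apache.org/dyn/closer.lua/'
        'spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz'
        '?as_json'
    )

    with urllib.request.urlopen(resolver_url) as response:
        mirror_info = json.load(response)

    # 'preferred' is the mirror closer.lua picked for us; 'path_info' is
    # the path of the file we asked for. Joining them is on us.
    download_url = mirror_info['preferred'] + mirror_info['path_info']

    # The chosen mirror may not actually have the file, so be ready to
    # re-resolve and retry against a different mirror.
    try:
        urllib.request.urlretrieve(
            download_url, 'spark-2.2.1-bin-hadoop2.7.tgz')
    except urllib.error.HTTPError:
        print('Mirror is missing the file; re-resolve and try again.')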

Some of these problems can be mitigated by picking a specific mirror that
works well and hardcoding it in your scripts, but that defeats the purpose
of dynamically selecting a mirror and makes you a “bad” user of the mirror
network.

I raised some of these issues over on INFRA-10999
<https://issues.apache.org/jira/browse/INFRA-10999>. The ticket sat for a
year before I heard anything back, and the bottom line was that none of the
above problems have a solution on the horizon. It’s fine. I understand that
Apache is a volunteer organization and that the infrastructure team has a
lot to manage as it is. I still find it disappointing that an organization
of Apache’s stature doesn’t have a better solution for this in
collaboration with a third party. Python serves PyPI downloads using Fastly
<https://www.fastly.com/> and Homebrew serves packages using Bintray
<https://bintray.com/>. They both work really, really well. Why don’t we
have something as good for Apache projects? Anyway, that’s a separate
discussion.

What I want to say is this:

Dear whoever owns the spark-related-packages S3 bucket
<https://s3.amazonaws.com/spark-related-packages/>,

Please keep the bucket up-to-date with the latest Spark releases, alongside
the past releases that are already on there. It’s a huge help to the
Flintrock <https://github.com/nchammas/flintrock> project, and it’s an
equally big help to those of us writing infrastructure automation scripts
that deploy Spark in other contexts.
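
For contrast with the mirror dance above, fetching from the bucket is
essentially a one-liner (assuming the spark-<version>-bin-<hadoop>.tgz key
naming the bucket has used so far):

    import urllib.request

    # One predictable URL: no mirror resolution, no missing-file retries.
    urllib.request.urlretrieve(
        'https://s3.amazonaws.com/spark-related-packages/'
        'spark-2.2.0-bin-hadoop2.7.tgz',
        'spark-2.2.0-bin-hadoop2.7.tgz',
    )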

I understand that hosting this stuff is not free, and that I am not paying
anything for this service. If it needs to go, so be it. But I wanted to
take this opportunity to lay out the benefits I’ve enjoyed thanks to having
this bucket around, and to make sure that if it did die, it didn’t die a
quiet death.

Sincerely,
Nick
