So is there no hope for this S3 bucket, or room to replace it with a bucket owned by some organization other than AMPLab (which is technically now defunct <https://amplab.cs.berkeley.edu/endofproject/>, I guess)? Sorry to persist, but I just have to ask.
On Tue, Feb 27, 2018 at 10:36 AM Michael Heuer <heue...@gmail.com> wrote:

> On Tue, Feb 27, 2018 at 8:17 AM, Sean Owen <sro...@gmail.com> wrote:
>
>> See http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-d3kbcqa49mib13-cloudfront-net-td22427.html -- it was 'retired', yes.
>>
>> Agree with all that, though they're intended for occasional individual use and not a case where performance and uptime matter. For that, I think you'd want to just host your own copy of the bits you need.
>>
>> The notional problem was that the S3 bucket wasn't obviously controlled/blessed by the ASF and yet was a source of official bits. It was another set of third-party credentials to hand around to release managers, which was IIRC a little problematic.
>>
>> Homebrew does host distributions of ASF projects, like Spark, FWIW.
>
> To clarify, the apache-spark.rb formula in Homebrew uses the Apache mirror closer.lua script:
>
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb#L4
>
> michael
>
>> On Mon, Feb 26, 2018 at 10:57 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> If you go to the Downloads page <http://spark.apache.org/downloads.html> and download Spark 2.2.1, you’ll get a link to an Apache mirror. It didn’t use to be this way. As recently as Spark 2.2.0, downloads were served via CloudFront <https://aws.amazon.com/cloudfront/>, which was backed by an S3 bucket named spark-related-packages.
>>>
>>> It seems that we’ve stopped using CloudFront, and the S3 bucket behind it has stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m guessing this is part of an effort to use the Apache mirror network, like other Apache projects do.
>>>
>>> From a user perspective, the Apache mirror network is several steps down from using a modern CDN. Let me summarize why:
>>>
>>> 1. *Apache mirrors are often slow.* Apache does not impose any performance requirements on its mirrors <https://issues.apache.org/jira/browse/INFRA-10999?focusedCommentId=15717950&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15717950>. Getting a good mirror instead of a bad one can be the difference between downloading Spark in under a minute and waiting 20 minutes for it. The problem is so bad that I’ve thought about adding an Apache mirror blacklist <https://github.com/nchammas/flintrock/issues/84#issuecomment-185038678> to Flintrock to avoid getting one of these dud mirrors.
>>>
>>> 2. *Apache mirrors are inconvenient to use.* When you download something from an Apache mirror, you get a link like this one <https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz>. Instead of automatically redirecting you to your download, though, it makes you process the results you get back <https://github.com/nchammas/flintrock/blob/67bf84a1b7cfa1c276cf57ecd8a0b27613ad2698/flintrock/scripts/download-hadoop.py#L21-L42> to find your download target (see the sketch after this list). And you need to handle the high download failure rate, since sometimes the mirror you get doesn’t have the file it claims to have.
>>>
>>> 3. *Apache mirrors are incomplete.* Apache mirrors only keep around the latest releases, save for a few “archive” mirrors, which are often slow. So if you want to download anything but the latest version of Spark, you are out of luck.
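>>>
>>> To make item 2 concrete, here is roughly what that processing looks like. This is a minimal sketch rather than what Flintrock actually ships; it assumes closer.lua’s ?as_json mode still returns “preferred” and “path_info” fields, and it falls back to the (slow) archive host for the older releases item 3 describes:
>>>
>>>     import json
>>>     import urllib.error
>>>     import urllib.request
>>>
>>>     ARCHIVE = "https://archive.apache.org/dist/"
>>>
>>>     def resolve_mirror_url(path):
>>>         # Ask closer.lua for a suggested mirror rather than scraping its HTML page.
>>>         api = "https://www.apache.org/dyn/closer.lua/" + path + "?as_json=1"
>>>         with urllib.request.urlopen(api) as response:
>>>             suggestion = json.loads(response.read().decode("utf-8"))
>>>         # Assumed response shape: "preferred" is the mirror base URL,
>>>         # "path_info" is the path to the file underneath it.
>>>         return suggestion["preferred"].rstrip("/") + "/" + suggestion["path_info"].lstrip("/")
>>>
>>>     def download_url(path):
>>>         # Mirrors only carry recent releases, so fall back to archive.apache.org
>>>         # when the suggested mirror doesn't actually have the file.
>>>         url = resolve_mirror_url(path)
>>>         try:
>>>             urllib.request.urlopen(urllib.request.Request(url, method="HEAD"))
>>>             return url
>>>         except urllib.error.URLError:
>>>             return ARCHIVE + path
>>>
>>>     print(download_url("spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz"))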
>>> Some of these problems can be mitigated by picking a specific mirror that works well and hardcoding it in your scripts, but that defeats the purpose of dynamically selecting a mirror and makes you a “bad” user of the mirror network.
>>>
>>> I raised some of these issues over on INFRA-10999 <https://issues.apache.org/jira/browse/INFRA-10999>. The ticket sat for a year before I heard anything back, and the bottom line was that none of the above problems have a solution on the horizon. It’s fine. I understand that Apache is a volunteer organization and that the infrastructure team has a lot to manage as it is. I still find it disappointing that an organization of Apache’s stature doesn’t have a better solution for this in collaboration with a third party. Python serves PyPI downloads using Fastly <https://www.fastly.com/> and Homebrew serves packages using Bintray <https://bintray.com/>. They both work really, really well. Why don’t we have something as good for Apache projects? Anyway, that’s a separate discussion.
>>>
>>> What I want to say is this:
>>>
>>> Dear whoever owns the spark-related-packages S3 bucket <https://s3.amazonaws.com/spark-related-packages/>,
>>>
>>> Please keep the bucket up-to-date with the latest Spark releases, alongside the past releases that are already on there. It’s a huge help to the Flintrock <https://github.com/nchammas/flintrock> project, and it’s an equally big help to those of us writing infrastructure automation scripts that deploy Spark in other contexts.
>>>
>>> I understand that hosting this stuff is not free, and that I am not paying anything for this service. If it needs to go, so be it. But I wanted to take this opportunity to lay out the benefits I’ve enjoyed thanks to having this bucket around, and to make sure that if it did die, it didn’t die a quiet death.
>>>
>>> Sincerely,
>>> Nick