Re: [vote] Apache Spark 3.0 RC3

2020-06-08 Thread Jiaxin Shan
+1
I built the binary using the following command and tested Spark workloads
on Kubernetes (AWS EKS); everything is working well.

./dev/make-distribution.sh --name spark-v3.0.0-rc3-20200608 --tgz
-Phadoop-3.2 -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud
-Pscala-2.12
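
The Kubernetes smoke test was along these lines (a sketch; the registry,
EKS API endpoint, and service account are placeholders, not the exact
values used):

./bin/docker-image-tool.sh -r <registry>/spark -t v3.0.0-rc3 build
./bin/docker-image-tool.sh -r <registry>/spark -t v3.0.0-rc3 push

./bin/spark-submit \
  --master k8s://https://<eks-api-endpoint>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<registry>/spark:v3.0.0-rc3 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar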

On Mon, Jun 8, 2020 at 7:13 PM Bryan Cutler  wrote:

> +1 (non-binding)
>
> On Mon, Jun 8, 2020, 1:49 PM Tom Graves 
> wrote:
>
>> +1
>>
>> Tom
>>
>> On Saturday, June 6, 2020, 03:09:09 PM CDT, Reynold Xin <
>> r...@databricks.com> wrote:
>>
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.0.0.
>>
>> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes
>> are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.0.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.0.0-rc3 (commit
>> 3fdfce3120f307147244e5eaf46d61419a723d50):
>> https://github.com/apache/spark/tree/v3.0.0-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
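>>
>> As a concrete example, signature verification looks roughly like this (a
>> sketch; the exact artifact name is assumed from the -bin directory
>> listing above):
>>
>> curl -O https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7.tgz
>> curl -O https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7.tgz.asc
>> curl https://dist.apache.org/repos/dist/dev/spark/KEYS | gpg --import
>> gpg --verify spark-3.0.0-bin-hadoop2.7.tgz.asc spark-3.0.0-bin-hadoop2.7.tgz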
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1350/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>>
>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>
>> This release was built using the release script from the v3.0.0-rc3 tag.
>>
>> FAQ
>>
>> =========================
>> How can I help test this release?
>> =========================
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload, running it on this release candidate, and
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env, install
>> the current RC, and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
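>>
>> A minimal PySpark check might look like the following (a sketch; the
>> pyspark tarball name is assumed from the -bin directory listing):
>>
>> python3 -m venv /tmp/spark-3.0.0-rc3
>> source /tmp/spark-3.0.0-rc3/bin/activate
>> pip install https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz
>> python -c "from pyspark.sql import SparkSession; print(SparkSession.builder.master('local[2]').getOrCreate().range(100).count())"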
>>
>> ===========================================================
>> What should happen to JIRA tickets still targeting 3.0.0?
>> ===========================================================
>>
>> The current list of open tickets targeted at 3.0.0 can be found at
>> https://issues.apache.org/jira/projects/SPARK by searching for "Target
>> Version/s" = 3.0.0.
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Please retarget everything else to an
>> appropriate release.
>>
>> ==================
>> But my bug isn't fixed?
>> ==================
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is a regression that has not been
>> correctly targeted, please ping me or a committer to help target the
>> issue.
>>
>>
>>

-- 
Best Regards!
Jiaxin Shan
Tel:  412-230-7670
Address: 470 2nd Ave S, Kirkland, WA


Re: [VOTE][RESULT] Spark 2.4.5 (RC2)

2020-02-06 Thread Jiaxin Shan
+1. Another data point: I built with `-Phadoop-2.7 -Pkubernetes -Phive`.

I tested 2.4.5 on Kubernetes (Amazon EKS 1.14) and it works well. Sorry
for the late reply.
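
For reference, the full build command was along these lines (a sketch
reconstructed from the profiles above; the --name value is a placeholder):

./dev/make-distribution.sh --name spark-v2.4.5 --tgz -Phadoop-2.7 -Pkubernetes -Phive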

On Wed, Feb 5, 2020 at 11:11 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> The vote passes. Thanks to all who helped with this release 2.4.5!
> I'll follow up later with a release announcement once everything is
> published.
>
> +1 (* = binding):
> - Dongjoon Hyun *
> - Wenchen Fan *
> - Hyukjin Kwon *
> - Takeshi Yamamuro
> - Maxim Gekk
> - Sean Owen *
>
> +0: None
>
> -1: None
>
> Bests,
> Dongjoon.
>


-- 
Best Regards!
Jiaxin Shan
Tel:  412-230-7670
Address: 470 2nd Ave S, Kirkland, WA


Re: Apache Spark Docker image repository

2020-02-05 Thread Jiaxin Shan
I will vote for this. It's pretty helpful to have managed Spark images.
Currently, users have to download the Spark binaries and build their own.
With this supported, the user journey will be simplified: we would only
need to build an application image on top of the base image provided by
the community.

Do we need to support different OSes or architectures? If not, there will
be three container images (Java, R, Python) for every release.
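
For context, today every user repeats something like the following against
a downloaded distribution (a sketch; the registry and tag are placeholders,
and the Dockerfile paths are the ones shipped in the binary distribution).
A community base image would remove exactly this step:

./bin/docker-image-tool.sh -r <registry>/spark -t <tag> \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  -R ./kubernetes/dockerfiles/spark/bindings/R/Dockerfile \
  build
./bin/docker-image-tool.sh -r <registry>/spark -t <tag> push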


On Wed, Feb 5, 2020 at 2:56 PM Sean Owen  wrote:

> What would the images have - just the image for a worker?
> We wouldn't want to publish N permutations of Python, R, OS, Java, etc.
> But if we don't then we make one or a few choices of that combo, and
> then I wonder how many people find the image useful.
> If the goal is just to support Spark testing, that seems fine and
> tractable, but does it need to be 'public' as in advertised as a
> convenience binary? vs just some image that's hosted somewhere for the
> benefit of project infra.
>
> On Wed, Feb 5, 2020 at 12:16 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > From 2020, shall we have an official Docker image repository as an
> > additional distribution channel?
> >
> > I'm considering the following images.
> >
> > - Public binary release (no snapshot image)
> > - Public non-Spark base image (OS + R + Python)
> >   (This can be used in GitHub Actions jobs and Jenkins K8s
> > Integration Tests to speed up jobs and to have more stable environments)
> >
> > Bests,
> > Dongjoon.
>
>
>

-- 
Best Regards!
Jiaxin Shan
Tel:  412-230-7670
Address: 470 2nd Ave S, Kirkland, WA


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-01 Thread Jiaxin Shan
+1 for Hadoop 3.2. It seems much of the cloud integration work Steve has
done is only available in 3.2. We see lots of users asking for better S3A
support in Spark.
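
For example, the typical ask is along these lines (a sketch, assuming a
build with -Phadoop-3.2 and -Phadoop-cloud; the bucket is a placeholder):

./bin/spark-shell \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain

scala> spark.read.text("s3a://<bucket>/logs/").count()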

On Fri, Nov 1, 2019 at 9:46 AM Xiao Li  wrote:

> Hi, Steve,
>
> Thanks for your comments! My major quality concern is not with Hadoop
> 3.2 itself. In this release, the Hive execution module upgrade (from 1.2 to
> 2.3), the Hive thrift-server upgrade, and JDK 11 support are added to the
> Hadoop 3.2 profile only. Compared with the Hadoop 2.x profile, the Hadoop
> 3.2 profile is riskier due to these changes.
>
> To speed up the adoption of Spark 3.0, which has many other highly
> desirable features, I am proposing to keep the Hadoop 2.x profile as the
> default.
>
> Cheers,
>
> Xiao.
>
>
>
> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran  wrote:
>
>> What is the current default value? The 2.x releases are becoming EOL:
>> 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the branch-2
>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>> there will inevitably be surprises.
>>
>> One issue with using older versions is that any problem reported
>> -especially stack traces you can blame me for- will generally be met by
>> a response of "does it go away when you upgrade?" The other issue is how
>> much test coverage things are getting.
>>
>> w.r.t. Hadoop 3.2 stability, nothing major has been reported. The ABFS
>> client is there, and the big Guava update (HADOOP-16213) went in. People
>> will either love or hate that.
>>
>> No major changes in the s3a code between 3.2.0 and 3.2.1; I have a large
>> backport planned though, including changes to better handle AWS caching of
>> 404s generated from HEAD requests before an object was actually created.
>>
>> It would be really good if the Spark distributions shipped with later
>> versions of the Hadoop artifacts.
>>
>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li  wrote:
>>
>>> The stability and quality of the Hadoop 3.2 profile are unknown. The
>>> changes are massive, including the Hive execution upgrade and a new
>>> version of the Hive thriftserver.
>>>
>>> To reduce the risk, I would like to keep the current default version
>>> unchanged. When it becomes stable, we can change the default profile to
>>> Hadoop-3.2.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen  wrote:
>>>
>>>> I'm OK with that, but I don't have a strong opinion or info about the
>>>> implications.
>>>> That said my guess is we're close to the point where we don't need to
>>>> support Hadoop 2.x anyway, so, yeah.
>>>>
>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun 
>>>> wrote:
>>>> >
>>>> > Hi, All.
>>>> >
>>>> > There was a discussion on publishing artifacts built with Hadoop 3.
>>>> > But we are still publishing with Hadoop 2.7.3, and `3.0-preview` will
>>>> > be the same because we didn't change anything yet.
>>>> >
>>>> > Technically, we need to change two places for publishing.
>>>> >
>>>> > 1. Jenkins Snapshot Publishing
>>>> >
>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>> >
>>>> > 2. Release Snapshot/Release Publishing
>>>> >
>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>> >
>>>> > To minimize the change, we need to switch our default Hadoop profile.
>>>> >
>>>> > Currently, the default is the `hadoop-2.7` (2.7.4) profile, and
>>>> > `hadoop-3.2` (3.2.0) is optional.
>>>> > We had better use the `hadoop-3.2` profile by default and `hadoop-2.7`
>>>> > optionally.
>>>> >
>>>> > Note that this means we use Hive 2.3.6 by default. Only the
>>>> > `hadoop-2.7` distribution will use Hive 1.2.1, like Apache Spark 2.4.x.
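>>>> >
>>>> > In build terms, the switch is roughly the following (a sketch, not
>>>> > the exact release-build.sh invocation):
>>>> >
>>>> > # current default profile
>>>> > ./build/mvn -Phadoop-2.7 -DskipTests clean package
>>>> > # proposed default profile
>>>> > ./build/mvn -Phadoop-3.2 -DskipTests clean package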
>>>> >
>>>> > Bests,
>>>> > Dongjoon.
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>


-- 
Best Regards!
Jiaxin Shan
Tel:  412-230-7670
Address: 470 2nd Ave S, Kirkland, WA