Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread Aniket Mokashi
Felix - please add me to this event.

Ben - should we move this proposal to a doc and open it up for
edits/comments?

On Wed, Nov 20, 2019 at 5:37 PM Felix Cheung 
wrote:

> Great!
>
> Due to a number of constraints I won’t be sending the link directly here, but
> please reply to me and I will add you.
>
>
> --
> *From:* Ben Sidhom 
> *Sent:* Wednesday, November 20, 2019 9:10:01 AM
> *To:* John Zhuge 
> *Cc:* bo yang ; Amogh Margoor ;
> Ryan Blue ; Ben Sidhom ;
> Spark Dev List ; Christopher Crosbie <
> crosb...@google.com>; Griselda Cuevas ; Holden Karau <
> hol...@pigscanfly.ca>; Mayank Ahuja ; Kalyan Sivakumar
> ; alfo...@fb.com ; Felix Cheung <
> fel...@uber.com>; Matt Cheah ; Yifei Huang (PD) <
> yif...@palantir.com>
> *Subject:* Re: Enabling fully disaggregated shuffle on Spark
>
> That sounds great!
>
> On Wed, Nov 20, 2019 at 9:02 AM John Zhuge  wrote:
>
> That will be great. Please send us the invite.
>
> On Wed, Nov 20, 2019 at 8:56 AM bo yang  wrote:
>
> Cool, thanks Ryan, John, Amogh for the reply! Great to see you interested!
> Felix will have a Spark Scalability & Reliability Sync meeting on Dec 4 1pm
> PST. We could discuss more details there. Do you want to join?
>
> On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor  wrote:
>
> We at Qubole are also looking at disaggregating shuffle on Spark. Would
> love to collaborate and share learnings.
>
> Regards,
> Amogh
>
> On Tue, Nov 19, 2019 at 4:09 PM John Zhuge  wrote:
>
> Great work, Bo! Would love to hear the details.
>
>
> On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue 
> wrote:
>
> I'm interested in remote shuffle services as well. I'd love to hear about
> what you're using in production!
>
> rb
>
> On Tue, Nov 19, 2019 at 2:43 PM bo yang  wrote:
>
> Hi Ben,
>
> Thanks for the write-up! This is Bo from Uber. I am in Felix's team in
> Seattle, and working on disaggregated shuffle (we called it remote shuffle
> service, RSS, internally). We have put RSS into production for a while, and
> learned a lot during the work (tried quite a few techniques to improve the
> remote shuffle performance). We could share our learning with the
> community, and also would like to hear feedback/suggestions on how to
> further improve remote shuffle performance. We could chat in more detail if
> you or other people are interested.
>
> Best,
> Bo
>
> On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
> wrote:
>
> I would like to start a conversation about extending the Spark shuffle
> manager surface to support fully disaggregated shuffle implementations.
> This is closely related to the work in SPARK-25299
> , which is focused on
> refactoring the shuffle manager API (and in particular, SortShuffleManager)
> to use a pluggable storage backend. The motivation for that SPIP is further
> enabling Spark on Kubernetes.
>
>
> The motivation for this proposal is enabling fully externalized
> (disaggregated) shuffle service implementations. (Facebook’s Cosco shuffle
> 
> is one example of such a disaggregated shuffle service.) These changes
> allow the bulk of the shuffle to run in a remote service such that minimal
> state resides in executors and local disk spill is minimized. The net
> effect is increased job stability and performance improvements in certain
> scenarios. These changes should work well with or are complementary to
> SPARK-25299. Some or all points may be merged into that issue as
> appropriate.
>
>
> Below is a description of each component of this proposal. These changes
> can ideally be introduced incrementally. I would like to gather feedback
> and gauge interest from others in the community to collaborate on this.
> There are likely more points that would be useful to disaggregated shuffle
> services. We can outline a more concrete plan after gathering enough input.
> A working session could help us kick off this joint effort; maybe something
> in the mid-January to mid-February timeframe (depending on interest and
> availability). I’m happy to host at our Sunnyvale, CA offices.
>
>
> Proposal
> Scheduling and re-executing tasks
>
> Allow coordination between the service and the Spark DAG scheduler as to
> whether a given block/partition needs to be recomputed when a task fails or
> when shuffle block data cannot be read. Having such coordination is
> important, e.g., for suppressing recomputation after aborted executors or
> for forcing late recomputation if the service internally acts as a cache.
> One catchall solution is to have the shuffle manager provide an indication
> of whether shuffle data is external to executors (or nodes). Another
> option: allow the shuffle manager (likely on the driver) to be queried for
> the existence of shuffle data for a given executor ID (or perhaps map task,
> reduce task, etc). Note that this is at the level of data the scheduler is
> aware of (i.e., map/reduce partitions) rather 

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread Felix Cheung
Great!

Due to a number of constraints I won’t be sending the link directly here, but please reply to
me and I will add you.



From: Ben Sidhom 
Sent: Wednesday, November 20, 2019 9:10:01 AM
To: John Zhuge 
Cc: bo yang ; Amogh Margoor ; Ryan Blue 
; Ben Sidhom ; Spark Dev List 
; Christopher Crosbie ; Griselda 
Cuevas ; Holden Karau ; Mayank Ahuja 
; Kalyan Sivakumar ; alfo...@fb.com 
; Felix Cheung ; Matt Cheah 
; Yifei Huang (PD) 
Subject: Re: Enabling fully disaggregated shuffle on Spark

That sounds great!

On Wed, Nov 20, 2019 at 9:02 AM John Zhuge <jzh...@apache.org> wrote:
That will be great. Please send us the invite.

On Wed, Nov 20, 2019 at 8:56 AM bo yang <bobyan...@gmail.com> wrote:
Cool, thanks Ryan, John, Amogh for the reply! Great to see you interested! 
Felix will have a Spark Scalability & Reliability Sync meeting on Dec 4 1pm 
PST. We could discuss more details there. Do you want to join?

On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor <amo...@qubole.com> wrote:
We at Qubole are also looking at disaggregating shuffle on Spark. Would love to 
collaborate and share learnings.

Regards,
Amogh

On Tue, Nov 19, 2019 at 4:09 PM John Zhuge <jzh...@apache.org> wrote:
Great work, Bo! Would love to hear the details.


On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue  wrote:
I'm interested in remote shuffle services as well. I'd love to hear about what 
you're using in production!

rb

On Tue, Nov 19, 2019 at 2:43 PM bo yang <bobyan...@gmail.com> wrote:
Hi Ben,

Thanks for the write-up! This is Bo from Uber. I am in Felix's team in
Seattle, and working on disaggregated shuffle (we called it remote shuffle 
service, RSS, internally). We have put RSS into production for a while, and 
learned a lot during the work (tried quite a few techniques to improve the 
remote shuffle performance). We could share our learning with the community, 
and also would like to hear feedback/suggestions on how to further improve 
remote shuffle performance. We could chat in more detail if you or other people
are interested.

Best,
Bo

On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom  wrote:

I would like to start a conversation about extending the Spark shuffle manager 
surface to support fully disaggregated shuffle implementations. This is closely 
related to the work in 
SPARK-25299, which is 
focused on refactoring the shuffle manager API (and in particular, 
SortShuffleManager) to use a pluggable storage backend. The motivation for that 
SPIP is further enabling Spark on Kubernetes.


The motivation for this proposal is enabling fully externalized (disaggregated)
shuffle service implementations. (Facebook’s Cosco 
shuffle
 is one example of such a disaggregated shuffle service.) These changes allow 
the bulk of the shuffle to run in a remote service such that minimal state 
resides in executors and local disk spill is minimized. The net effect is 
increased job stability and performance improvements in certain scenarios. 
These changes should work well with or are complementary to SPARK-25299. Some 
or all points may be merged into that issue as appropriate.


Below is a description of each component of this proposal. These changes can 
ideally be introduced incrementally. I would like to gather feedback and gauge 
interest from others in the community to collaborate on this. There are likely 
more points that would be useful to disaggregated shuffle services. We can
outline a more concrete plan after gathering enough input. A working session
could help us kick off this joint effort; maybe something in the mid-January to
mid-February timeframe (depending on interest and availability). I’m happy to
host at our Sunnyvale, CA offices.


Proposal
Scheduling and re-executing tasks

Allow coordination between the service and the Spark DAG scheduler as to 
whether a given block/partition needs to be recomputed when a task fails or 
when shuffle block data cannot be read. Having such coordination is important, 
e.g., for suppressing recomputation after aborted executors or for forcing late 
recomputation if the service internally acts as a cache. One catchall solution 
is to have the shuffle manager provide an indication of whether shuffle data is 
external to executors (or nodes). Another option: allow the shuffle manager 
(likely on the driver) to be queried for the existence of shuffle data for a 
given executor ID (or perhaps map task, reduce task, etc). Note that this is at 
the level of data the scheduler is aware of (i.e., map/reduce partitions) 
rather than block IDs, which are internal details for some shuffle managers.
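As a rough illustration of the driver-side query option, here is a minimal Scala sketch. None of these trait or method names exist in Spark's ShuffleManager API; they are hypothetical and only mirror the two alternatives described above (a coarse "shuffle data is external" flag versus a per-map-output availability query).

object DisaggShuffleSketch {

  // Hypothetical driver-side hook; NOT part of Spark's ShuffleManager API.
  trait DriverShuffleDataInfo {
    // Coarse option: shuffle data lives outside executors/nodes, so losing an
    // executor does not by itself require recomputing its map outputs.
    def shuffleDataIsExternal: Boolean

    // Finer option: is the output of a given map task for a given shuffle
    // still readable from the external service (which may act as a cache)?
    def isMapOutputAvailable(shuffleId: Int, mapIndex: Int): Boolean
  }

  // Roughly how a scheduler-side check could use such a hook when an executor
  // is lost: return only the map indexes that actually need to be re-run.
  def mapOutputsToRecompute(
      info: DriverShuffleDataInfo,
      shuffleId: Int,
      lostMapIndexes: Seq[Int]): Seq[Int] = {
    if (info.shuffleDataIsExternal) {
      Seq.empty // nothing was lost together with the executor
    } else {
      lostMapIndexes.filterNot(i => info.isMapOutputAvailable(shuffleId, i))
    }
  }
}

The coarse flag alone is enough to suppress recomputation after executor loss, while the finer query supports a service that internally acts as a cache and may have evicted some outputs.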

ShuffleManager API

Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that the 
service knows that data is still active. This is one way to enable 
time-/job-scoped data because a disaggregated shuffle 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Dongjoon Hyun
Thank you all.

I'll try to make a JIRA and PR for that.

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 4:08 PM Cheng Lian  wrote:

> Sean, thanks for the corner cases you listed. They make a lot of sense.
> Now I do incline to have Hive 2.3 as the default version.
>
> Dongjoon, apologize if I didn't make it clear before. What made me
> concerned initially was only the following part:
>
> > can we remove the usage of forked `hive` in Apache Spark 3.0 completely
> officially?
>
> So having Hive 2.3 as the default Hive version and adding a `hive-1.2`
> profile to keep the Hive 1.2.1 fork looks like a feasible approach to me.
> Thanks for starting the discussion!
>
> On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun 
> wrote:
>
>> Yes. Right. That's the situation we are hitting and the result I expected.
>> We need to change our default with Hive 2 in the POM.
>>
>> Dongjoon.
>>
>>
>> On Wed, Nov 20, 2019 at 5:20 AM Sean Owen  wrote:
>>
>>> Yes, good point. A user would get whatever the POM says without
>>> profiles enabled so it matters.
>>>
>>> Playing it out, an app _should_ compile with the Spark dependency
>>> marked 'provided'. In that case the app that is spark-submit-ted is
>>> agnostic to the Hive dependency as the only one that matters is what's
>>> on the cluster. Right? we don't leak through the Hive API in the Spark
>>> API. And yes it's then up to the cluster to provide whatever version
>>> it wants. Vendors will have made a specific version choice when
>>> building their distro one way or the other.
>>>
>>> If you run a Spark cluster yourself, you're using the binary distro,
>>> and we're already talking about also publishing a binary distro with
>>> this variation, so that's not the issue.
>>>
>>> The corner cases where it might matter are:
>>>
>>> - I unintentionally package Spark in the app and by default pull in
>>> Hive 2 when I will deploy against Hive 1. But that's user error, and
>>> causes other problems
>>> - I run tests locally in my project, which will pull in a default
>>> version of Hive defined by the POM
>>>
>>> Double-checking, is that right? if so it kind of implies it doesn't
>>> matter. Which is an argument either way about what's the default. I
>>> too would then prefer defaulting to Hive 2 in the POM. Am I missing
>>> something about the implication?
>>>
>>> (That fork will stay published forever anyway, that's not an issue per
>>> se.)
>>>
>>> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun 
>>> wrote:
>>> > Sean, our published POM is pointing and advertising the illegitimate
>>> Hive 1.2 fork as a compile dependency.
>>> > Yes. It can be overridden. So, why does Apache Spark need to publish
>>> like that?
>>> > If someone wants to use that illegitimate Hive 1.2 fork, let them
>>> override it. We are unable to delete that illegitimate Hive 1.2 fork.
>>> > Those artifacts will be orphans.
>>> >
>>>
>>


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Thank you for the thoughtful clarification. I agree with all your options.

Especially for the Hive Metastore connection, the `Hive isolated client loader` is
also important with Hive 2.3, because the Hive 2.3 client cannot talk to Hive
2.1 and lower. The `Hive isolated client loader` is one of the good designs in
Apache Spark.

One of the reasons I started this thread focusing on the fork is that we actually
*don't use* that fork.

https://mvnrepository.com/artifact/org.spark-project.hive/

Big companies (and vendors) already maintain their own fork of that fork, or have
upgraded its Hive dependency. So, when we say it's battle-tested, that is not
really the case. It's not tested.

The above repository has become something like a stranded phantom. We point to
that repo as a legacy interface, but we don't really use the code in large
production environments. Since there is no way to contribute back to that repo,
we also have a fragmentation problem in the experience with Hive 1.2.1: some may
say it's good while others still struggle without any community support.

Anyway, thank you so much for the conclusion.
I'll try to make a JIRA and PR for the `hive-1.2` profile first, as a conclusion.

Bests,
Dongjoon.


On Wed, Nov 20, 2019 at 4:10 PM Cheng Lian  wrote:

> Oh, actually, in order to decouple Hadoop 3.2 and Hive 2.3 upgrades, we
> will need a hive-2.3 profile anyway, no matter having the hive-1.2
> profile or not.
>
> On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian  wrote:
>
>> Just to summarize my points:
>>
>>1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is
>>optional. End-users may choose between Hive 1.2/2.3 via a new profile
>>(either adding a hive-1.2 profile or adding a hive-2.3 profile works for
>>me, depending on which Hive version we pick as the default version).
>>2. Decouple Hive version upgrade and Hadoop version upgrade, so that
>>people may have more choices in production, and makes Spark 3.0 migration
>>easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive
>>2.3 and/or JDK 11.).
>>3. For default Hadoop/Hive versions in Spark 3.0, I personally do not
>>have a preference as long as the above two are met.
>>
>>
>> On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian  wrote:
>>
>>> Dongjoon, I don't think we have any conflicts here. As stated in other
>>> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
>>> can be decoupled, I have no preference over picking which Hive/Hadoop
>>> version as the default version. So the following two plans both work for me:
>>>
>>>1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and
>>>have an extra hive-2.3 profile.
>>>2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and
>>>have an extra hive-1.2 profile.
>>>
>>> BTW, I was also discussing Hive dependency issues with other people
>>> offline, and I realized that the Hive isolated client loader is not well
>>> known, and caused unnecessary confusion/worry. So I would like to provide
>>> some background context for readers who are not familiar with Spark Hive
>>> integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean
>>> that you can only interact with Hive 1.2.1.*
>>>
>>> Spark does work with different versions of Hive metastore via an
>>> isolated classloading mechanism. *Even if Spark itself is built with
>>> the Hive 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and
>>> this has been true ever since Spark 1.x.* In order to do this, just set
>>> the following two options according to instructions in our official doc
>>> page
>>> 
>>> :
>>>
>>>- spark.sql.hive.metastore.version
>>>- spark.sql.hive.metastore.jars
>>>
>>> Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
>>> "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
>>> dependencies from Maven at runtime when initializing the Hive metastore
>>> client. And those dependencies will NOT conflict with the built-in Hive
>>> 1.2.1 jars, because the downloaded jars are loaded using an isolated
>>> classloader (see here
>>> ).
>>> Historically, we call these two sets of Hive dependencies "execution Hive"
>>> and "metastore Hive". The former is mostly used for features like SerDe,
>>> while the latter is used to interact with Hive metastore. And the Hive
>>> version upgrade we are discussing here is about the execution Hive.
>>>
>>> Cheng
>>>
>>> On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
>>> wrote:
>>>
 Nice. That's a progress.

 Let's narrow down to the path. We need to clarify what is the criteria
 we can agree.

 1. What does `battle-tested for years` mean 

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Oh, actually, in order to decouple Hadoop 3.2 and Hive 2.3 upgrades, we
will need a hive-2.3 profile anyway, no matter having the hive-1.2 profile
or not.

On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian  wrote:

> Just to summarize my points:
>
>1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is
>optional. End-users may choose between Hive 1.2/2.3 via a new profile
>(either adding a hive-1.2 profile or adding a hive-2.3 profile works for
>me, depending on which Hive version we pick as the default version).
>2. Decouple Hive version upgrade and Hadoop version upgrade, so that
>people may have more choices in production, and makes Spark 3.0 migration
>easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive
>2.3 and/or JDK 11.).
>3. For default Hadoop/Hive versions in Spark 3.0, I personally do not
>have a preference as long as the above two are met.
>
>
> On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian  wrote:
>
>> Dongjoon, I don't think we have any conflicts here. As stated in other
>> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
>> can be decoupled, I have no preference over picking which Hive/Hadoop
>> version as the default version. So the following two plans both work for me:
>>
>>1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and
>>have an extra hive-2.3 profile.
>>2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and
>>have an extra hive-1.2 profile.
>>
>> BTW, I was also discussing Hive dependency issues with other people
>> offline, and I realized that the Hive isolated client loader is not well
>> known, and caused unnecessary confusion/worry. So I would like to provide
>> some background context for readers who are not familiar with Spark Hive
>> integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
>> you can only interact with Hive 1.2.1.*
>>
>> Spark does work with different versions of Hive metastore via an isolated
>> classloading mechanism. *Even if Spark itself is built with the Hive
>> 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has
>> been true ever since Spark 1.x.* In order to do this, just set the
>> following two options according to instructions in our official doc page
>> 
>> :
>>
>>- spark.sql.hive.metastore.version
>>- spark.sql.hive.metastore.jars
>>
>> Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
>> "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
>> dependencies from Maven at runtime when initializing the Hive metastore
>> client. And those dependencies will NOT conflict with the built-in Hive
>> 1.2.1 jars, because the downloaded jars are loaded using an isolated
>> classloader (see here
>> ).
>> Historically, we call these two sets of Hive dependencies "execution Hive"
>> and "metastore Hive". The former is mostly used for features like SerDe,
>> while the latter is used to interact with Hive metastore. And the Hive
>> version upgrade we are discussing here is about the execution Hive.
>>
>> Cheng
>>
>> On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
>> wrote:
>>
>>> Nice. That's a progress.
>>>
>>> Let's narrow down to the path. We need to clarify what is the criteria
>>> we can agree.
>>>
>>> 1. What does `battle-tested for years` mean exactly?
>>> How and when can we start the `battle-tested` stage for Hive 2.3?
>>>
>>> 2. What is the new "Hive integration in Spark"?
>>> During introducing Hive 2.3, we fixed the compatibility stuff as you
>>> said.
>>> Most of code is shared for Hive 1.2 and Hive 2.3.
>>> That means if there is a bug inside this shared code, both of them
>>> will be affected.
>>> Of course, we can fix this because it's Spark code. We will learn
>>> and fix it as you said.
>>>
>>> >  Yes, there are issues, but people have learned how to get along
>>> with these issues.
>>>
>>> The only non-shared code are the following.
>>> Do you have a concern on the following directories?
>>> If there is no bugs on the following codebase, can we switch?
>>>
>>> $ find . -name v2.3.5
>>> ./sql/core/v2.3.5
>>> ./sql/hive-thriftserver/v2.3.5
>>>
>>> 3. We know that we can keep both code bases, but the community should
>>> choose Hive 2.3 officially.
>>> That's the right choice in the Apache project policy perspective. At
>>> least, Sean and I prefer that.
>>> If someone really want to stick to Hive 1.2 fork, they can use it at
>>> their own risks.
>>>
>>> > for Spark 3.0 end-users who really don't want to interact with
>>> this Hive 1.2 fork, they can always use Hive 2.3 at their own risks.
>>>
>>> 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Cheng Lian
Sean, thanks for the corner cases you listed. They make a lot of sense. Now
I am inclined to have Hive 2.3 as the default version.

Dongjoon, apologies if I didn't make it clear before. What made me
concerned initially was only the following part:

> can we remove the usage of forked `hive` in Apache Spark 3.0 completely
officially?

So having Hive 2.3 as the default Hive version and adding a `hive-1.2`
profile to keep the Hive 1.2.1 fork looks like a feasible approach to me.
Thanks for starting the discussion!

On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun 
wrote:

> Yes. Right. That's the situation we are hitting and the result I expected.
> We need to change our default with Hive 2 in the POM.
>
> Dongjoon.
>
>
> On Wed, Nov 20, 2019 at 5:20 AM Sean Owen  wrote:
>
>> Yes, good point. A user would get whatever the POM says without
>> profiles enabled so it matters.
>>
>> Playing it out, an app _should_ compile with the Spark dependency
>> marked 'provided'. In that case the app that is spark-submit-ted is
>> agnostic to the Hive dependency as the only one that matters is what's
>> on the cluster. Right? we don't leak through the Hive API in the Spark
>> API. And yes it's then up to the cluster to provide whatever version
>> it wants. Vendors will have made a specific version choice when
>> building their distro one way or the other.
>>
>> If you run a Spark cluster yourself, you're using the binary distro,
>> and we're already talking about also publishing a binary distro with
>> this variation, so that's not the issue.
>>
>> The corner cases where it might matter are:
>>
>> - I unintentionally package Spark in the app and by default pull in
>> Hive 2 when I will deploy against Hive 1. But that's user error, and
>> causes other problems
>> - I run tests locally in my project, which will pull in a default
>> version of Hive defined by the POM
>>
>> Double-checking, is that right? if so it kind of implies it doesn't
>> matter. Which is an argument either way about what's the default. I
>> too would then prefer defaulting to Hive 2 in the POM. Am I missing
>> something about the implication?
>>
>> (That fork will stay published forever anyway, that's not an issue per
>> se.)
>>
>> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun 
>> wrote:
>> > Sean, our published POM is pointing and advertising the illegitimate
>> Hive 1.2 fork as a compile dependency.
>> > Yes. It can be overridden. So, why does Apache Spark need to publish
>> like that?
>> > If someone wants to use that illegitimate Hive 1.2 fork, let them
>> override it. We are unable to delete that illegitimate Hive 1.2 fork.
>> > Those artifacts will be orphans.
>> >
>>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Cheng Lian
Hey Nicholas,

Thanks for pointing this out. I just realized that I misread the
spark-hadoop-cloud POM. Previously, in Spark 2.4, two profiles,
"hadoop-2.7" and "hadoop-3.1", were referenced in the spark-hadoop-cloud
POM (here
 and
here ).
But in the current master (3.0.0-SNAPSHOT), only the "hadoop-3.2" profile
is mentioned. And I came to the wrong conclusion that spark-hadoop-cloud in
Spark 3.0.0 is only available with the "hadoop-3.2" profile. Apologies for
the misleading information.

Cheng



On Tue, Nov 19, 2019 at 8:57 PM Nicholas Chammas 
wrote:

> > I don't think the default Hadoop version matters except for the
> spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2
> profile.
>
> What do you mean by "only meaningful under the hadoop-3.2 profile"?
>
> On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian  wrote:
>
>> Hey Steve,
>>
>> In terms of Maven artifact, I don't think the default Hadoop version
>> matters except for the spark-hadoop-cloud module, which is only meaningful
>> under the hadoop-3.2 profile. All the other spark-* artifacts published to
>> Maven central are Hadoop-version-neutral.
>>
>> Another issue about switching the default Hadoop version to 3.2 is
>> PySpark distribution. Right now, we only publish PySpark artifacts prebuilt
>> with Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency
>> to 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
>> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>>
>> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via
>> the proposed hive-2.3 profile, I personally don't have a preference over
>> having Hadoop 2.7 or 3.2 as the default Hadoop version. But just for
>> minimizing the release management work, in case we decided to publish other
>> spark-* Maven artifacts from a Hadoop 2.7 build, we can still special case
>> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>>
>> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
>> wrote:
>>
>>> I also agree with Steve and Felix.
>>>
>>> Let's have another thread to discuss Hive issue
>>>
>>> because this thread was originally for `hadoop` version.
>>>
>>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>>> `hadoop-3.0` versions.
>>>
>>> We don't need to mix both.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
>>> wrote:
>>>
 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
 It is old and rather buggy; and it’s been *years*

 I think we should decouple hive change from everything else if people
 are concerned?

 --
 *From:* Steve Loughran 
 *Sent:* Sunday, November 17, 2019 9:22:09 AM
 *To:* Cheng Lian 
 *Cc:* Sean Owen ; Wenchen Fan ;
 Dongjoon Hyun ; dev ;
 Yuming Wang 
 *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

 Can I take this moment to remind everyone that the version of hive
 which spark has historically bundled (the org.spark-project one) is an
 orphan project put together to deal with Hive's shading issues and a source
 of unhappiness in the Hive project. What ever get shipped should do its
 best to avoid including that file.

 Postponing a switch to hadoop 3.x after spark 3.0 is probably the
 safest move from a risk minimisation perspective. If something has broken,
 then you can start with the assumption that it is in the o.a.s
 packages without having to debug o.a.hadoop and o.a.hive first. There is a
 cost: if there are problems with the hadoop / hive dependencies, those teams
 will inevitably ignore filed bug reports, for the same reason the spark team
 will probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for
 the Hadoop 2.x line include any compatibility issues with Java 9+. Do bear
 that in mind. It's not been tested, it has dependencies on artifacts we
 know are incompatible, and as far as the Hadoop project is concerned:
 people should move to branch 3 if they want to run on a modern version of
 Java

 It would be really really good if the published spark maven artefacts
 (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop
 3.x. That way people doing things with their own projects will get
 up-to-date dependencies and don't get WONTFIX responses themselves.

 -Steve

 PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
 ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
 the transition release.

 On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian 
 wrote:

 Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
 thought the 

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Just to summarize my points:

   1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but make it
   optional. End-users may choose between Hive 1.2/2.3 via a new profile
   (either adding a hive-1.2 profile or adding a hive-2.3 profile works for
   me, depending on which Hive version we pick as the default version).
   2. Decouple Hive version upgrade and Hadoop version upgrade, so that
   people may have more choices in production, and Spark 3.0 migration becomes
   easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive
   2.3 and/or JDK 11.).
   3. For default Hadoop/Hive versions in Spark 3.0, I personally do not
   have a preference as long as the above two are met.


On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian  wrote:

> Dongjoon, I don't think we have any conflicts here. As stated in other
> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
> can be decoupled, I have no preference over picking which Hive/Hadoop
> version as the default version. So the following two plans both work for me:
>
>1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and have
>an extra hive-2.3 profile.
>2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and
>have an extra hive-1.2 profile.
>
> BTW, I was also discussing Hive dependency issues with other people
> offline, and I realized that the Hive isolated client loader is not well
> known, and caused unnecessary confusion/worry. So I would like to provide
> some background context for readers who are not familiar with Spark Hive
> integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
> you can only interact with Hive 1.2.1.*
>
> Spark does work with different versions of Hive metastore via an isolated
> classloading mechanism. *Even if Spark itself is built with the Hive
> 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has
> been true ever since Spark 1.x.* In order to do this, just set the
> following two options according to instructions in our official doc page
> 
> :
>
>- spark.sql.hive.metastore.version
>- spark.sql.hive.metastore.jars
>
> Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
> "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
> dependencies from Maven at runtime when initializing the Hive metastore
> client. And those dependencies will NOT conflict with the built-in Hive
> 1.2.1 jars, because the downloaded jars are loaded using an isolated
> classloader (see here
> ).
> Historically, we call these two sets of Hive dependencies "execution Hive"
> and "metastore Hive". The former is mostly used for features like SerDe,
> while the latter is used to interact with Hive metastore. And the Hive
> version upgrade we are discussing here is about the execution Hive.
>
> Cheng
>
> On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
> wrote:
>
>> Nice. That's a progress.
>>
>> Let's narrow down to the path. We need to clarify what is the criteria we
>> can agree.
>>
>> 1. What does `battle-tested for years` mean exactly?
>> How and when can we start the `battle-tested` stage for Hive 2.3?
>>
>> 2. What is the new "Hive integration in Spark"?
>> During introducing Hive 2.3, we fixed the compatibility stuff as you
>> said.
>> Most of code is shared for Hive 1.2 and Hive 2.3.
>> That means if there is a bug inside this shared code, both of them
>> will be affected.
>> Of course, we can fix this because it's Spark code. We will learn and
>> fix it as you said.
>>
>> >  Yes, there are issues, but people have learned how to get along
>> with these issues.
>>
>> The only non-shared code are the following.
>> Do you have a concern on the following directories?
>> If there is no bugs on the following codebase, can we switch?
>>
>> $ find . -name v2.3.5
>> ./sql/core/v2.3.5
>> ./sql/hive-thriftserver/v2.3.5
>>
>> 3. We know that we can keep both code bases, but the community should
>> choose Hive 2.3 officially.
>> That's the right choice in the Apache project policy perspective. At
>> least, Sean and I prefer that.
>> If someone really want to stick to Hive 1.2 fork, they can use it at
>> their own risks.
>>
>> > for Spark 3.0 end-users who really don't want to interact with this
>> Hive 1.2 fork, they can always use Hive 2.3 at their own risks.
>>
>> Specifically, what about having a profile `hive-1.2` at `3.0.0` with the
>> default Hive 2.3 pom at least?
>> How do you think about that way, Cheng?
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian 
>> wrote:
>>
>>> Hey Dongjoon and Felix,
>>>
>>> I totally agree that Hive 2.3 is more 

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Dongjoon, I don't think we have any conflicts here. As stated in other
threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
can be decoupled, I have no preference over which Hive/Hadoop version we pick
as the default. So the following two plans both work for me:

   1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and have
   an extra hive-2.3 profile.
   2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and have
   an extra hive-1.2 profile.

BTW, I was also discussing Hive dependency issues with other people
offline, and I realized that the Hive isolated client loader is not well
known, and caused unnecessary confusion/worry. So I would like to provide
some background context for readers who are not familiar with Spark Hive
integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
you can only interact with Hive 1.2.1.*

Spark does work with different versions of Hive metastore via an isolated
classloading mechanism. *Even if Spark itself is built with the Hive 1.2.1
fork, you can still interact with a Hive 2.3 metastore, and this has been
true ever since Spark 1.x.* In order to do this, just set the following two
options according to instructions in our official doc page

:

   - spark.sql.hive.metastore.version
   - spark.sql.hive.metastore.jars

Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
"spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
dependencies from Maven at runtime when initializing the Hive metastore
client. And those dependencies will NOT conflict with the built-in Hive
1.2.1 jars, because the downloaded jars are loaded using an isolated
classloader (see here
).
Historically, we call these two sets of Hive dependencies "execution Hive"
and "metastore Hive". The former is mostly used for features like SerDe,
while the latter is used to interact with Hive metastore. And the Hive
version upgrade we are discussing here is about the execution Hive.
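As a minimal sketch of the above (the two config keys and the "maven" value are the documented ones; the application name and metastore URI below are placeholders you would replace with your own):

import org.apache.spark.sql.SparkSession

// A Hive-enabled session whose metastore client runs Hive 2.3.6 jars pulled
// from Maven at runtime, independently of the execution Hive that Spark
// itself was built against.
val spark = SparkSession.builder()
  .appName("metastore-2.3-example") // placeholder
  .config("spark.sql.hive.metastore.version", "2.3.6")
  .config("spark.sql.hive.metastore.jars", "maven")
  .config("hive.metastore.uris", "thrift://your-metastore-host:9083") // placeholder
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()

With this, the downloaded Hive 2.3.6 metastore client lives in its own isolated classloader and does not clash with whichever execution Hive the Spark build bundles.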

Cheng

On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
wrote:

> Nice. That's a progress.
>
> Let's narrow down to the path. We need to clarify what is the criteria we
> can agree.
>
> 1. What does `battle-tested for years` mean exactly?
> How and when can we start the `battle-tested` stage for Hive 2.3?
>
> 2. What is the new "Hive integration in Spark"?
> During introducing Hive 2.3, we fixed the compatibility stuff as you
> said.
> Most of code is shared for Hive 1.2 and Hive 2.3.
> That means if there is a bug inside this shared code, both of them
> will be affected.
> Of course, we can fix this because it's Spark code. We will learn and
> fix it as you said.
>
> >  Yes, there are issues, but people have learned how to get along
> with these issues.
>
> The only non-shared code are the following.
> Do you have a concern on the following directories?
> If there is no bugs on the following codebase, can we switch?
>
> $ find . -name v2.3.5
> ./sql/core/v2.3.5
> ./sql/hive-thriftserver/v2.3.5
>
> 3. We know that we can keep both code bases, but the community should
> choose Hive 2.3 officially.
> That's the right choice in the Apache project policy perspective. At
> least, Sean and I prefer that.
> If someone really want to stick to Hive 1.2 fork, they can use it at
> their own risks.
>
> > for Spark 3.0 end-users who really don't want to interact with this
> Hive 1.2 fork, they can always use Hive 2.3 at their own risks.
>
> Specifically, what about having a profile `hive-1.2` at `3.0.0` with the
> default Hive 2.3 pom at least?
> How do you think about that way, Cheng?
>
> Bests,
> Dongjoon.
>
>
> On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian  wrote:
>
>> Hey Dongjoon and Felix,
>>
>> I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we
>> wouldn't even consider integrating with Hive 2.3 in Spark 3.0.
>>
>> However, *"Hive" and "Hive integration in Spark" are two quite different
>> things*, and I don't think anybody has ever mentioned "the forked Hive
>> 1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I
>> double-checked all my replies).
>>
>> What I really care about is the stability and quality of "Hive
>> integration in Spark", which have gone through some major updates due to
>> the recent Hive 2.3 upgrade in Spark 3.0. We had already found bugs in this
>> piece, and empirically, for a significant upgrade like this one, it is not
>> surprising that other bugs/regressions can be found in the near future. On
>> the other hand, the Hive 1.2 integration code path in Spark has been
>> battle-tested for years. Yes, there are issues, but people have learned 

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Nice. That's progress.

Let's narrow down the path. We need to clarify what criteria we can agree on.

1. What does `battle-tested for years` mean exactly?
How and when can we start the `battle-tested` stage for Hive 2.3?

2. What is the new "Hive integration in Spark"?
While introducing Hive 2.3, we fixed the compatibility stuff as you
said.
Most of the code is shared between Hive 1.2 and Hive 2.3.
That means if there is a bug inside this shared code, both of them will
be affected.
Of course, we can fix this because it's Spark code. We will learn and
fix it as you said.

>  Yes, there are issues, but people have learned how to get along with
these issues.

The only non-shared code is the following.
Do you have any concern about the following directories?
If there are no bugs in this codebase, can we switch?

$ find . -name v2.3.5
./sql/core/v2.3.5
./sql/hive-thriftserver/v2.3.5

3. We know that we can keep both code bases, but the community should
choose Hive 2.3 officially.
That's the right choice from the Apache project policy perspective. At
least, Sean and I prefer that.
If someone really wants to stick to the Hive 1.2 fork, they can use it at
their own risk.

> for Spark 3.0 end-users who really don't want to interact with this
Hive 1.2 fork, they can always use Hive 2.3 at their own risks.

Specifically, what about having a `hive-1.2` profile in `3.0.0`, with Hive 2.3 as the
default in the POM, at least?
What do you think about that, Cheng?

Bests,
Dongjoon.


On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian  wrote:

> Hey Dongjoon and Felix,
>
> I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we
> wouldn't even consider integrating with Hive 2.3 in Spark 3.0.
>
> However, *"Hive" and "Hive integration in Spark" are two quite different
> things*, and I don't think anybody has ever mentioned "the forked Hive
> 1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I
> double-checked all my replies).
>
> What I really care about is the stability and quality of "Hive integration
> in Spark", which have gone through some major updates due to the recent
> Hive 2.3 upgrade in Spark 3.0. We had already found bugs in this piece, and
> empirically, for a significant upgrade like this one, it is not surprising
> that other bugs/regressions can be found in the near future. On the other
> hand, the Hive 1.2 integration code path in Spark has been battle-tested
> for years. Yes, there are issues, but people have learned how to get along
> with these issues. And please don't forget that, for Spark 3.0 end-users
> who really don't want to interact with this Hive 1.2 fork, they can always
> use Hive 2.3 at their own risks.
>
> True, "stable" is quite vague a criterion, and hard to be proven. But that
> is exactly the reason why we may want to be conservative and wait for some
> time and see whether there are further signals suggesting that the Hive 2.3
> integration in Spark 3.0 is *unstable*. After one or two Spark 3.x minor
> releases, if we've fixed all the outstanding issues and no more significant
> ones are showing up, we can declare that the Hive 2.3 integration in Spark
> 3.x is stable, and then we can consider removing reference to the Hive 1.2
> fork. Does that make sense?
>
> Cheng
>
> On Wed, Nov 20, 2019 at 11:49 AM Felix Cheung 
> wrote:
>
>> Just to add - hive 1.2 fork is definitely not more stable. We know of a
>> few critical bug fixes that we cherry picked into a fork of that fork to
>> maintain ourselves.
>>
>>
>> --
>> *From:* Dongjoon Hyun 
>> *Sent:* Wednesday, November 20, 2019 11:07:47 AM
>> *To:* Sean Owen 
>> *Cc:* dev 
>> *Subject:* Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
>>
>> Thanks. That will be a giant step forward, Sean!
>>
>> > I'd prefer making it the default in the POM for 3.0.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Nov 20, 2019 at 11:02 AM Sean Owen  wrote:
>>
>> Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
>> same old and buggy that's been there a while. "stable" in that sense
>> I'm sure there is a lot more delta between Hive 1 and 2 in terms of
>> bug fixes that are important; the question isn't just 1.x releases.
>>
>> What I don't know is how much affects Spark, as it's a Hive client
>> mostly. Clearly some do.
>>
>> I'd prefer making it the default in the POM for 3.0. Mostly on the
>> grounds that its effects are on deployed clusters, not apps. And
>> deployers can still choose a binary distro with 1.x or make the choice
>> they want. Those that don't care should probably be nudged to 2.x.
>> Spark 3.x is already full of behavior changes and 'unstable', so I
>> think this is minor relative to the overall risk question.
>>
>> On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > I'm sending this email because it's important to discuss this topic
>> narrowly
>> > and make a clear 

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Hey Dongjoon and Felix,

I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we
wouldn't even consider integrating with Hive 2.3 in Spark 3.0.

However, *"Hive" and "Hive integration in Spark" are two quite different
things*, and I don't think anybody has ever mentioned "the forked Hive
1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I
double-checked all my replies).

What I really care about is the stability and quality of "Hive integration
in Spark", which have gone through some major updates due to the recent
Hive 2.3 upgrade in Spark 3.0. We had already found bugs in this piece, and
empirically, for a significant upgrade like this one, it is not surprising
that other bugs/regressions can be found in the near future. On the other
hand, the Hive 1.2 integration code path in Spark has been battle-tested
for years. Yes, there are issues, but people have learned how to get along
with these issues. And please don't forget that, for Spark 3.0 end-users
who really don't want to interact with this Hive 1.2 fork, they can always
use Hive 2.3 at their own risks.

True, "stable" is quite vague a criterion, and hard to be proven. But that
is exactly the reason why we may want to be conservative and wait for some
time and see whether there are further signals suggesting that the Hive 2.3
integration in Spark 3.0 is *unstable*. After one or two Spark 3.x minor
releases, if we've fixed all the outstanding issues and no more significant
ones are showing up, we can declare that the Hive 2.3 integration in Spark
3.x is stable, and then we can consider removing reference to the Hive 1.2
fork. Does that make sense?

Cheng

On Wed, Nov 20, 2019 at 11:49 AM Felix Cheung 
wrote:

> Just to add - hive 1.2 fork is definitely not more stable. We know of a
> few critical bug fixes that we cherry picked into a fork of that fork to
> maintain ourselves.
>
>
> --
> *From:* Dongjoon Hyun 
> *Sent:* Wednesday, November 20, 2019 11:07:47 AM
> *To:* Sean Owen 
> *Cc:* dev 
> *Subject:* Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
>
> Thanks. That will be a giant step forward, Sean!
>
> > I'd prefer making it the default in the POM for 3.0.
>
> Bests,
> Dongjoon.
>
> On Wed, Nov 20, 2019 at 11:02 AM Sean Owen  wrote:
>
> Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
> same old and buggy that's been there a while. "stable" in that sense
> I'm sure there is a lot more delta between Hive 1 and 2 in terms of
> bug fixes that are important; the question isn't just 1.x releases.
>
> What I don't know is how much affects Spark, as it's a Hive client
> mostly. Clearly some do.
>
> I'd prefer making it the default in the POM for 3.0. Mostly on the
> grounds that its effects are on deployed clusters, not apps. And
> deployers can still choose a binary distro with 1.x or make the choice
> they want. Those that don't care should probably be nudged to 2.x.
> Spark 3.x is already full of behavior changes and 'unstable', so I
> think this is minor relative to the overall risk question.
>
> On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > I'm sending this email because it's important to discuss this topic
> narrowly
> > and make a clear conclusion.
> >
> > `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> > by ignoring the existing bugs. If you want to say the forked Hive 1.2.1
> is
> > stabler than XXX, please give us the evidence. Then, we can fix it.
> > Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
> >
> > Historically, the following forked Hive 1.2.1 has never been stable.
> > It's just frozen. Since the forked Hive is out of our control, we
> ignored bugs.
> > That's all. The reality is a way far from the stable status.
> >
> > https://mvnrepository.com/artifact/org.spark-project.hive/
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
> (2015 August)
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
> (2016 April)
> >
> > First, let's begin Hive itself by comparing with Apache Hive 1.2.2 and
> 1.2.3,
> >
> > Apache Hive 1.2.2 has 50 bug fixes.
> > Apache Hive 1.2.3 has 9 bug fixes.
> >
> > I will not cover all of them, but Apache Hive community also backports
> > important patches like Apache Spark community.
> >
> > Second, let's move to SPARK issues because we aren't exposed to all Hive
> issues.
> >
> > SPARK-19109 ORC metadata section can sometimes exceed protobuf
> message size limit
> > SPARK-22267 Spark SQL incorrectly reads ORC file when column order
> is different
> >
> > These were reported since Apache Spark 1.6.x because the forked Hive
> doesn't have
> > a proper upstream patch like HIVE-11592 (fixed at Apache Hive 1.3.0).
> >
> > Since we couldn't update the frozen forked Hive, we added Apache ORC
> dependency
> > at SPARK-20682 (2.3.0), added a switching 

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Felix Cheung
Just to add - the hive 1.2 fork is definitely not more stable. We know of a few
critical bug fixes that we cherry-picked into a fork of that fork to maintain
ourselves.



From: Dongjoon Hyun 
Sent: Wednesday, November 20, 2019 11:07:47 AM
To: Sean Owen 
Cc: dev 
Subject: Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Thanks. That will be a giant step forward, Sean!

> I'd prefer making it the default in the POM for 3.0.

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 11:02 AM Sean Owen <sro...@gmail.com> wrote:
Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
same old and buggy that's been there a while. "stable" in that sense
I'm sure there is a lot more delta between Hive 1 and 2 in terms of
bug fixes that are important; the question isn't just 1.x releases.

What I don't know is how much affects Spark, as it's a Hive client
mostly. Clearly some do.

I'd prefer making it the default in the POM for 3.0. Mostly on the
grounds that its effects are on deployed clusters, not apps. And
deployers can still choose a binary distro with 1.x or make the choice
they want. Those that don't care should probably be nudged to 2.x.
Spark 3.x is already full of behavior changes and 'unstable', so I
think this is minor relative to the overall risk question.

On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
> Hi, All.
>
> I'm sending this email because it's important to discuss this topic narrowly
> and make a clear conclusion.
>
> `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is
> stabler than XXX, please give us the evidence. Then, we can fix it.
> Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
>
> Historically, the following forked Hive 1.2.1 has never been stable.
> It's just frozen. Since the forked Hive is out of our control, we ignored 
> bugs.
> That's all. The reality is a way far from the stable status.
>
> https://mvnrepository.com/artifact/org.spark-project.hive/
> 
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
>  (2015 August)
> 
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
>  (2016 April)
>
> First, let's begin Hive itself by comparing with Apache Hive 1.2.2 and 1.2.3,
>
> Apache Hive 1.2.2 has 50 bug fixes.
> Apache Hive 1.2.3 has 9 bug fixes.
>
> I will not cover all of them, but Apache Hive community also backports
> important patches like Apache Spark community.
>
> Second, let's move to SPARK issues because we aren't exposed to all Hive 
> issues.
>
> SPARK-19109 ORC metadata section can sometimes exceed protobuf message 
> size limit
> SPARK-22267 Spark SQL incorrectly reads ORC file when column order is 
> different
>
> These were reported since Apache Spark 1.6.x because the forked Hive doesn't 
> have
> a proper upstream patch like HIVE-11592 (fixed at Apache Hive 1.3.0).
>
> Since we couldn't update the frozen forked Hive, we added Apache ORC 
> dependency
> at SPARK-20682 (2.3.0), added a switching configuration at SPARK-20728 
> (2.3.0),
> tured on `spark.sql.hive.convertMetastoreOrc by default` at SPARK-22279 
> (2.4.0).
> However, if you turn off the switch and start to use the forked hive,
> you will be exposed to the buggy forked Hive 1.2.1 again.
>
> Third, let's talk about the new features like Hadoop 3 and JDK11.
> No one believe that the ancient forked Hive 1.2.1 will work with this.
> I saw that the following issue is mentioned as an evidence of Hive 2.3.6 bug.
>
> SPARK-29245 ClassCastException during creating HiveMetaStoreClient
>
> Yes. I know that issue because I reported it and verified HIVE-21508.
> It's fixed already and will be released ad Apache Hive 2.3.7.
>
> Can we imagine something like this in the forked Hive 1.2.1?
> 'No'. There is no future on it. It's frozen.
>
> From now, I want to claim that the forked Hive 1.2.1 is the unstable one.
> I welcome all your positive and negative opinions.
> Please share your concerns and problems and fix them together.
> Apache Spark is an open source project we shared.
>
> Bests,
> Dongjoon.
>


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Thanks. That will be a giant step forward, Sean!

> I'd prefer making it the default in the POM for 3.0.

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 11:02 AM Sean Owen  wrote:

> Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
> same old and buggy that's been there a while. "stable" in that sense
> I'm sure there is a lot more delta between Hive 1 and 2 in terms of
> bug fixes that are important; the question isn't just 1.x releases.
>
> What I don't know is how much affects Spark, as it's a Hive client
> mostly. Clearly some do.
>
> I'd prefer making it the default in the POM for 3.0. Mostly on the
> grounds that its effects are on deployed clusters, not apps. And
> deployers can still choose a binary distro with 1.x or make the choice
> they want. Those that don't care should probably be nudged to 2.x.
> Spark 3.x is already full of behavior changes and 'unstable', so I
> think this is minor relative to the overall risk question.
>
> On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > I'm sending this email because it's important to discuss this topic
> narrowly
> > and make a clear conclusion.
> >
> > `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> > by ignoring the existing bugs. If you want to say the forked Hive 1.2.1
> is
> > stabler than XXX, please give us the evidence. Then, we can fix it.
> > Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
> >
> > Historically, the following forked Hive 1.2.1 has never been stable.
> > It's just frozen. Since the forked Hive is out of our control, we
> ignored bugs.
> > That's all. The reality is a way far from the stable status.
> >
> > https://mvnrepository.com/artifact/org.spark-project.hive/
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
> (2015 August)
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
> (2016 April)
> >
> > First, let's begin Hive itself by comparing with Apache Hive 1.2.2 and
> 1.2.3,
> >
> > Apache Hive 1.2.2 has 50 bug fixes.
> > Apache Hive 1.2.3 has 9 bug fixes.
> >
> > I will not cover all of them, but Apache Hive community also backports
> > important patches like Apache Spark community.
> >
> > Second, let's move to SPARK issues because we aren't exposed to all Hive
> issues.
> >
> > SPARK-19109 ORC metadata section can sometimes exceed protobuf
> message size limit
> > SPARK-22267 Spark SQL incorrectly reads ORC file when column order
> is different
> >
> > These were reported since Apache Spark 1.6.x because the forked Hive
> doesn't have
> > a proper upstream patch like HIVE-11592 (fixed at Apache Hive 1.3.0).
> >
> > Since we couldn't update the frozen forked Hive, we added Apache ORC
> dependency
> > at SPARK-20682 (2.3.0), added a switching configuration at SPARK-20728
> (2.3.0),
> > tured on `spark.sql.hive.convertMetastoreOrc by default` at SPARK-22279
> (2.4.0).
> > However, if you turn off the switch and start to use the forked hive,
> > you will be exposed to the buggy forked Hive 1.2.1 again.
> >
> > Third, let's talk about the new features like Hadoop 3 and JDK11.
> > No one believe that the ancient forked Hive 1.2.1 will work with this.
> > I saw that the following issue is mentioned as an evidence of Hive 2.3.6
> bug.
> >
> > SPARK-29245 ClassCastException during creating HiveMetaStoreClient
> >
> > Yes. I know that issue because I reported it and verified HIVE-21508.
> > It's already fixed and will be released as Apache Hive 2.3.7.
> >
> > Can we imagine something like this happening in the forked Hive 1.2.1?
> > 'No'. There is no future for it. It's frozen.
> >
> > From now on, I want to claim that the forked Hive 1.2.1 is the unstable
> > one. I welcome all your positive and negative opinions.
> > Please share your concerns and problems so we can fix them together.
> > Apache Spark is an open source project we share.
> >
> > Bests,
> > Dongjoon.
> >
>


The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Hi, All.

I'm sending this email because it's important to discuss this topic narrowly
and reach a clear conclusion.

`The forked Hive 1.2.1 is stable`? It sounds like a myth we created
by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is
stabler than XXX, please give us the evidence. Then, we can fix it.
Otherwise, let's stop treating `the forked Hive 1.2.1` as if it were invincible.

Historically, the following forked Hive 1.2.1 has never been stable.
It's just frozen. Since the forked Hive is out of our control, we ignored
its bugs.
That's all. The reality is far from stable.

https://mvnrepository.com/artifact/org.spark-project.hive/

https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
(2015 August)

https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
(2016 April)

First, let's start with Hive itself by comparing it against Apache Hive 1.2.2
and 1.2.3:

Apache Hive 1.2.2 has 50 bug fixes.
Apache Hive 1.2.3 has 9 bug fixes.

I will not cover all of them, but the Apache Hive community also backports
important patches, just like the Apache Spark community does.

Second, let's move to SPARK issues because we aren't exposed to all Hive
issues.

SPARK-19109 ORC metadata section can sometimes exceed protobuf message
size limit
SPARK-22267 Spark SQL incorrectly reads ORC file when column order is
different

These have been reported since Apache Spark 1.6.x because the forked Hive
doesn't have the proper upstream patches, such as HIVE-11592 (fixed in
Apache Hive 1.3.0).

Since we couldn't update the frozen forked Hive, we added an Apache ORC
dependency in SPARK-20682 (2.3.0), added a switching configuration in
SPARK-20728 (2.3.0), and turned on `spark.sql.hive.convertMetastoreOrc`
by default in SPARK-22279 (2.4.0).
However, if you turn off that switch and start to use the forked Hive,
you will be exposed to the buggy forked Hive 1.2.1 again.
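
As a concrete illustration of that switch (a minimal sketch; the table name is
hypothetical, and it assumes a Hive-enabled SparkSession on 2.4+, where the
config defaults to true):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-reader-switch")
      .enableHiveSupport()
      .getOrCreate()

    // Default since 2.4.0: Hive ORC tables are read by the native Apache ORC reader.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
    spark.table("some_hive_orc_table").show()   // hypothetical table name

    // Turning the switch off falls back to the forked Hive 1.2.1 ORC code path,
    // where issues like SPARK-19109 and SPARK-22267 were reported.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
    spark.table("some_hive_orc_table").show()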

Third, let's talk about new features like Hadoop 3 and JDK 11.
No one believes that the ancient forked Hive 1.2.1 will work with these.
I saw that the following issue was mentioned as evidence of a Hive 2.3.6
bug.

SPARK-29245 ClassCastException during creating HiveMetaStoreClient

Yes. I know that issue because I reported it and verified HIVE-21508.
It's already fixed and will be released as Apache Hive 2.3.7.

Can we imagine something like this happening in the forked Hive 1.2.1?
'No'. There is no future for it. It's frozen.

From now on, I want to claim that the forked Hive 1.2.1 is the unstable one.
I welcome all your positive and negative opinions.
Please share your concerns and problems so we can fix them together.
Apache Spark is an open source project we share.

Bests,
Dongjoon.


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Dongjoon Hyun
Yes. Right. That's the situation we are hitting and the result I expected.
We need to change our default to Hive 2 in the POM.

Dongjoon.


On Wed, Nov 20, 2019 at 5:20 AM Sean Owen  wrote:

> Yes, good point. A user would get whatever the POM says without
> profiles enabled so it matters.
>
> Playing it out, an app _should_ compile with the Spark dependency
> marked 'provided'. In that case the app that is spark-submit-ted is
> agnostic to the Hive dependency, because the only one that matters is
> what's on the cluster. Right? We don't leak the Hive API through the
> Spark API. And yes, it's then up to the cluster to provide whatever
> version it wants. Vendors will have made a specific version choice when
> building their distro one way or the other.
>
> If you run a Spark cluster yourself, you're using the binary distro,
> and we're already talking about also publishing a binary distro with
> this variation, so that's not the issue.
>
> The corner cases where it might matter are:
>
> - I unintentionally package Spark in the app and by default pull in
> Hive 2 when I will deploy against Hive 1. But that's user error, and it
> causes other problems anyway.
> - I run tests locally in my project, which will pull in a default
> version of Hive defined by the POM
>
> Double-checking, is that right? If so, it kind of implies it doesn't
> matter, which is an argument either way about what the default should be.
> I too would then prefer defaulting to Hive 2 in the POM. Am I missing
> something about the implication?
>
> (That fork will stay published forever anyway, that's not an issue per se.)
>
> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun 
> wrote:
> > Sean, our published POM is pointing to and advertising the illegitimate
> > Hive 1.2 fork as a compile dependency.
> > Yes. It can be overridden. So, why does Apache Spark need to publish
> > like that?
> > If someone wants to use that illegitimate Hive 1.2 fork, let them
> > override it. We are unable to delete that illegitimate Hive 1.2 fork.
> > Those artifacts will be orphans.
> >
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Mridul Muralidharan
Just for completeness' sake, Spark is not version-neutral with respect to
Hadoop; particularly in YARN mode, there is a minimum version requirement
(though a fairly generous one, I believe).

I agree with Steve; it is a long-standing pain that we are bundling a
positively ancient version of Hive.
Having said that, we should decouple the Hive artifact question from
the Hadoop version question, though they might be related currently.

Regards,
Mridul

On Tue, Nov 19, 2019 at 2:40 PM Cheng Lian  wrote:
>
> Hey Steve,
>
> In terms of Maven artifacts, I don't think the default Hadoop version matters
> except for the spark-hadoop-cloud module, which is only meaningful under the
> hadoop-3.2 profile. All the other spark-* artifacts published to Maven
> Central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark 
> distribution. Right now, we only publish PySpark artifacts prebuilt with 
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to 3.2 
> is feasible for PySpark users. Or maybe we should publish PySpark prebuilt 
> with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as the Hive 2.3 and Hadoop 3.2 upgrades can be decoupled via
> the proposed hive-2.3 profile, I personally don't have a preference between
> Hadoop 2.7 and 3.2 as the default Hadoop version. But just to minimize the
> release management work, in case we decide to publish the other spark-* Maven
> artifacts from a Hadoop 2.7 build, we can still special-case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun  wrote:
>>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss the Hive issue,
>>
>> because this thread was originally about the `hadoop` version.
>>
>> And now we can have a `hive-2.3` profile for both the `hadoop-2.7` and
>> `hadoop-3.0` versions.
>>
>> We don't need to mix the two.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung  
>> wrote:
>>>
>>> 1000% with Steve: the org.spark-project Hive 1.2 will need a solution. It
>>> is old and rather buggy, and it’s been *years*.
>>>
>>> I think we should decouple the Hive change from everything else if people
>>> are concerned?
>>>
>>> 
>>> From: Steve Loughran 
>>> Sent: Sunday, November 17, 2019 9:22:09 AM
>>> To: Cheng Lian 
>>> Cc: Sean Owen ; Wenchen Fan ; 
>>> Dongjoon Hyun ; dev ; Yuming 
>>> Wang 
>>> Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of Hive which
>>> Spark has historically bundled (the org.spark-project one) is an orphan
>>> project, put together to deal with Hive's shading issues, and a source of
>>> unhappiness in the Hive project. Whatever gets shipped should do its best
>>> to avoid including that artifact.
>>>
>>> Postponing a switch to Hadoop 3.x until after Spark 3.0 is probably the
>>> safest move from a risk-minimisation perspective. If something has broken,
>>> then you can start with the assumption that it is in the o.a.s packages
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>>> there are problems with the Hadoop/Hive dependencies, those teams will
>>> inevitably ignore the filed bug reports, for the same reason the Spark team
>>> will probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>>> in mind. It's not been tested, it has dependencies on artifacts we know are
>>> incompatible, and as far as the Hadoop project is concerned, people should
>>> move to branch-3 if they want to run on a modern version of Java.
>>>
>>> It would be really, really good if the published Spark Maven artefacts (a)
>>> included the spark-hadoop-cloud JAR and (b) depended on Hadoop 3.x. That
>>> way, people doing things in their own projects will get up-to-date
>>> dependencies and won't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: There is a discussion on hadoop-dev about making Hadoop 2.10 the
>>> official "last ever" branch-2 release and then declaring its predecessors
>>> EOL; 2.10 will be the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian  wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I 
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which 
>>> seemed risky, and therefore we only introduced Hive 2.3 under the 
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong 
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that 
>>> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>> about demand, but risk control: coupling the Hive 2.3, Hadoop 3.2, and
>>> JDK 11 upgrades together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>>>
>>> I'd 

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread Ben Sidhom
That sounds great!

On Wed, Nov 20, 2019 at 9:02 AM John Zhuge  wrote:

> That will be great. Please send us the invite.
>
> On Wed, Nov 20, 2019 at 8:56 AM bo yang  wrote:
>
>> Cool, thanks Ryan, John, Amogh for the reply! Great to see you
>> interested! Felix will have a Spark Scalability & Reliability Sync
>> meeting on Dec 4 1pm PST. We could discuss more details there. Do you want
>> to join?
>>
>> On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor  wrote:
>>
>>> We at Qubole are also looking at disaggregating shuffle on Spark. Would
>>> love to collaborate and share learnings.
>>>
>>> Regards,
>>> Amogh
>>>
>>> On Tue, Nov 19, 2019 at 4:09 PM John Zhuge  wrote:
>>>
 Great work, Bo! Would love to hear the details.


 On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue 
 wrote:

> I'm interested in remote shuffle services as well. I'd love to hear
> about what you're using in production!
>
> rb
>
> On Tue, Nov 19, 2019 at 2:43 PM bo yang  wrote:
>
>> Hi Ben,
>>
>> Thanks for the write-up! This is Bo from Uber. I am on Felix's team
>> in Seattle, and working on disaggregated shuffle (we called it remote
>> shuffle service, RSS, internally). We have put RSS into production for a
>> while, and learned a lot during the work (tried quite a few techniques to
>> improve the remote shuffle performance). We could share our learning with
>> the community, and also would like to hear feedback/suggestions on how to
>> further improve remote shuffle performance. We could chat more details if
>> you or other people are interested.
>>
>> Best,
>> Bo
>>
>> On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
>> wrote:
>>
>>> I would like to start a conversation about extending the Spark
>>> shuffle manager surface to support fully disaggregated shuffle
>>> implementations. This is closely related to the work in SPARK-25299
>>> , which is
>>> focused on refactoring the shuffle manager API (and in particular,
>>> SortShuffleManager) to use a pluggable storage backend. The motivation 
>>> for
>>> that SPIP is further enabling Spark on Kubernetes.
>>>
>>>
>>> The motivation for this proposal is enabling full externalized
>>> (disaggregated) shuffle service implementations. (Facebook’s Cosco
>>> shuffle
>>> 
>>> is one example of such a disaggregated shuffle service.) These changes
>>> allow the bulk of the shuffle to run in a remote service such that 
>>> minimal
>>> state resides in executors and local disk spill is minimized. The net
>>> effect is increased job stability and performance improvements in 
>>> certain
>>> scenarios. These changes should work well with or are complementary to
>>> SPARK-25299. Some or all points may be merged into that issue as
>>> appropriate.
>>>
>>>
>>> Below is a description of each component of this proposal. These
>>> changes can ideally be introduced incrementally. I would like to gather
>>> feedback and gauge interest from others in the community to collaborate 
>>> on
>>> this. There are likely more points that would be useful to disaggregated
>>> shuffle services. We can outline a more concrete plan after gathering
>>> enough input. A working session could help us kick off this joint effort;
>>> maybe something in the mid-January to mid-February timeframe (depending on
>>> interest and availability). I’m happy to host at our Sunnyvale, CA offices.
>>>
>>>
>>> Proposal
>>>
>>> Scheduling and re-executing tasks
>>>
>>> Allow coordination between the service and the Spark DAG scheduler
>>> as to whether a given block/partition needs to be recomputed when a task
>>> fails or when shuffle block data cannot be read. Having such 
>>> coordination
>>> is important, e.g., for suppressing recomputation after aborted 
>>> executors
>>> or for forcing late recomputation if the service internally acts as a
>>> cache. One catchall solution is to have the shuffle manager provide an
>>> indication of whether shuffle data is external to executors (or nodes).
>>> Another option: allow the shuffle manager (likely on the driver) to be
>>> queried for the existence of shuffle data for a given executor ID (or
>>> perhaps map task, reduce task, etc). Note that this is at the level of 
>>> data
>>> the scheduler is aware of (i.e., map/reduce partitions) rather than 
>>> block
>>> IDs, which are internal details for some shuffle managers.
>>> ShuffleManager API
>>>
>>> Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that
>>> the service knows that data is still active. This is one way to enable
>>> 

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread John Zhuge
That will be great. Please send us the invite.

On Wed, Nov 20, 2019 at 8:56 AM bo yang  wrote:

> Cool, thanks Ryan, John, Amogh for the reply! Great to see you interested!
> Felix will have a Spark Scalability & Reliability Sync meeting on Dec 4 1pm
> PST. We could discuss more details there. Do you want to join?
>
> On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor  wrote:
>
>> We at Qubole are also looking at disaggregating shuffle on Spark. Would
>> love to collaborate and share learnings.
>>
>> Regards,
>> Amogh
>>
>> On Tue, Nov 19, 2019 at 4:09 PM John Zhuge  wrote:
>>
>>> Great work, Bo! Would love to hear the details.
>>>
>>>
>>> On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue 
>>> wrote:
>>>
 I'm interested in remote shuffle services as well. I'd love to hear
 about what you're using in production!

 rb

 On Tue, Nov 19, 2019 at 2:43 PM bo yang  wrote:

> Hi Ben,
>
> Thanks for the write-up! This is Bo from Uber. I am on Felix's team
> in Seattle, and working on disaggregated shuffle (we called it remote
> shuffle service, RSS, internally). We have put RSS into production for a
> while, and learned a lot during the work (tried quite a few techniques to
> improve the remote shuffle performance). We could share our learning with
> the community, and also would like to hear feedback/suggestions on how to
> further improve remote shuffle performance. We could chat more details if
> you or other people are interested.
>
> Best,
> Bo
>
> On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
> wrote:
>
>> I would like to start a conversation about extending the Spark
>> shuffle manager surface to support fully disaggregated shuffle
>> implementations. This is closely related to the work in SPARK-25299
>> , which is
>> focused on refactoring the shuffle manager API (and in particular,
>> SortShuffleManager) to use a pluggable storage backend. The motivation 
>> for
>> that SPIP is further enabling Spark on Kubernetes.
>>
>>
>> The motivation for this proposal is enabling full externalized
>> (disaggregated) shuffle service implementations. (Facebook’s Cosco
>> shuffle
>> 
>> is one example of such a disaggregated shuffle service.) These changes
>> allow the bulk of the shuffle to run in a remote service such that 
>> minimal
>> state resides in executors and local disk spill is minimized. The net
>> effect is increased job stability and performance improvements in certain
>> scenarios. These changes should work well with or are complementary to
>> SPARK-25299. Some or all points may be merged into that issue as
>> appropriate.
>>
>>
>> Below is a description of each component of this proposal. These
>> changes can ideally be introduced incrementally. I would like to gather
>> feedback and gauge interest from others in the community to collaborate 
>> on
>> this. There are likely more points that would be useful to disaggregated
>> shuffle services. We can outline a more concrete plan after gathering
>> enough input. A working session could help us kick off this joint effort;
>> maybe something in the mid-January to mid-February timeframe (depending on
>> interest and availability). I’m happy to host at our Sunnyvale, CA offices.
>>
>>
>> Proposal
>>
>> Scheduling and re-executing tasks
>>
>> Allow coordination between the service and the Spark DAG scheduler as
>> to whether a given block/partition needs to be recomputed when a task 
>> fails
>> or when shuffle block data cannot be read. Having such coordination is
>> important, e.g., for suppressing recomputation after aborted executors or
>> for forcing late recomputation if the service internally acts as a cache.
>> One catchall solution is to have the shuffle manager provide an 
>> indication
>> of whether shuffle data is external to executors (or nodes). Another
>> option: allow the shuffle manager (likely on the driver) to be queried 
>> for
>> the existence of shuffle data for a given executor ID (or perhaps map 
>> task,
>> reduce task, etc). Note that this is at the level of data the scheduler 
>> is
>> aware of (i.e., map/reduce partitions) rather than block IDs, which are
>> internal details for some shuffle managers.
>> ShuffleManager API
>>
>> Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that
>> the service knows that data is still active. This is one way to enable
>> time-/job-scoped data because a disaggregated shuffle service cannot rely
>> on robust communication with Spark and in general has a distinct 
>> lifecycle
>> from the Spark deployment(s) it talks 

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread bo yang
Cool, thanks Ryan, John, Amogh for the reply! Great to see you interested!
Felix will have a Spark Scalability & Reliability Sync meeting on Dec 4 1pm
PST. We could discuss more details there. Do you want to join?

On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor  wrote:

> We at Qubole are also looking at disaggregating shuffle on Spark. Would
> love to collaborate and share learnings.
>
> Regards,
> Amogh
>
> On Tue, Nov 19, 2019 at 4:09 PM John Zhuge  wrote:
>
>> Great work, Bo! Would love to hear the details.
>>
>>
>> On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue 
>> wrote:
>>
>>> I'm interested in remote shuffle services as well. I'd love to hear
>>> about what you're using in production!
>>>
>>> rb
>>>
>>> On Tue, Nov 19, 2019 at 2:43 PM bo yang  wrote:
>>>
 Hi Ben,

 Thanks for the write-up! This is Bo from Uber. I am on Felix's team
 in Seattle, and working on disaggregated shuffle (we called it remote
 shuffle service, RSS, internally). We have put RSS into production for a
 while, and learned a lot during the work (tried quite a few techniques to
 improve the remote shuffle performance). We could share our learning with
 the community, and also would like to hear feedback/suggestions on how to
 further improve remote shuffle performance. We could chat more details if
 you or other people are interested.

 Best,
 Bo

 On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
 wrote:

> I would like to start a conversation about extending the Spark shuffle
> manager surface to support fully disaggregated shuffle implementations.
> This is closely related to the work in SPARK-25299
> , which is focused
> on refactoring the shuffle manager API (and in particular,
> SortShuffleManager) to use a pluggable storage backend. The motivation for
> that SPIP is further enabling Spark on Kubernetes.
>
>
> The motivation for this proposal is enabling full externalized
> (disaggregated) shuffle service implementations. (Facebook’s Cosco
> shuffle
> 
> is one example of such a disaggregated shuffle service.) These changes
> allow the bulk of the shuffle to run in a remote service such that minimal
> state resides in executors and local disk spill is minimized. The net
> effect is increased job stability and performance improvements in certain
> scenarios. These changes should work well with or are complementary to
> SPARK-25299. Some or all points may be merged into that issue as
> appropriate.
>
>
> Below is a description of each component of this proposal. These
> changes can ideally be introduced incrementally. I would like to gather
> feedback and gauge interest from others in the community to collaborate on
> this. There are likely more points that would be useful to disaggregated
> shuffle services. We can outline a more concrete plan after gathering
> enough input. A working session could help us kick off this joint effort;
> maybe something in the mid-January to mid-February timeframe (depending on
> interest and availability). I’m happy to host at our Sunnyvale, CA offices.
>
>
> Proposal
>
> Scheduling and re-executing tasks
>
> Allow coordination between the service and the Spark DAG scheduler as
> to whether a given block/partition needs to be recomputed when a task 
> fails
> or when shuffle block data cannot be read. Having such coordination is
> important, e.g., for suppressing recomputation after aborted executors or
> for forcing late recomputation if the service internally acts as a cache.
> One catchall solution is to have the shuffle manager provide an indication
> of whether shuffle data is external to executors (or nodes). Another
> option: allow the shuffle manager (likely on the driver) to be queried for
> the existence of shuffle data for a given executor ID (or perhaps map 
> task,
> reduce task, etc). Note that this is at the level of data the scheduler is
> aware of (i.e., map/reduce partitions) rather than block IDs, which are
> internal details for some shuffle managers.
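
To make the second option above concrete, here is a rough sketch of what a
driver-side query could look like. This is purely illustrative: the trait and
method names below are invented for illustration and do not exist in the
current ShuffleManager API.

    // Hypothetical driver-side hook the DAG scheduler could consult before
    // deciding whether a lost executor forces recomputation of map output.
    trait DisaggregatedShuffleSupport {
      // True if shuffle data lives outside executors/nodes (e.g., in a remote service).
      def shuffleDataIsExternal: Boolean

      // Does shuffle output written by the given executor still exist in the service?
      def shuffleOutputAvailable(shuffleId: Int, execId: String): Boolean
    }

    // The scheduler could then skip re-running map tasks after an executor loss:
    def needsRecompute(support: DisaggregatedShuffleSupport,
                       shuffleId: Int,
                       lostExecId: String): Boolean =
      !(support.shuffleDataIsExternal && support.shuffleOutputAvailable(shuffleId, lostExecId))
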
> ShuffleManager API
>
> Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that
> the service knows that data is still active. This is one way to enable
> time-/job-scoped data because a disaggregated shuffle service cannot rely
> on robust communication with Spark and in general has a distinct lifecycle
> from the Spark deployment(s) it talks to. This would likely take the form
> of a callback on ShuffleManager itself, but there are other approaches.
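
A keep-alive hook of that shape might look roughly like the following. Again,
this is only a sketch; the trait and method names are hypothetical.

    // Hypothetical heartbeat so an external shuffle service can expire data
    // whose owning Spark application has gone away.
    trait ShuffleHeartbeat {
      // Called periodically (e.g., from the driver) for each live shuffle.
      def reportActive(appId: String, shuffleIds: Set[Int]): Unit

      // Data not reported within this window may be reclaimed by the service.
      def leaseDuration: java.time.Duration
    }
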
>
>
> Add lifecycle hooks to shuffle readers and writers (e.g., to
> close/recycle connections/streams/file handles as well as provide commit

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Sean Owen
Yes, good point. A user would get whatever the POM says without
profiles enabled so it matters.

Playing it out, an app _should_ compile with the Spark dependency
marked 'provided'. In that case the app that is spark-submit-ted is
agnostic to the Hive dependency, because the only one that matters is
what's on the cluster. Right? We don't leak the Hive API through the
Spark API. And yes, it's then up to the cluster to provide whatever
version it wants. Vendors will have made a specific version choice when
building their distro one way or the other.
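
For reference, that 'provided' setup looks roughly like this in an sbt build
(a sketch only; the project name, Scala version, and Spark version are
assumptions, not recommendations):

    // build.sbt: the app compiles against Spark but does not bundle it, so the
    // Hive client version is whatever the cluster's Spark distribution ships.
    name := "my-spark-app"                       // hypothetical project name
    scalaVersion := "2.12.10"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"  % "3.0.0" % "provided",
      "org.apache.spark" %% "spark-hive" % "3.0.0" % "provided"
    )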

If you run a Spark cluster yourself, you're using the binary distro,
and we're already talking about also publishing a binary distro with
this variation, so that's not the issue.

The corner cases where it might matter are:

- I unintentionally package Spark in the app and by default pull in
Hive 2 when I will deploy against Hive 1. But that's user error, and it
causes other problems anyway.
- I run tests locally in my project, which will pull in a default
version of Hive defined by the POM

Double-checking, is that right? If so, it kind of implies it doesn't
matter, which is an argument either way about what the default should be.
I too would then prefer defaulting to Hive 2 in the POM. Am I missing
something about the implication?

(That fork will stay published forever anyway, that's not an issue per se.)

On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun  wrote:
> Sean, our published POM is pointing to and advertising the illegitimate Hive
> 1.2 fork as a compile dependency.
> Yes. It can be overridden. So, why does Apache Spark need to publish like
> that?
> If someone wants to use that illegitimate Hive 1.2 fork, let them override it.
> We are unable to delete that illegitimate Hive 1.2 fork.
> Those artifacts will be orphans.
>
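
For anyone who does need to pin the Hive client themselves, the override
Dongjoon mentions could look roughly like this in an sbt build (the
coordinates and version below are illustrative assumptions):

    // build.sbt: force specific Hive client artifacts instead of whatever the
    // published POM advertises as the default.
    dependencyOverrides ++= Seq(
      "org.apache.hive" % "hive-exec"      % "2.3.6",
      "org.apache.hive" % "hive-metastore" % "2.3.6"
    )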

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org