Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-16 Thread Cheng Su
I am +1 on taking a look at and participating in the continuous shuffle work 
while push-based shuffle is being added. To be honest, I feel it might be hard 
to get a hard commitment from people on this, as it depends on the progress of 
another SPIP, and the discussion/work may not start until several months from now.

Thanks,
Cheng Su

From: Jungtaek Lim 
Date: Tuesday, September 15, 2020 at 5:04 PM
To: Joseph Torres 
Cc: Sean Owen , dev 
Subject: Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

Yeah I realized there's a proposal for push-based shuffle, and I agree that may 
unblock the architectural issue on true-streaming. (The root concern of the 
continuous mode has been that it doesn't fit with the architecture of Spark, 
and probably push-based shuffle could persuade me.)

I guess push-based shuffle is not the only blocker to making continuous mode 
stateful (all of the assumptions of microbatch are broken in this mode, like 
global watermark, distributed checkpointing without stopping every task, etc.), 
but even just repartitioning (probably easier to achieve) would still be a good 
improvement for continuous mode. If someone is promising to look into that 
improvement after push-based shuffle lands, I agree that is a good reason to 
keep continuous mode in place.

On Tue, Sep 15, 2020 at 11:02 PM Joseph Torres <joseph.tor...@databricks.com> wrote:
It's worth noting that the push-based shuffle SPIP currently in progress 
addresses a substantial blocker in the area. If you remember when we removed 
the half-finished stateful query support, the lack of that functionality and 
the challenge of implementing it is basically why it was half-finished. I can't 
make a hard commitment, but I do plan to take a look at how easy it would be to 
build continuous shuffle support on top of the SPIP once it's in, and 
continuous mode is gonna be a lot more useful if most (all?) queries can run 
using it.

On Tue, Sep 15, 2020 at 6:37 AM Sean Owen <sro...@gmail.com> wrote:
I think we certainly can't remove it without deprecation and a few
releases. If there were big problems with it that weren't getting
fixed, sure maybe, but lack of interest in reviewing minor changes
isn't necessarily a bad sign. By the same logic you'd have deleted GraphX
long ago.

Anecdotally, yes there are people using it that I know of at least,
but I wouldn't know a lot of them.
I think the question is, is it causing a problem, like a lot of
maintenance? doesn't sound like it.

On Tue, Sep 15, 2020 at 8:19 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
> Probably it would depend on the meaning of "experimental". My understanding 
> of "experimental" is closer to "incubation": it may eventually graduate, or 
> it may be retired.
>
> To be clear, I'm evaluating the continuous mode as a "candidate to retire", 
> unless there are actual use cases in production and at least a couple of 
> community members volunteer to maintain it. Judging from the activity over the 
> past year, community members show no interest in the continuous mode. I can 
> point to at least three PRs which struggled to find reviewers (for around 1 
> year) and were closed due to inactivity. There have been no improvements or 
> bug fixes except trivial ones. It doesn't seem to be getting traction: few 
> questions on SO, and a few posts in Google search results, all of which were 
> posted around the time continuous mode was introduced. Though I would be 
> convinced if someone could provide meaningful numbers of actual use cases.
>
> If the choice really has to be between making it un-experimental or not (which 
> means retirement is not an option), I'd rather vote to leave it as 
> experimental, so I can just keep forgetting about it. Actually it is sometimes 
> a bother even when a change is made only on the micro-batch side (so it's not 
> zero cost to maintain), but that's still better than officially supporting it.
>
>
> On Tue, Sep 15, 2020 at 9:08 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> If you're suggesting making it un-Experimental, probably yes, as it is
>> de facto not going to change much I expect.
>> If you're saying remove it, probably not? I don't see that it's
>> anywhere near deprecated, and not sure it's unmaintained - obviously
>> tests etc still have to keep passing.
>>
>> On Mon, Sep 14, 2020 at 11:34 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> >
>> > Hi devs,
>> >
>> > It was Spark 2.3 in Feb 2018 which introduced continuous mode in 
>> > Structured Streaming as "experimental".
>> >
>> > Now we are here 2.5 years after its release - I feel it would be a good 
>> > time to evaluate the mode: whether it has been widely used or not, and 
>> > whether it has been making progress, as the mode is "experimental".
>> >
>> > At least on the surface I don't see any active effort on continuous 
>> > mode around the community - the last major effort was stateful operation, 
>> > which was incomplete, and I removed it. There were a couple of bug 
>> > reports as well as fixe

Spark-Locality: Hinting Spark location of the executor does not take effect

2020-09-16 Thread Priyanka Gomatam
Sending on behalf of a colleague whose mail isn’t reaching the dev list for 
some reason 😊

===

Hi Spark developers,

I want to hint Spark to use a particular list of hosts to execute tasks on. I 
see that getBlockLocations is used to get the list of hosts from HDFS.

https://github.com/apache/spark/blob/7955b3962ac46b89564e0613db7bea98a1478bf2/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L386


Hinting Spark with a custom getBlockLocation that returns an array of 
BlockLocations with host IP addresses doesn't help; Spark continues to run the 
tasks on other executors' hosts.

Is there something I am doing wrong ?

Test:
Spark.read.csv()


Appreciate your inputs 😊

Thanks,
Nasrulla


