Re: [DISCUSS] Time to evaluate "continuous mode" in SS?
I am +1 to take a look and participate in continuous shuffle work while push-based shuffle is being added. To be honest, I feel it might be hard to get people's hard commitment on this, as it depends on the progress of another SPIP, and the timeline for discussion/work can be several months away.

Thanks,
Cheng Su

From: Jungtaek Lim
Date: Tuesday, September 15, 2020 at 5:04 PM
To: Joseph Torres
Cc: Sean Owen, dev
Subject: Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

> Yeah I realized there's a proposal for push-based shuffle, and I agree that
> may unblock the architectural issue on true-streaming. [...]
Re: [DISCUSS] Time to evaluate "continuous mode" in SS?
Yeah, I realized there's a proposal for push-based shuffle, and I agree that it may unblock the architectural issue on true streaming. (The root concern with continuous mode has been that it doesn't fit the architecture of Spark, and push-based shuffle could probably persuade me.)

I guess push-based shuffle is not the only blocker to making continuous mode stateful (all of the assumptions of micro-batch are broken in this mode, like global watermark, distributed checkpoint without stopping every task, etc.), but even just repartitioning (probably easier to achieve) would still be a good improvement for continuous mode.

If someone is promising to look into the improvement after the push-based shuffle, I agree that is a good reason to keep continuous mode in place.

On Tue, Sep 15, 2020 at 11:02 PM Joseph Torres wrote:
> It's worth noting that the push-based shuffle SPIP currently in progress
> addresses a substantial blocker in the area. [...]
Re: [DISCUSS] Time to evaluate "continuous mode" in SS?
Hi Joseph,

I would be interested in discussing your thoughts on how push-based shuffle could help with continuous mode in SS. We have discussed the applicability of push-based shuffle to streaming engines, especially for continuous operation mode, internally at LinkedIn with our Samza peers as well as with the Alibaba Flink team, and would like to know how you think it can apply to SS continuous mode.

-
Min Shen
Staff Software Engineer
LinkedIn

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [DISCUSS] Time to evaluate "continuous mode" in SS?
It's worth noting that the push-based shuffle SPIP currently in progress addresses a substantial blocker in this area. If you remember when we removed the half-finished stateful query support, the lack of that functionality and the challenge of implementing it is basically why it was half-finished.

I can't make a hard commitment, but I do plan to take a look at how easy it would be to build continuous shuffle support on top of the SPIP once it's in, and continuous mode is going to be a lot more useful if most (all?) queries can run using it.

On Tue, Sep 15, 2020 at 6:37 AM Sean Owen wrote:
> I think we certainly can't remove it without deprecation and a few
> releases. [...]
Re: [DISCUSS] Time to evaluate "continuous mode" in SS?
I think we certainly can't remove it without deprecation and a few releases. If there were big problems with it that weren't getting fixed, sure, maybe, but lack of interest in reviewing minor changes isn't necessarily a bad sign. By the same logic you'd have deleted GraphX long ago.

Anecdotally, yes, there are people using it that I know of at least, but I wouldn't know a lot of them. I think the question is: is it causing a problem, like a lot of maintenance? It doesn't sound like it.

On Tue, Sep 15, 2020 at 8:19 AM Jungtaek Lim wrote:
> Probably it would depend on the meaning of "experimental". [...]
Re: [DISCUSS] Time to evaluate "continuous mode" in SS?
Probably it would depend on the meaning of "experimental". My understanding of "experimental" is closer to "incubation": it may eventually graduate, or it may be retired.

To be clear, I'm evaluating continuous mode as a "candidate to retire", unless there are actual use cases in production and at least a couple of community members volunteer to maintain it. As far as I can see from the past year's activity, there's no interest in continuous mode among community members. I can refer to at least three PRs which struggled to find reviewers (for around 1 year) and were closed due to inactivity. There have been no improvements or bug fixes except trivial ones. It doesn't seem to be getting traction: few questions on SO, and a few posts in Google search results, all of which were posted around the date when continuous mode was introduced. Though I would be convinced if someone could provide meaningful numbers of actual use cases.

If the choice really has to be between un-experimental or not (which says retirement is not an option), I'd rather vote to leave it as experimental, so I can just keep forgetting about it. Actually it is sometimes a bother even when a change is made on the micro-batch side (so the maintenance cost is not zero), but that's still better than officially supporting it.

On Tue, Sep 15, 2020 at 9:08 PM Sean Owen wrote:
> If you're suggesting making it un-Experimental, probably yes, as it is
> de facto not going to change much I expect. [...]
Re: [DISCUSS] Time to evaluate "continuous mode" in SS?
If you're suggesting making it un-Experimental, probably yes, as it is de facto not going to change much, I expect.

If you're saying remove it, probably not? I don't see that it's anywhere near deprecated, and I'm not sure it's unmaintained - obviously tests etc. still have to keep passing.

On Mon, Sep 14, 2020 at 11:34 PM Jungtaek Lim wrote:
> Hi devs,
>
> It was Spark 2.3 in Feb 2018 which introduced continuous mode in Structured
> Streaming as "experimental". [...]
Re: [DISCUSS] Time to evaluate "continuous mode" in SS?
Hi Jungtaek,

All I see at the moment is that most users choose Flink over Spark when continuous processing is needed. Unless there is a revolution in this area, there is no point in continuing maintenance. 2.5 years is a lot in the big data industry.

If there will be efforts in this area then I'm happy to join and push this forward...

BR,
G

On Tue, Sep 15, 2020 at 6:34 AM Jungtaek Lim wrote:
> Hi devs,
>
> It was Spark 2.3 in Feb 2018 which introduced continuous mode in
> Structured Streaming as "experimental". [...]
[DISCUSS] Time to evaluate "continuous mode" in SS?
Hi devs,

It was Spark 2.3 in Feb 2018 which introduced continuous mode in Structured Streaming as "experimental".

Now we are here, 2.5 years after its release - I feel it would be a good time to evaluate the mode: whether it has been widely used or not, and whether it has been making progress, given that it is "experimental".

At least from the surface I don't see any active effort on continuous mode around the community - the last major effort was stateful operation, which was incomplete, and I removed that. There were a few bug reports as well as fixes more than a year ago, and almost nothing has been handled since. (A trivial bugfix PR has been merged recently, but that's all.) The new features introduced to Structured Streaming (at least observable metrics and the SS UI) don't apply to continuous mode, and no one made "support continuous mode" a hard requirement for passing review in these PRs.

I have no idea how many companies are using the mode in production (please speak up if someone has statistics about this), but I don't see any recent bug reports, and I see only a few questions on SO, which makes me think about the cost of maintenance.

I know there's a tendency to avoid discontinuing support where possible, but it sounds weird to keep something "unmaintained", especially when it's still "experimental" and the main authors are no longer active enough to promise maintenance/improvement of the module. Thoughts?

Thanks,
Jungtaek Lim (HeartSaVioR)