Re: [DISCUSS] Spark 3.0 and DataSourceV2
Thanks for the discussion, everyone. Since there aren't many objections to the scope and we are aligned on what this commitment would mean, I've started a vote thread for it.

rb

On Wed, Feb 27, 2019 at 5:32 PM Wenchen Fan wrote:
> I'm good with the list from Ryan, thanks!
Re: [DISCUSS] Spark 3.0 and DataSourceV2
I'm good with the list from Ryan, thanks!

On Thu, Feb 28, 2019 at 1:00 AM Ryan Blue wrote:
> I think that's a good plan. Let's get the functionality done, but mark it
> experimental pending a new row API.
>
> So is there agreement on this set of work, then?
Re: [DISCUSS] Spark 3.0 and DataSourceV2
I think that's a good plan. Let's get the functionality done, but mark it experimental pending a new row API.

So is there agreement on this set of work, then?

On Tue, Feb 26, 2019 at 6:30 PM Matei Zaharia wrote:
> To add to this, we can add a stable interface anytime if the original one
> was marked as unstable; we wouldn’t have to wait until 4.0. We had a lot of
> APIs that were experimental in 2.0 and then got stabilized in later 2.x
> releases, for example.
>
> Matei
Re: [DISCUSS] Spark 3.0 and DataSourceV2
To add to this, we can add a stable interface anytime if the original one was marked as unstable; we wouldn’t have to wait until 4.0. We had a lot of APIs that were experimental in 2.0 and then got stabilized in later 2.x releases, for example.

Matei

> On Feb 26, 2019, at 5:12 PM, Reynold Xin wrote:
>
> We will have to fix that before we declare DSv2 is stable, because
> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.
Re: [DISCUSS] Spark 3.0 and DataSourceV2
We will have to fix that before we declare DSv2 is stable, because InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.

On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah wrote:
> Will that then require an API break down the line? Do we save that for
> Spark 4?
>
> -Matt Cheah
Re: [DISCUSS] Spark 3.0 and DataSourceV2
Will that then require an API break down the line? Do we save that for Spark 4?

-Matt Cheah

On Tue, Feb 26, 2019 at 4:53 PM Ryan Blue wrote:
> That's a good question.
>
> While I'd love to have a solution for that, I don't think it is a good idea
> to delay DSv2 until we have one. That is going to require a lot of internal
> changes and I don't see how we could make the release date if we are
> including an InternalRow replacement.
Re: [DISCUSS] Spark 3.0 and DataSourceV2
That's a good question.

While I'd love to have a solution for that, I don't think it is a good idea to delay DSv2 until we have one. That is going to require a lot of internal changes and I don't see how we could make the release date if we are including an InternalRow replacement.

On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah wrote:
> Reynold made a note earlier about a proper Row API that isn’t InternalRow –
> is that still on the table?
>
> -Matt Cheah

--
Ryan Blue
Software Engineer
Netflix
Re: [DISCUSS] Spark 3.0 and DataSourceV2
Thanks for bumping this, Matt. I think we can have the discussion here to clarify exactly what we’re committing to and then have a vote thread once we’re agreed.

Getting back to the DSv2 discussion, I think we have a good handle on what would be added:

- Plugin system for catalogs
- TableCatalog interface (I’ll start a vote thread for this SPIP shortly)
- TableCatalog implementation backed by SessionCatalog that can load v2 tables
- Resolution rule to load v2 tables using the new catalog
- CTAS logical and physical plan nodes
- Conversions from SQL parsed logical plans to v2 logical plans

Initially, this will always use the v2 catalog backed by SessionCatalog to avoid dependence on the multi-catalog work. All of those are already implemented and working, so I think it is reasonable that we can get them in.

Then we can consider a few stretch goals:

- Get in as much DDL as we can. I think create and drop table should be easy.
- Multi-catalog identifier parsing and multi-catalog support

If we get those last two in, it would be great. We can make the call closer to release time. Does anyone want to change this set of work?

On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah wrote:
> What would then be the next steps we'd take to collectively decide on plans
> and timelines moving forward? Might I suggest scheduling a conference call
> with appropriate PMCs to put our ideas together? Maybe such a discussion
> can take place at next week's meeting? Or do we need to have a separate
> formalized voting thread which is guided by a PMC?
>
> My suggestion is to try to make concrete steps forward and to avoid letting
> this slip through the cracks.
>
> I also think there would be merits to having a project plan and estimates
> around how long each of the features we want to complete is going to take
> to implement and review.
>
> -Matt Cheah
>
> On 2/24/19, 3:05 PM, "Sean Owen" wrote:
>
>     Sure, I don't read anyone making these statements though? Let's assume
>     good intent, that "foo should happen" as "my opinion as a member of
>     the community, which is not solely up to me, is that foo should
>     happen". I understand it's possible for a person to make their opinion
>     over-weighted; this whole style of decision making assumes good actors
>     and doesn't optimize against bad ones. Not that it can't happen, just
>     not seeing it here.
>
>     I have never seen any vote on a feature list, by a PMC or otherwise.
>     We can do that if really needed, I guess. But that also isn't the
>     authoritative process in play here, in contrast.
>
>     If there's not a more specific subtext or issue here, which is fine to
>     say (on private@ if it's sensitive or something), yes, let's move on
>     in good faith.
>
>     On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra wrote:
>     > There is nothing wrong with individuals advocating for what they
>     > think should or should not be in Spark 3.0, nor should anyone shy
>     > away from explaining why they think delaying the release for some
>     > reason is or isn't a good idea. What is a problem, or is at least
>     > something that I have a problem with, are declarative,
>     > pseudo-authoritative statements that 3.0 (or some other release)
>     > will or won't contain some feature, API, etc. or that some issue is
>     > or is not a blocker or worth delaying for. When the PMC has not
>     > voted on such issues, I'm often left thinking, "Wait... what? Who
>     > decided that, or where did that decision come from?"

--
Ryan Blue
Software Engineer
Netflix
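[Editor's note] The first two items on the list above, a plugin system for catalogs and a TableCatalog interface, can be sketched roughly as follows. All names and signatures here are hypothetical simplifications; the actual TableCatalog SPIP defines a richer API (schemas, partitioning, table properties), and this only illustrates the plugin-registry shape.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// A minimal sketch with hypothetical names: a TableCatalog-like interface
// plus a registry ("plugin system") that resolves catalogs by name, the way
// an analyzer rule could look up the catalog for a table identifier.
public class CatalogSketch {

    /** Simplified table handle; the real API carries schema and properties. */
    public static final class Table {
        public final String name;
        public Table(String name) { this.name = name; }
    }

    /** Hypothetical catalog interface: load/create/drop by name parts. */
    public interface TableCatalog {
        Optional<Table> loadTable(List<String> ident);
        Table createTable(List<String> ident);
        boolean dropTable(List<String> ident);
    }

    /** In-memory stand-in for a catalog backed by the SessionCatalog. */
    public static final class InMemoryCatalog implements TableCatalog {
        private final Map<List<String>, Table> tables = new HashMap<>();
        public Optional<Table> loadTable(List<String> ident) {
            return Optional.ofNullable(tables.get(ident));
        }
        public Table createTable(List<String> ident) {
            Table t = new Table(String.join(".", ident));
            tables.put(ident, t);
            return t;
        }
        public boolean dropTable(List<String> ident) {
            return tables.remove(ident) != null;
        }
    }

    // Plugin system: catalogs registered by name, looked up at analysis time.
    private static final Map<String, TableCatalog> REGISTRY = new HashMap<>();
    public static void register(String name, TableCatalog catalog) {
        REGISTRY.put(name, catalog);
    }
    public static TableCatalog load(String name) {
        return REGISTRY.get(name);
    }
}
```

Registering an `InMemoryCatalog` under "session" mirrors the plan above to back the initial v2 catalog with the existing SessionCatalog until the multi-catalog work lands.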
Re: [DISCUSS] Spark 3.0 and DataSourceV2
Reynold made a note earlier about a proper Row API that isn’t InternalRow – is that still on the table?

-Matt Cheah

From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Tuesday, February 26, 2019 at 4:40 PM To: Matt Cheah Cc: Sean Owen , Wenchen Fan , Xiao Li , Matei Zaharia , Spark Dev List Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

Thanks for bumping this, Matt. I think we can have the discussion here to clarify exactly what we’re committing to and then have a vote thread once we’re agreed.

Getting back to the DSv2 discussion, I think we have a good handle on what would be added:

- Plugin system for catalogs
- TableCatalog interface (I’ll start a vote thread for this SPIP shortly)
- TableCatalog implementation backed by SessionCatalog that can load v2 tables
- Resolution rule to load v2 tables using the new catalog
- CTAS logical and physical plan nodes
- Conversions from SQL parsed logical plans to v2 logical plans

Initially, this will always use the v2 catalog backed by SessionCatalog to avoid dependence on the multi-catalog work. All of those are already implemented and working, so I think it is reasonable that we can get them in.

Then we can consider a few stretch goals:

- Get in as much DDL as we can. I think create and drop table should be easy.
- Multi-catalog identifier parsing and multi-catalog support

If we get those last two in, it would be great. We can make the call closer to release time. Does anyone want to change this set of work?

On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah wrote: What would then be the next steps we'd take to collectively decide on plans and timelines moving forward? Might I suggest scheduling a conference call with appropriate PMCs to put our ideas together? Maybe such a discussion can take place at next week's meeting? Or do we need to have a separate formalized voting thread which is guided by a PMC? My suggestion is to try to make concrete steps forward and to avoid letting this slip through the cracks.
I also think there would be merits to having a project plan and estimates around how long each of the features we want to complete is going to take to implement and review. -Matt Cheah On 2/24/19, 3:05 PM, "Sean Owen" wrote: Sure, I don't read anyone making these statements though? Let's assume good intent, that "foo should happen" as "my opinion as a member of the community, which is not solely up to me, is that foo should happen". I understand it's possible for a person to make their opinion over-weighted; this whole style of decision making assumes good actors and doesn't optimize against bad ones. Not that it can't happen, just not seeing it here. I have never seen any vote on a feature list, by a PMC or otherwise. We can do that if really needed I guess. But that also isn't the authoritative process in play here, in contrast. If there's not a more specific subtext or issue here, which is fine to say (on private@ if it's sensitive or something), yes, let's move on in good faith. On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra wrote: > There is nothing wrong with individuals advocating for what they think should or should not be in Spark 3.0, nor should anyone shy away from explaining why they think delaying the release for some reason is or isn't a good idea. What is a problem, or is at least something that I have a problem with, are declarative, pseudo-authoritative statements that 3.0 (or some other release) will or won't contain some feature, API, etc. or that some issue is or is not blocker or worth delaying for. When the PMC has not voted on such issues, I'm often left thinking, "Wait... what? Who decided that, or where did that decision come from?" -- Ryan Blue Software Engineer Netflix
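[Editor's note] The "multi-catalog identifier parsing" stretch goal mentioned in the quoted list above is about resolving multi-part names like `catalog.db.table`. A hedged, Spark-independent sketch of the resolution rule (the function and constant names are illustrative, not Spark's implementation):

```python
# Illustrative sketch of multi-part identifier resolution: given parts like
# ["prod", "db", "events"], peel off a leading catalog name if it names a
# registered catalog, otherwise fall back to the default (session) catalog.
# This mirrors the idea discussed in the thread, not Spark's actual code.

DEFAULT_CATALOG = "session"

def resolve_identifier(parts, known_catalogs):
    """Split identifier parts into (catalog, namespace, table)."""
    if len(parts) > 1 and parts[0] in known_catalogs:
        # First part names a catalog: the rest is namespace + table.
        return parts[0], parts[1:-1], parts[-1]
    # No catalog prefix: resolve in the default catalog.
    return DEFAULT_CATALOG, parts[:-1], parts[-1]

catalogs = {"session", "prod"}
print(resolve_identifier(["prod", "db", "events"], catalogs))
# -> ('prod', ['db'], 'events')
print(resolve_identifier(["db", "events"], catalogs))
# -> ('session', ['db'], 'events')
```

The subtlety that made this a stretch goal is visible even in the sketch: an identifier like `db.events` is ambiguous between "catalog `db`, table `events`" and "database `db` in the default catalog", so resolution must consult the set of registered catalogs.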
Re: [DISCUSS] Spark 3.0 and DataSourceV2
What would then be the next steps we'd take to collectively decide on plans and timelines moving forward? Might I suggest scheduling a conference call with appropriate PMCs to put our ideas together? Maybe such a discussion can take place at next week's meeting? Or do we need to have a separate formalized voting thread which is guided by a PMC? My suggestion is to try to make concrete steps forward and to avoid letting this slip through the cracks. I also think there would be merits to having a project plan and estimates around how long each of the features we want to complete is going to take to implement and review. -Matt Cheah On 2/24/19, 3:05 PM, "Sean Owen" wrote: Sure, I don't read anyone making these statements though? Let's assume good intent, that "foo should happen" as "my opinion as a member of the community, which is not solely up to me, is that foo should happen". I understand it's possible for a person to make their opinion over-weighted; this whole style of decision making assumes good actors and doesn't optimize against bad ones. Not that it can't happen, just not seeing it here. I have never seen any vote on a feature list, by a PMC or otherwise. We can do that if really needed I guess. But that also isn't the authoritative process in play here, in contrast. If there's not a more specific subtext or issue here, which is fine to say (on private@ if it's sensitive or something), yes, let's move on in good faith. On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra wrote: > There is nothing wrong with individuals advocating for what they think should or should not be in Spark 3.0, nor should anyone shy away from explaining why they think delaying the release for some reason is or isn't a good idea. What is a problem, or is at least something that I have a problem with, are declarative, pseudo-authoritative statements that 3.0 (or some other release) will or won't contain some feature, API, etc. or that some issue is or is not blocker or worth delaying for. 
When the PMC has not voted on such issues, I'm often left thinking, "Wait... what? Who decided that, or where did that decision come from?"
Re: [DISCUSS] Spark 3.0 and DataSourceV2
Sure, I don't read anyone making these statements though? Let's assume good intent, that "foo should happen" as "my opinion as a member of the community, which is not solely up to me, is that foo should happen". I understand it's possible for a person to make their opinion over-weighted; this whole style of decision making assumes good actors and doesn't optimize against bad ones. Not that it can't happen, just not seeing it here. I have never seen any vote on a feature list, by a PMC or otherwise. We can do that if really needed I guess. But that also isn't the authoritative process in play here, in contrast. If there's not a more specific subtext or issue here, which is fine to say (on private@ if it's sensitive or something), yes, let's move on in good faith. On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra wrote: > There is nothing wrong with individuals advocating for what they think should > or should not be in Spark 3.0, nor should anyone shy away from explaining why > they think delaying the release for some reason is or isn't a good idea. What > is a problem, or is at least something that I have a problem with, are > declarative, pseudo-authoritative statements that 3.0 (or some other release) > will or won't contain some feature, API, etc. or that some issue is or is not > blocker or worth delaying for. When the PMC has not voted on such issues, I'm > often left thinking, "Wait... what? Who decided that, or where did that > decision come from?"
Re: [DISCUSS] Spark 3.0 and DataSourceV2
> > I’m not quite sure what you mean here. > I'll try to explain once more, then I'll drop it since continuing the rest of the discussion in this thread is more important than getting side-tracked. There is nothing wrong with individuals advocating for what they think should or should not be in Spark 3.0, nor should anyone shy away from explaining why they think delaying the release for some reason is or isn't a good idea. What is a problem, or is at least something that I have a problem with, are declarative, pseudo-authoritative statements that 3.0 (or some other release) will or won't contain some feature, API, etc. or that some issue is or is not blocker or worth delaying for. When the PMC has not voted on such issues, I'm often left thinking, "Wait... what? Who decided that, or where did that decision come from?" On Sun, Feb 24, 2019 at 1:27 PM Ryan Blue wrote: > Thanks to Matt for his philosophical take. I agree. > > The intent is to set a common goal, so that we work toward getting v2 in a > usable state as a community. Part of that is making choices to get it done > on time, which we have already seen on this thread: setting out more > clearly what we mean by “DSv2” and what we think we can get done on time. > > I don’t mean to say that we should commit to a plan that *requires* a > delay to the next release (which describes the goal better than 3.0 does). > But we should commit to making sure the goal is met, acknowledging that > this is one of the most important efforts for many people that work in this > community. > > I think it would help to clarify what this commitment means, at least to > me: > >1. What it means: the community will seriously consider delaying the >next release if this isn’t done by our initial deadline. >2. What it does not mean: delaying the release no matter what happens. > > In that event that this feature isn’t done on time, it would be up to the > community to decide what to do. 
But in the mean time, I think it is healthy > to set a goal and work toward it. (I am not making a distinction between > PMC and community here.) > > I think this commitment is a good idea for the same reason why we set > other goals: to hold ourselves accountable. When one sets a New Years > resolution to drop 10 pounds, it isn’t that the hope or intent wasn’t there > before. It is about having a (self-imposed) constraint that helps you make > hard choices: cake now or meet my goal? > > Spark 3.0 has many other major features as well, delaying the release has > significant cost and we should try our best to not let it happen.” > > I agree with Wenchen here. No one wants to actually delay the release. We > just want to push ourselves to make some tough decisions, using that delay > as a motivating factor. > > The fact that some entity other than the PMC thinks that Spark 3.0 should > contain certain new features or that it will be costly to them if 3.0 does > not contain those features is not dispositive. > > I’m not quite sure what you mean here. While I am representing my > employer, I am bringing up this topic as a member of the community, to > suggest a direction for the community to take, and I fully accept that the > decision is up to the community. I think it is reasonable to candidly state > how this matters; that context informs the discussion. > > On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra > wrote: > >> To your other message: I already see a number of PMC members here. Who's >>> the other entity? >>> >> >> I'll answer indirectly since pointing fingers isn't really my intent. In >> the absence of a PMC vote, I react negatively to individuals making new >> declarative policy statements or statements to the effect that Spark >> 3.0 will (or will not) include these features..., or that it will be too >> costly to do something. 
Maybe these are innocent shorthand that leave off a >> clarifying "in my opinion" or "according to the current state of JIRA" or >> some such. >> >> My points are simply that nobody other than the PMC has an authoritative >> say on such matters, and if we are at a point where the community needs >> some definitive guidance, then we need PMC involvement and a vote. That's >> not intended to preclude or terminate community discussion, because that >> is, indeed, lovely to see. >> >> On Fri, Feb 22, 2019 at 12:04 PM Sean Owen wrote: >> >>> To your other message: I already see a number of PMC members here. Who's >>> the other entity? The PMC is the thing that says a thing is a release, >>> sure, but this discussion is properly a community one. And here we are, >>> this is lovely to see. >>> >>> (May I remind everyone to casually, sometime, browse the large list of >>> other JIRAs targeted for Spark 3? it's much more than DSv2!) >>> >>> I can't speak to specific decisions here, but, I see: >>> >>> Spark 3 doesn't have a release date. Notionally it's 6 months after >>> Spark 2.4 (Nov 2018). It'd be reasonable to plan for a little more time. >>> Can we
Re: [DISCUSS] Spark 3.0 and DataSourceV2
Thanks to Matt for his philosophical take. I agree.

The intent is to set a common goal, so that we work toward getting v2 in a usable state as a community. Part of that is making choices to get it done on time, which we have already seen on this thread: setting out more clearly what we mean by “DSv2” and what we think we can get done on time.

I don’t mean to say that we should commit to a plan that *requires* a delay to the next release (which describes the goal better than 3.0 does). But we should commit to making sure the goal is met, acknowledging that this is one of the most important efforts for many people that work in this community.

I think it would help to clarify what this commitment means, at least to me:

1. What it means: the community will seriously consider delaying the next release if this isn’t done by our initial deadline.
2. What it does not mean: delaying the release no matter what happens.

In the event that this feature isn’t done on time, it would be up to the community to decide what to do. But in the meantime, I think it is healthy to set a goal and work toward it. (I am not making a distinction between PMC and community here.)

I think this commitment is a good idea for the same reason why we set other goals: to hold ourselves accountable. When one sets a New Year’s resolution to drop 10 pounds, it isn’t that the hope or intent wasn’t there before. It is about having a (self-imposed) constraint that helps you make hard choices: cake now or meet my goal?

“Spark 3.0 has many other major features as well, delaying the release has significant cost and we should try our best to not let it happen.”

I agree with Wenchen here. No one wants to actually delay the release. We just want to push ourselves to make some tough decisions, using that delay as a motivating factor.
The fact that some entity other than the PMC thinks that Spark 3.0 should contain certain new features or that it will be costly to them if 3.0 does not contain those features is not dispositive. I’m not quite sure what you mean here. While I am representing my employer, I am bringing up this topic as a member of the community, to suggest a direction for the community to take, and I fully accept that the decision is up to the community. I think it is reasonable to candidly state how this matters; that context informs the discussion. On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra wrote: > To your other message: I already see a number of PMC members here. Who's >> the other entity? >> > > I'll answer indirectly since pointing fingers isn't really my intent. In > the absence of a PMC vote, I react negatively to individuals making new > declarative policy statements or statements to the effect that Spark > 3.0 will (or will not) include these features..., or that it will be too > costly to do something. Maybe these are innocent shorthand that leave off a > clarifying "in my opinion" or "according to the current state of JIRA" or > some such. > > My points are simply that nobody other than the PMC has an authoritative > say on such matters, and if we are at a point where the community needs > some definitive guidance, then we need PMC involvement and a vote. That's > not intended to preclude or terminate community discussion, because that > is, indeed, lovely to see. > > On Fri, Feb 22, 2019 at 12:04 PM Sean Owen wrote: > >> To your other message: I already see a number of PMC members here. Who's >> the other entity? The PMC is the thing that says a thing is a release, >> sure, but this discussion is properly a community one. And here we are, >> this is lovely to see. >> >> (May I remind everyone to casually, sometime, browse the large list of >> other JIRAs targeted for Spark 3? it's much more than DSv2!) 
>> >> I can't speak to specific decisions here, but, I see: >> >> Spark 3 doesn't have a release date. Notionally it's 6 months after Spark >> 2.4 (Nov 2018). It'd be reasonable to plan for a little more time. Can we >> throw out... June 2019, and I update the website? It can slip but that >> gives a concrete timeframe around which to plan. What can comfortably get >> in by June 2019? >> >> Agreement that "DSv2" is going into Spark 3, for some definition of DSv2 >> that's probably roughly Matt's list. >> >> Changes that can't go into a minor release (API changes, etc) must by >> definition go into Spark 3.0. Agree those first and do those now. Delay >> Spark 3 until they're done and prioritize accordingly. >> Changes that can go into a minor release can go into 3.1, if needed. >> This has been in discussion long enough that I think whatever design(s) >> are on the table for DSv2 now are as close as one is going to get. The >> perfect is the enemy of the good. >> >> Aside from throwing out a date, I probably just restated what everyone >> said. But I was 'summoned' :) >> >> On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra >> wrote: >> >>> However, as other people mentioned, Spark 3.0 has many other major
Re: [DISCUSS] Spark 3.0 and DataSourceV2
> > To your other message: I already see a number of PMC members here. Who's > the other entity? > I'll answer indirectly since pointing fingers isn't really my intent. In the absence of a PMC vote, I react negatively to individuals making new declarative policy statements or statements to the effect that Spark 3.0 will (or will not) include these features..., or that it will be too costly to do something. Maybe these are innocent shorthand that leave off a clarifying "in my opinion" or "according to the current state of JIRA" or some such. My points are simply that nobody other than the PMC has an authoritative say on such matters, and if we are at a point where the community needs some definitive guidance, then we need PMC involvement and a vote. That's not intended to preclude or terminate community discussion, because that is, indeed, lovely to see. On Fri, Feb 22, 2019 at 12:04 PM Sean Owen wrote: > To your other message: I already see a number of PMC members here. Who's > the other entity? The PMC is the thing that says a thing is a release, > sure, but this discussion is properly a community one. And here we are, > this is lovely to see. > > (May I remind everyone to casually, sometime, browse the large list of > other JIRAs targeted for Spark 3? it's much more than DSv2!) > > I can't speak to specific decisions here, but, I see: > > Spark 3 doesn't have a release date. Notionally it's 6 months after Spark > 2.4 (Nov 2018). It'd be reasonable to plan for a little more time. Can we > throw out... June 2019, and I update the website? It can slip but that > gives a concrete timeframe around which to plan. What can comfortably get > in by June 2019? > > Agreement that "DSv2" is going into Spark 3, for some definition of DSv2 > that's probably roughly Matt's list. > > Changes that can't go into a minor release (API changes, etc) must by > definition go into Spark 3.0. Agree those first and do those now. 
Delay > Spark 3 until they're done and prioritize accordingly. > Changes that can go into a minor release can go into 3.1, if needed. > This has been in discussion long enough that I think whatever design(s) > are on the table for DSv2 now are as close as one is going to get. The > perfect is the enemy of the good. > > Aside from throwing out a date, I probably just restated what everyone > said. But I was 'summoned' :) > > On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra > wrote: > >> However, as other people mentioned, Spark 3.0 has many other major >>> features as well >>> >> >> I fundamentally disagree. First, Spark 3.0 has nothing until the PMC says >> it has something, and we have made no commitment along the lines that >> "Spark 3.0.0 will not be released unless it contains new features x, y and >> z." Second, major-version releases are not about adding new features. >> Major-version releases are about making changes to the public API that we >> cannot make in feature or bug-fix releases. If that is all that is >> accomplished in a particular major release, that's fine -- in fact, we >> quite intentionally did not target new features in the Spark 2.0.0 release. >> The fact that some entity other than the PMC thinks that Spark 3.0 should >> contain certain new features or that it will be costly to them if 3.0 does >> not contain those features is not dispositive. If there are public API >> changes that should occur in a timely fashion and there is also a list of >> new features that some users or contributors want to see in 3.0 but that >> look likely to not be ready in a timely fashion, then the PMC should fully >> consider releasing 3.0 without all those new features. There is no reason >> that they can't come in with 3.1.0. >> >
Re: [DISCUSS] Spark 3.0 and DataSourceV2
To your other message: I already see a number of PMC members here. Who's the other entity? The PMC is the thing that says a thing is a release, sure, but this discussion is properly a community one. And here we are, this is lovely to see. (May I remind everyone to casually, sometime, browse the large list of other JIRAs targeted for Spark 3? it's much more than DSv2!) I can't speak to specific decisions here, but, I see: Spark 3 doesn't have a release date. Notionally it's 6 months after Spark 2.4 (Nov 2018). It'd be reasonable to plan for a little more time. Can we throw out... June 2019, and I update the website? It can slip but that gives a concrete timeframe around which to plan. What can comfortably get in by June 2019? Agreement that "DSv2" is going into Spark 3, for some definition of DSv2 that's probably roughly Matt's list. Changes that can't go into a minor release (API changes, etc) must by definition go into Spark 3.0. Agree those first and do those now. Delay Spark 3 until they're done and prioritize accordingly. Changes that can go into a minor release can go into 3.1, if needed. This has been in discussion long enough that I think whatever design(s) are on the table for DSv2 now are as close as one is going to get. The perfect is the enemy of the good. Aside from throwing out a date, I probably just restated what everyone said. But I was 'summoned' :) On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra wrote: > However, as other people mentioned, Spark 3.0 has many other major >> features as well >> > > I fundamentally disagree. First, Spark 3.0 has nothing until the PMC says > it has something, and we have made no commitment along the lines that > "Spark 3.0.0 will not be released unless it contains new features x, y and > z." Second, major-version releases are not about adding new features. > Major-version releases are about making changes to the public API that we > cannot make in feature or bug-fix releases. 
If that is all that is > accomplished in a particular major release, that's fine -- in fact, we > quite intentionally did not target new features in the Spark 2.0.0 release. > The fact that some entity other than the PMC thinks that Spark 3.0 should > contain certain new features or that it will be costly to them if 3.0 does > not contain those features is not dispositive. If there are public API > changes that should occur in a timely fashion and there is also a list of > new features that some users or contributors want to see in 3.0 but that > look likely to not be ready in a timely fashion, then the PMC should fully > consider releasing 3.0 without all those new features. There is no reason > that they can't come in with 3.1.0. >
Re: [DISCUSS] Spark 3.0 and DataSourceV2
In addition to logical plans, we need SQL support. That requires resolving v2 tables from a catalog and a few other changes like separating v1 plans from SQL parsing (see the earlier dev list thread). I’d also like to add DDL operations for v2. I think it also makes sense to add a new DF write API, as we discussed in the sync as well. That way, users have an API to start moving to that always uses the v2 plans and behavior.

Here are all the commands that we have implemented on top of the proposed table catalog API. We should be able to get these working in upstream Spark fairly quickly:

- CREATE TABLE [IF NOT EXISTS] …
- CREATE TABLE … PARTITIONED BY …
- CREATE TABLE … AS SELECT …
- CREATE TABLE LIKE
- ALTER TABLE …
  - ADD COLUMNS …
  - DROP COLUMNS …
  - ALTER COLUMN … TYPE
  - ALTER COLUMN … COMMENT
  - RENAME COLUMN … TO …
  - SET TBLPROPERTIES …
  - UNSET TBLPROPERTIES …
- ALTER TABLE … RENAME TO …
- DROP TABLE [IF EXISTS] …
- DESCRIBE [FORMATTED|EXTENDED] …
- SHOW CREATE TABLE …
- SHOW TBLPROPERTIES
- REFRESH TABLE …
- INSERT INTO …
- INSERT OVERWRITE …
- DELETE FROM … WHERE …

On Thu, Feb 21, 2019 at 3:57 PM Matt Cheah wrote:
> To evaluate the amount of work required to get Data Source V2 into Spark 3.0, we should have a list of all the specific SPIPs and patches that are pending that would constitute a successful and usable revamp of that API. Here are the ones I could find and know off the top of my head:
>
> 1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252
>    1. In my opinion this is by far the most important API to get in, but it’s also the most important API to give thorough thought and evaluation.
> 2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE: https://issues.apache.org/jira/browse/SPARK-24923 + https://issues.apache.org/jira/browse/SPARK-24253
> 3. Catalogs for other entities, such as functions. Pluggable system for loading these.
> 4. Multi-Catalog support - https://issues.apache.org/jira/browse/SPARK-25006
> 5. Migration of existing sources to V2, particularly file sources like Parquet and ORC – requires #1 as discussed in yesterday’s meeting
>
> Can someone add to this list if we’re missing anything? It might also make sense to either assign a JIRA label or to update JIRA umbrella issues if any. Whatever mechanism works for being able to find all of these outstanding issues in one place.
>
> My understanding is that #1 is the most critical feature we need, and the feature that will go a long way towards allowing everything else to fall into place. #2 is also critical for external implementations of Data Source V2. I think we can afford to defer 3-5 to a future point release. But #1 and #2 are also the features that have remained open for the longest time and we really need to move forward on these. Putting a target release for 3.0 will help in that regard.
>
> -Matt Cheah
>
> *From: *Ryan Blue *Reply-To: *"rb...@netflix.com" *Date: *Thursday, February 21, 2019 at 2:22 PM *To: *Matei Zaharia *Cc: *Spark Dev List *Subject: *Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
> I'm all for making releases more often if we want. But this work could really use a target release to motivate getting it done. If we agree that it will block a release, then everyone is motivated to review and get the PRs in.
>
> If this work doesn't make it in the 3.0 release, I'm not confident that it will get done. Maybe we can have a release shortly after, but the timeline for these features -- that many of us need -- is nearly creeping into years. That's when alternatives start looking more likely to deliver. I'd rather see this work get in so we don't have to consider those alternatives, which is why I think this commitment is a good idea.
> > > > I also would like to see multi-catalog support, but that is more > reasonable to put off for a follow-up feature release, maybe 3.1. > > > > On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia > wrote: > > How large would the delay be? My 2 cents are that there’s nothing stopping > us from making feature releases more often if we want to, so we shouldn’t > see this as an “either delay 3.0 or release in >6 months” decision. If the > work is likely to get in with a small delay and simplifies our work after > 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it. > But if it would be a large delay, we should also weigh it against other > things that are going to get delayed if 3.0 moves much later. > > It might also be better to p
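[Editor's note] The DDL commands listed earlier in this message (CREATE TABLE, DROP TABLE, ALTER TABLE …) all reduce to a handful of operations on the table catalog API being proposed. As a rough sketch of that mapping (method and class names here are hypothetical; Spark's real TableCatalog interface differs):

```python
# Rough sketch of how the DDL statements listed above could be implemented on
# top of a table catalog interface. All names are illustrative, not Spark's
# actual TableCatalog API.

class Catalog:
    def __init__(self):
        self.tables = {}

    def create_table(self, name, columns, properties=None, if_not_exists=False):
        # CREATE TABLE [IF NOT EXISTS] ...
        if name in self.tables:
            if if_not_exists:
                return self.tables[name]
            raise ValueError(f"table already exists: {name}")
        self.tables[name] = {"columns": dict(columns),
                             "properties": dict(properties or {})}
        return self.tables[name]

    def drop_table(self, name, if_exists=False):
        # DROP TABLE [IF EXISTS] ...
        if name not in self.tables:
            if if_exists:
                return
            raise ValueError(f"no such table: {name}")
        del self.tables[name]

    def alter_table_add_column(self, name, column, col_type):
        # ALTER TABLE ... ADD COLUMNS ...
        self.tables[name]["columns"][column] = col_type

    def alter_table_set_property(self, name, key, value):
        # ALTER TABLE ... SET TBLPROPERTIES ...
        self.tables[name]["properties"][key] = value

cat = Catalog()
cat.create_table("db.events", [("id", "int")])
cat.alter_table_add_column("db.events", "ts", "timestamp")
cat.alter_table_set_property("db.events", "owner", "spark")
print(sorted(cat.tables["db.events"]["columns"]))  # -> ['id', 'ts']
cat.drop_table("db.events")
print(cat.tables)  # -> {}
```

This is why the thread treats the table catalog API as the gating piece: once a catalog can create, alter, and drop table metadata, each DDL statement becomes a thin parser-to-catalog translation.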
Re: [DISCUSS] Spark 3.0 and DataSourceV2
To evaluate the amount of work required to get Data Source V2 into Spark 3.0, we should have a list of all the specific SPIPs and patches that are pending that would constitute a successful and usable revamp of that API. Here are the ones I could find and know off the top of my head:

1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252. In my opinion this is by far the most important API to get in, but it’s also the most important API to give thorough thought and evaluation.
2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE: https://issues.apache.org/jira/browse/SPARK-24923 + https://issues.apache.org/jira/browse/SPARK-24253
3. Catalogs for other entities, such as functions. Pluggable system for loading these.
4. Multi-Catalog support - https://issues.apache.org/jira/browse/SPARK-25006
5. Migration of existing sources to V2, particularly file sources like Parquet and ORC – requires #1 as discussed in yesterday’s meeting

Can someone add to this list if we’re missing anything? It might also make sense to either assign a JIRA label or to update JIRA umbrella issues if any. Whatever mechanism works for being able to find all of these outstanding issues in one place.

My understanding is that #1 is the most critical feature we need, and the feature that will go a long way towards allowing everything else to fall into place. #2 is also critical for external implementations of Data Source V2. I think we can afford to defer 3-5 to a future point release. But #1 and #2 are also the features that have remained open for the longest time and we really need to move forward on these. Putting a target release for 3.0 will help in that regard.

-Matt Cheah

From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Thursday, February 21, 2019 at 2:22 PM To: Matei Zaharia Cc: Spark Dev List Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

I'm all for making releases more often if we want. But this work could really use a target release to motivate getting it done.
If we agree that it will block a release, then everyone is motivated to review and get the PRs in. If this work doesn't make it in the 3.0 release, I'm not confident that it will get done. Maybe we can have a release shortly after, but the timeline for these features -- that many of us need -- is nearly creeping into years. That's when alternatives start looking more likely to deliver. I'd rather see this work get in so we don't have to consider those alternatives, which is why I think this commitment is a good idea. I also would like to see multi-catalog support, but that is more reasonable to put off for a follow-up feature release, maybe 3.1. On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia wrote: How large would the delay be? My 2 cents are that there’s nothing stopping us from making feature releases more often if we want to, so we shouldn’t see this as an “either delay 3.0 or release in >6 months” decision. If the work is likely to get in with a small delay and simplifies our work after 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it. But if it would be a large delay, we should also weigh it against other things that are going to get delayed if 3.0 moves much later. It might also be better to propose a specific date to delay until, so people can still plan around when the release branch will likely be cut. Matei > On Feb 21, 2019, at 1:03 PM, Ryan Blue wrote: > > Hi everyone, > > In the DSv2 sync last night, we had a discussion about roadmap and what the > goal should be for getting the main features into Spark. We all agreed that > 3.0 should be that goal, even if it means delaying the 3.0 release. > > The possibility of delaying the 3.0 release may be controversial, so I want > to bring it up to the dev list to build consensus around it. The rationale > for this is partly that much of this work has been outstanding for more than > a year now. 
> If it doesn't make it into 3.0, then it would be another 6 months before it would be in a release, and it would be nearing 2 years to get the work done.
>
> Are there any objections to targeting 3.0 for this?
>
> In addition, much of the planning for multi-catalog support has been done to make v2 possible. Do we also want to include multi-catalog support?
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
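The Table Catalog API that Matt ranks first (SPARK-24252) was still a proposal at this point in the thread. As a rough illustration of the plugin idea under discussion, and not the API Spark actually shipped, a pluggable v2 table catalog might look something like the sketch below; every name and signature here is hypothetical.

```java
// Hypothetical sketch of the catalog plugin idea in SPARK-24252.
// Names and signatures are illustrative only, not Spark's real API.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class CatalogSketch {
    // A minimal table identifier: a namespace plus a table name.
    record Identifier(String namespace, String name) {}

    // A stand-in for table metadata; real tables carry a typed schema, etc.
    record Table(Identifier ident, Map<String, String> schema) {}

    // The shape of a pluggable table catalog: load, create, drop.
    interface TableCatalog {
        Optional<Table> loadTable(Identifier ident);
        Table createTable(Identifier ident, Map<String, String> schema);
        boolean dropTable(Identifier ident);
    }

    // In-memory implementation, similar in spirit to the proposed
    // SessionCatalog-backed v2 catalog used as the initial default.
    static class InMemoryCatalog implements TableCatalog {
        private final Map<Identifier, Table> tables = new HashMap<>();

        public Optional<Table> loadTable(Identifier ident) {
            return Optional.ofNullable(tables.get(ident));
        }

        public Table createTable(Identifier ident, Map<String, String> schema) {
            Table t = new Table(ident, schema);
            tables.put(ident, t);
            return t;
        }

        public boolean dropTable(Identifier ident) {
            return tables.remove(ident) != null;
        }
    }

    public static void main(String[] args) {
        TableCatalog catalog = new InMemoryCatalog();
        Identifier id = new Identifier("db", "t1");
        catalog.createTable(id, Map.of("id", "bigint"));
        System.out.println(catalog.loadTable(id).isPresent());
        catalog.dropTable(id);
        System.out.println(catalog.loadTable(id).isPresent());
    }
}
```

The point of the interface boundary is that CTAS, DROP TABLE, and the other v2 logical plans in Matt's list can target any catalog implementation, not just the built-in SessionCatalog.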
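The multi-catalog support discussed in this thread (SPARK-25006) turns on resolving identifiers like prod_catalog.db.table against registered catalogs. Below is a minimal sketch of one plausible resolution rule, assumed for illustration and not the committed design: if the first name part matches a registered catalog, route the rest of the identifier there; otherwise fall back to the default (session) catalog.

```java
// Illustrative sketch of multi-catalog identifier resolution (SPARK-25006).
// The rule here is an assumption for illustration, not Spark's final design.
import java.util.List;
import java.util.Set;

public class ResolveSketch {
    record Resolved(String catalog, List<String> ident) {}

    static Resolved resolve(List<String> parts, Set<String> catalogs, String defaultCatalog) {
        // A leading part that names a known catalog selects that catalog,
        // as long as something remains to identify the table.
        if (parts.size() > 1 && catalogs.contains(parts.get(0))) {
            return new Resolved(parts.get(0), parts.subList(1, parts.size()));
        }
        // Otherwise the whole identifier resolves in the default catalog.
        return new Resolved(defaultCatalog, parts);
    }

    public static void main(String[] args) {
        Set<String> known = Set.of("prod");
        System.out.println(resolve(List.of("prod", "db", "t"), known, "session"));
        System.out.println(resolve(List.of("db", "t"), known, "session"));
    }
}
```

A rule like this is why identifier parsing is listed as its own work item: the parser has to hand the analyzer the full dotted name before resolution can decide which part, if any, is a catalog.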