Re: [DISCUSS][Catalog API] Deprecate 4 Catalog APIs that take two parameters (dbName, tableName/functionName)
Yes. The current goal is a purely educational deprecation. So, given the proposal:

1. Existing users, and users who do not care about catalog names in table identifiers, can still use all of the APIs, which keep their past behavior.
2. New users who intend to use table identifiers with catalog names are warned by the annotation (and perhaps some additional comments on the API surface) that these 4 APIs will not serve their usage.

I believe this proposal is conservative: it does not intend to cause trouble for existing users, to force user migration, to delete APIs, or to hurt supportability. If there is anything I can do to make this goal clearer, I will do it.

Ultimately, the 4 APIs in this thread have the problem that they are not compatible with the 3-layer namespace, unlike the other APIs, which do support it. For people who want to include catalog names, the problem will remain, and we will probably have to do something about it.

-Rui
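For clarity on what option B would look like mechanically, here is a minimal sketch of the deprecation on the Catalog surface. The deprecation message and the "3.4.0" version string are illustrative placeholders, not the actual patch:

trait CatalogSketch {
  // Single-identifier form: with SPARK-39235, "a.b.c" resolves as
  // catalog.namespace.table, while "a.b" keeps its session-catalog meaning.
  def tableExists(tableName: String): Boolean

  // Two-parameter form proposed for deprecation: it leaves no room for a
  // catalog name, so it cannot reach v2 catalogs.
  @deprecated("Use tableExists with a qualified name such as \"catalog.db.table\".", "3.4.0")
  def tableExists(dbName: String, tableName: String): Boolean
}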
Re: [DISCUSS][Catalog API] Deprecate 4 Catalog APIs that take two parameters (dbName, tableName/functionName)
It's better to keep all APIs working. But in this case, I really have no idea how to make these 4 APIs reasonable. For example, tableExists(dbName: String, tableName: String) currently checks whether the table "dbName.tableName" exists in the Hive metastore, and it does not work with v2 catalogs at all. It's not only an unneeded API but also a confusing one. We need a mechanism to move users away from confusing APIs.

I agree that we should not abuse deprecation. I think a general principle for deprecation is that you intend to remove the API eventually, which is exactly the case here. We should remove these 4 APIs once most users have moved away.

Thanks,
Wenchen

On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun wrote:

> Thank you for starting the official discussion, Rui.
>
> 'Unneeded API' doesn't sound like a good frame for this discussion, because it ignores the existing users and code completely. Technically, the reasons mentioned above look irrelevant to any specific existing bugs or to future maintenance-cost savings. Instead, the deprecation already imposes costs on the community (your PR, the future migration guide, and communication with customers, such as Q&A) and on the users, who must actually migrate to the new APIs and validate the change. Given that, for now, the goal of this proposal looks like a purely educational one: advertising the new APIs to Apache Spark 3.4+ users.
>
> Can we be more conservative about deprecation in Apache Spark and allow users to use both APIs freely, without any concern about uncertain unsupportability? I simply want to avoid the situation where the purely educational deprecation itself becomes an `Unneeded Deprecation` in the community.
>
> Dongjoon.
>
> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang wrote:
>>
>> I want to highlight, in case I missed this in the original email:
>>
>> The 4 APIs will not be deleted. They will just be marked with deprecation annotations, and we encourage users to use their alternatives.
>>
>> -Rui
>>
>> On Thu, Jul 7, 2022 at 2:23 PM Rui Wang wrote:
>>>
>>> Hi Community,
>>>
>>> Proposal:
>>> I want to discuss a proposal to deprecate the following Catalog APIs:
>>>
>>> def listColumns(dbName: String, tableName: String): Dataset[Column]
>>> def getTable(dbName: String, tableName: String): Table
>>> def getFunction(dbName: String, functionName: String): Function
>>> def tableExists(dbName: String, tableName: String): Boolean
>>>
>>> Context:
>>> We have been adding support for table identifiers with catalog names (aka the 3-layer namespace) to the Catalog API in https://issues.apache.org/jira/browse/SPARK-39235.
>>> The basic idea is, if an API accepts:
>>> 1. only tableName: String, we allow it to accept "a.b.c", which goes through the analyzer, treating a as the catalog name, b as the namespace name, and c as the table name.
>>> 2. only dbName: String, we allow it to accept "a.b", which goes through the analyzer, treating a as the catalog name and b as the namespace name.
>>> Meanwhile, we still maintain backwards compatibility for such APIs, so past behavior remains the same. E.g. if you only use tableName, it is still resolved against the session catalog.
>>>
>>> With this effort ongoing, the above 4 APIs are not fully compatible with the 3-layer namespace.
>>>
>>> Take tableExists(dbName: String, tableName: String) as an example: it takes two parameters but leaves no room for the extra catalog name. And if we wanted to reuse the two parameters, which one would take more than one name part?
>>>
>>> How?
>>> So how do we improve the above 4 APIs? There are two options:
>>> a. Expand those four APIs to accept catalog names. For example, tableExists(catalogName: String, dbName: String, tableName: String).
>>> b. Mark those APIs as `deprecated`.
>>>
>>> I am proposing to follow option B, which deprecates the APIs.
>>>
>>> Why?
>>> 1. Reduce unneeded APIs. The existing APIs can support the same behavior given SPARK-39235. For example, tableExists(dbName, tableName) can be replaced by tableExists("dbName.tableName").
>>> 2. Reduce incomplete APIs. The APIs proposed for deprecation do not support the 3-layer namespace now, and it is hard to make them do so (where would the 3-part names go?).
>>> 3. Deprecation encourages users to migrate their usage of the APIs.
>>> 4. There is existing practice: we deprecated the createExternalTable API when adding the createTable API: https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220
>>>
>>> What do you think?
>>>
>>> Thanks,
>>> Rui Wang
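To make the migration concrete, here is a short usage sketch of the two call styles under discussion. The database, table, and catalog names are placeholders, and the qualified-identifier behavior assumes the SPARK-39235 work is in place:

import org.apache.spark.sql.SparkSession

object TableExistsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalog-api-sketch")
      .getOrCreate()

    // Deprecation candidate: two positional name parts, resolved only
    // against the session (Hive) catalog; v2 catalogs are unreachable.
    val viaTwoParams = spark.catalog.tableExists("my_db", "my_table")

    // Replacement: one qualified identifier. "my_db.my_table" keeps the
    // old session-catalog behavior, and "my_catalog.my_db.my_table" can
    // reach a v2 catalog via the 3-layer namespace.
    val viaQualifiedName = spark.catalog.tableExists("my_db.my_table")

    println(s"two-param: $viaTwoParams, qualified: $viaQualifiedName")
    spark.stop()
  }
}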
Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow
Dang, I was hoping it was the second one. In our case the data is too large to run the whole backfill for the aggregation in a single batch (the shuffle is too big). We currently resort to manually batching (i.e. not streaming) the backlog (anything older than the watermark) when we need to reprocess, because we can't really know for sure that our batches are processed in the correct event-time order when starting from scratch.

I'm not against deprecating Trigger.Once; I just wanted to chime in that someone was using it! I'm itching to upgrade and try out the new stuff.

Adam
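For readers wondering what "manually batching the backlog" can look like in practice, here is a hedged sketch of reprocessing everything older than a cutoff as a plain batch job rather than a stream. The paths, the column name, and the cutoff value are placeholders, not Adam's actual pipeline:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object ManualBackfillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("manual-backfill-sketch")
      .getOrCreate()
    import spark.implicits._

    // Placeholder cutoff: everything older than this is treated as backlog.
    val cutoff = java.sql.Timestamp.valueOf("2022-07-01 00:00:00")

    // Run the reprocessing as a batch job so the batch boundaries (and thus
    // the effective event-time ordering) stay under the job's control.
    val backlog = spark.read.parquet("/data/events")
      .where($"eventTime" < lit(cutoff))

    backlog.write.mode("append").parquet("/data/events_reprocessed")

    spark.stop()
  }
}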
Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow
Thanks for the input, Adam! Replying inline.

On Fri, Jul 8, 2022 at 8:48 PM Adam Binford wrote:

> We use Trigger.Once a lot, usually for backfilling data for new streams. I feel like I could see a continuing use case for "ignore trigger limits for this batch" (ignoring the whole issue of re-running the last failed batch vs. a new batch), but we haven't actually been able to upgrade yet and try out Trigger.AvailableNow, so that could end up replacing all our use cases.
>
> One question I did have is how it does (or is supposed to) handle watermarking. Is the watermark determined for each batch independently, like in a normal stream, or is it kept constant for all batches in a single AvailableNow run? For example, we have a stateful job that we need to rerun occasionally, and it takes ~6 batches to backfill all the data before catching up to live data. With Trigger.Once we know we won't accidentally drop any data due to the watermark when backfilling, because it's a single batch with no watermark yet. Would the same hold true if we backfill with Trigger.AvailableNow instead?

The behavior is the former one. Each batch advances the watermark, and it is immediately reflected in the next batch.

The number of batches Trigger.AvailableNow executes depends on the data source and the source options. For example, if you use the Kafka data source with Trigger.AvailableNow and do not specify any source option limiting the batch size, Trigger.AvailableNow will process all newly available data as a single microbatch. It may still not be exactly one microbatch: it would first handle any batch already logged in the WAL, and it also runs a no-data batch after all the microbatches. But I guess these additional batches wouldn't hurt your case.

If the data source doesn't allow processing all available data within a single microbatch (depending on the implementation of its default read limit), you could probably either 1) set the source's size-limit options to an unrealistically large value to enforce a single batch, or 2) set the watermark delay to an unrealistically large value. Both workarounds require you to use different source options/watermark configuration for backfill vs. normal runs; I agree it wouldn't be a smooth experience.

This proposal does not aim to remove Trigger.Once in the near future. Once we deprecate Trigger.Once, we will collect reports of use cases where Trigger.Once works better (like yours) over the course of several minor releases, and then we can really decide. (IMHO, handling backfill with Trigger.Once sounds like a workaround to me. Backfill may warrant its own design.)
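A sketch of the Kafka case described above may help. The bootstrap servers, topic, and checkpoint path are placeholders; maxOffsetsPerTrigger is the Kafka source's read-limit option referred to above as a "source option on limiting the size":

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaAvailableNowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kafka-availablenow-sketch")
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder topic
      // With no read-limit option such as maxOffsetsPerTrigger set, the
      // Kafka source under Trigger.AvailableNow processes all newly
      // available data as a single microbatch (plus a possible
      // WAL-recovery batch and a trailing no-data batch).
      .load()

    val query = events.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/backfill-ckpt") // placeholder
      .trigger(Trigger.AvailableNow())
      .start()

    query.awaitTermination()
  }
}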
Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow
We use Trigger.Once a lot, usually for backfilling data for new streams. I feel like I could see a continuing use case for "ignore trigger limits for this batch" (ignoring the whole issue of re-running the last failed batch vs. a new batch), but we haven't actually been able to upgrade yet and try out Trigger.AvailableNow, so that could end up replacing all our use cases.

One question I did have is how it does (or is supposed to) handle watermarking. Is the watermark determined for each batch independently, like in a normal stream, or is it kept constant for all batches in a single AvailableNow run? For example, we have a stateful job that we need to rerun occasionally, and it takes ~6 batches to backfill all the data before catching up to live data. With Trigger.Once we know we won't accidentally drop any data due to the watermark when backfilling, because it's a single batch with no watermark yet. Would the same hold true if we backfill with Trigger.AvailableNow instead?

Adam

--
Adam Binford
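To make the watermark question concrete, here is a minimal sketch of the kind of stateful aggregation being backfilled. The rate source and the one-hour delay are placeholders standing in for a real event stream:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger

object WatermarkBackfillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("watermark-backfill-sketch")
      .getOrCreate()
    import spark.implicits._

    // The rate source emits (timestamp, value) rows; it stands in for a
    // real backlog carrying event-time data.
    val events = spark.readStream.format("rate").load()

    val counts = events
      .withWatermark("timestamp", "1 hour")
      .groupBy(window($"timestamp", "10 minutes"))
      .count()

    // Under Trigger.Once the whole backfill is one batch, so no watermark
    // is established yet and nothing is dropped as late. Under
    // Trigger.AvailableNow the backfill may span several batches, and each
    // batch's watermark applies to the next one.
    val query = counts.writeStream
      .outputMode("append")
      .format("console")
      .option("checkpointLocation", "/tmp/watermark-ckpt") // placeholder
      .trigger(Trigger.AvailableNow())
      .start()

    query.awaitTermination()
  }
}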
Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow
Bump, to get a chance to expose the proposal to wider audiences.

Given that there are not many active contributors/maintainers in the area of Structured Streaming, I'd consider the discussion to be "lazy consensus" to avoid being stuck. I'll give a final reminder early next week and move forward if there are no outstanding objections.

On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim wrote:

> Hi dev,
>
> I would like to hear voices about deprecating Trigger.Once and promoting Trigger.AvailableNow as a replacement [1] in Structured Streaming. (It doesn't mean we remove Trigger.Once now or in the near future. That probably requires another discussion at some point.)
>
> Rationalization:
>
> The expected behavior of Trigger.Once is to read all data available since the last trigger and process it. This holds true when the last run terminated gracefully, but there are cases where streaming queries are not terminated gracefully. The last run may have written the offsets for a new batch before termination; in that case, a new run of Trigger.Once only processes the data belonging to that latest unfinished batch and doesn't process new data.
>
> The behavior is non-deterministic from the users' point of view, as end users wouldn't know whether the last run wrote the offsets or not unless they look into the query's checkpoint themselves.
>
> While Trigger.AvailableNow was introduced to solve the scalability issue of Trigger.Once, it also ensures that all data available at the point in time it is triggered gets processed, which consistently matches the expected behavior of Trigger.Once.
>
> Another issue with Trigger.Once is that it does not trigger a no-data batch immediately: when the watermark is calculated in batch N, it takes effect in batch N + 1. If the query is scheduled to run once per day, you only see the output produced by the new watermark in the next day's run. Trigger.AvailableNow, by contrast, handles the no-data batch as well before the query terminates.
>
> Please review, and let us know if you have any feedback or concerns about the proposal.
>
> Thanks!
> Jungtaek Lim
>
> 1. https://issues.apache.org/jira/browse/SPARK-36533
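For readers who just want the mechanical change, here is a minimal sketch of swapping the trigger on a file-source query. The input directory, schema, and checkpoint path are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType}

object TriggerSwapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("trigger-swap-sketch")
      .getOrCreate()

    val schema = new StructType().add("line", StringType)

    // Placeholder directory of newline-delimited JSON files.
    val input = spark.readStream.schema(schema).json("/tmp/in")

    val query = input.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/swap-ckpt") // placeholder
      // Before: .trigger(Trigger.Once()), a single batch per run, which can
      // miss new data if the previous run died after writing offsets.
      // After: process all data available now, across as many batches as
      // needed, including the no-data batch for the latest watermark.
      .trigger(Trigger.AvailableNow())
      .start()

    query.awaitTermination()
  }
}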
Re: Apache Spark 3.2.2 Release?
Thank you so much! :)

Dongjoon.

On Thu, Jul 7, 2022 at 6:51 PM Joshua Rosen wrote:

> +1; thanks for coordinating this!
>
> I have a few more correctness bugs to add to the list in your original email (these were originally missing the 'correctness' JIRA label):
>
> - https://issues.apache.org/jira/browse/SPARK-37643 : when charVarcharAsString is true, char datatype partition table query incorrect
> - https://issues.apache.org/jira/browse/SPARK-37865 : Spark should not dedup the groupingExpressions when the first child of Union has duplicate columns
> - https://issues.apache.org/jira/browse/SPARK-38787 : Possible correctness issue on stream-stream join when handling edge case
>
> On Thu, Jul 7, 2022 at 6:12 PM Dongjoon Hyun wrote:
>>
>> Thank you all.
>>
>> I'll check and prepare RC1 for next week.
>>
>> Dongjoon.