Re: [DISCUSS][Catalog API] Deprecate 4 Catalog APIs that take two parameters (dbName, tableName/functionName)

2022-07-08 Thread Rui Wang
Yes. The current goal is a purely educational deprecation.

So given the proposal:
1. Existing users, or users who do not care about catalog names in table
identifiers, can still use all the APIs, which keep their past behavior.
2. New users who intend to use table identifiers with catalog names are
warned by the annotation (and perhaps additional comments on the API
surface) that these 4 APIs will not serve their usage; see the sketch below.
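
As an illustration only, a minimal Scala sketch of what such a deprecation
annotation could look like on the Catalog interface (the message text and
the "3.4.0" version string are assumptions, not the actual patch):

    // Sketch: marking the two-parameter overload deprecated in favor of the
    // single qualified-identifier form.
    @deprecated("use tableExists with a fully qualified name, e.g. \"catalog.db.table\"", "3.4.0")
    def tableExists(dbName: String, tableName: String): Boolean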

I believe this proposal is conservative: it does not intend to cause trouble
for existing users, force user migration, delete APIs, or hurt
supportability. If there is anything I can do to make this goal clearer, I
will.

Ultimately, the 4 APIs in this thread have the problem that they are not
compatible with the 3-layer namespace, unlike the other APIs that do support
it. For people who want to include catalog names, the problem will remain,
and we will probably have to do something about it.

-Rui

On Fri, Jul 8, 2022 at 7:24 AM Wenchen Fan  wrote:

> It's better to keep all APIs working. But in this case, I really have no
> idea how to make these 4 APIs reasonable. For example, tableExists(dbName:
> String, tableName: String) currently checks if table "dbName.tableName"
> exists in the Hive metastore, and does not work with v2 catalogs at all.
> It's not only a "not needed" API, but also a confusing API. We need a
> mechanism to move users away from confusing APIs.
>
> I agree that we should not abuse deprecation. I think a general principle
> for using deprecation is that you intend to remove the API eventually, which
> is exactly the case here. We should remove these 4 APIs when most users
> have moved away.
>
> Thanks,
> Wenchen
>
> On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun 
> wrote:
>
>> Thank you for starting the official discussion, Rui.
>>
>> 'Unneeded API' doesn't sound like a good frame for this discussion
>> because it ignores the existing users and code completely.
>> Technically, the above-mentioned reasons look unrelated to any
>> specific existing bugs or future maintenance cost savings. Instead, the
>> deprecation already costs the community (your PR, the future
>> migration guide, and communication with users, such as Q&A) and costs
>> the users the actual migration to the new APIs and its validation.
>> Given that, for now, the goal of this proposal looks like a purely
>> educational one: advertising the new APIs to Apache Spark 3.4+ users.
>>
>> Can we be more conservative about deprecation in Apache Spark and allow
>> users to use both APIs freely, without any concern about uncertain
>> supportability? I simply want to avoid the situation where this purely
>> educational deprecation itself becomes `Unneeded Deprecation` in the
>> community.
>>
>> Dongjoon.
>>
>> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang  wrote:
>> >
>> > I want to highlight this in case it was missed in the original email:
>> >
>> > The 4 APIs will not be deleted. They will just be marked with deprecation
>> > annotations, and we encourage users to use their alternatives.
>> >
>> >
>> > -Rui
>> >
>> > On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:
>> >>
>> >> Hi Community,
>> >>
>> >> Proposal:
>> >> I want to discuss a proposal to deprecate the following Catalog APIs:
>> >> def listColumns(dbName: String, tableName: String): Dataset[Column]
>> >> def getTable(dbName: String, tableName: String): Table
>> >> def getFunction(dbName: String, functionName: String): Function
>> >> def tableExists(dbName: String, tableName: String): Boolean
>> >>
>> >>
>> >> Context:
>> >> We have been adding support for table identifiers with catalog names
>> >> (aka 3-layer namespace) to the Catalog API in
>> >> https://issues.apache.org/jira/browse/SPARK-39235.
>> >> The basic idea is, if an API accepts:
>> >> 1. only tableName: String, we allow it to accept "a.b.c", which goes
>> >> through the analyzer, treating a as the catalog name, b as the namespace
>> >> name, and c as the table name.
>> >> 2. only dbName: String, we allow it to accept "a.b", which goes through
>> >> the analyzer, treating a as the catalog name and b as the namespace name.
>> >> Meanwhile, we still maintain backwards compatibility for such APIs to
>> >> make sure past behavior remains the same. E.g., if you only use
>> >> tableName, it is still resolved by the session catalog.
>> >>
>> >> With this effort ongoing, the above 4 APIs become not fully compatible
>> >> with the 3-layer namespace.
>> >>
>> >> Take tableExists(dbName: String, tableName: String) as an example: it
>> >> takes two parameters but leaves no room for the extra catalog name. And
>> >> if we wanted to reuse the two parameters, which one would take more than
>> >> one name part?
>> >>
>> >>
>> >> How?
>> >> So how do we improve the above 4 APIs? There are two options:
>> >> a. Expand those four APIs to accept catalog names. For example,
>> >> tableExists(catalogName: String, dbName: String, tableName: String).
>> >> b. Mark those APIs as `deprecated`.
>> >>
>> >> I am proposing to follow option B, which deprecates the APIs.

Re: [DISCUSS][Catalog API] Deprecate 4 Catalog APIs that take two parameters (dbName, tableName/functionName)

2022-07-08 Thread Wenchen Fan
It's better to keep all APIs working. But in this case, I really have no
idea how to make these 4 APIs reasonable. For example, tableExists(dbName:
String, tableName: String) currently checks if table "dbName.tableName"
exists in the Hive metastore, and does not work with v2 catalogs at all.
It's not only a "not needed" API, but also a confusing API. We need a
mechanism to move users away from confusing APIs.
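
A minimal sketch of the inconsistency described above, using the real
Catalog API methods (the catalog, database, and table names are
hypothetical, and the qualified form assumes SPARK-39235 has landed):

    // Resolved through the 3-layer namespace, so a registered v2 catalog works.
    spark.catalog.tableExists("my_v2_catalog.db.tbl")

    // Only consults the session (Hive) catalog; v2 catalogs are never considered.
    spark.catalog.tableExists("db", "tbl")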

I agree that we should not abuse deprecation. I think a general principle
for using deprecation is that you intend to remove the API eventually, which
is exactly the case here. We should remove these 4 APIs when most users
have moved away.

Thanks,
Wenchen

On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun 
wrote:

> Thank you for starting the official discussion, Rui.
>
> 'Unneeded API' doesn't sound like a good frame for this discussion
> because it ignores the existing users and code completely.
> Technically, the above-mentioned reasons look unrelated to any
> specific existing bugs or future maintenance cost savings. Instead, the
> deprecation already costs the community (your PR, the future
> migration guide, and communication with users, such as Q&A) and costs
> the users the actual migration to the new APIs and its validation.
> Given that, for now, the goal of this proposal looks like a purely
> educational one: advertising the new APIs to Apache Spark 3.4+ users.
>
> Can we be more conservative about deprecation in Apache Spark and allow
> users to use both APIs freely, without any concern about uncertain
> supportability? I simply want to avoid the situation where this purely
> educational deprecation itself becomes `Unneeded Deprecation` in the
> community.
>
> Dongjoon.
>
> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang  wrote:
> >
> > I want to highlight this in case it was missed in the original email:
> >
> > The 4 APIs will not be deleted. They will just be marked with deprecation
> > annotations, and we encourage users to use their alternatives.
> >
> >
> > -Rui
> >
> > On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:
> >>
> >> Hi Community,
> >>
> >> Proposal:
> >> I want to discuss a proposal to deprecate the following Catalog APIs:
> >> def listColumns(dbName: String, tableName: String): Dataset[Column]
> >> def getTable(dbName: String, tableName: String): Table
> >> def getFunction(dbName: String, functionName: String): Function
> >> def tableExists(dbName: String, tableName: String): Boolean
> >>
> >>
> >> Context:
> >> We have been adding support for table identifiers with catalog names
> >> (aka 3-layer namespace) to the Catalog API in
> >> https://issues.apache.org/jira/browse/SPARK-39235.
> >> The basic idea is, if an API accepts:
> >> 1. only tableName: String, we allow it to accept "a.b.c", which goes
> >> through the analyzer, treating a as the catalog name, b as the namespace
> >> name, and c as the table name.
> >> 2. only dbName: String, we allow it to accept "a.b", which goes through
> >> the analyzer, treating a as the catalog name and b as the namespace name.
> >> Meanwhile, we still maintain backwards compatibility for such APIs to
> >> make sure past behavior remains the same. E.g., if you only use
> >> tableName, it is still resolved by the session catalog.
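
As an illustration of these parsing rules, a minimal sketch using the real
single-parameter Catalog API methods (the catalog, namespace, and table
names are hypothetical):

    spark.catalog.listTables("my_cat.sales")      // "a.b": catalog + namespace
    spark.catalog.getTable("my_cat.sales.orders") // "a.b.c": catalog + namespace + table
    spark.catalog.getTable("orders")              // unqualified: session catalog, as before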
> >>
> >> With this effort ongoing, the above 4 APIs become not fully compatible
> >> with the 3-layer namespace.
> >>
> >> Take tableExists(dbName: String, tableName: String) as an example: it
> >> takes two parameters but leaves no room for the extra catalog name. And
> >> if we wanted to reuse the two parameters, which one would take more than
> >> one name part?
> >>
> >>
> >> How?
> >> So how do we improve the above 4 APIs? There are two options:
> >> a. Expand those four APIs to accept catalog names. For example,
> >> tableExists(catalogName: String, dbName: String, tableName: String).
> >> b. Mark those APIs as `deprecated`.
> >>
> >> I am proposing to follow option B, which deprecates the APIs.
> >>
> >> Why?
> >> 1. Reduce unneeded APIs. The existing APIs can support the same behavior
> >> given SPARK-39235. For example, tableExists(dbName, tableName) can be
> >> replaced with tableExists("dbName.tableName") (see the sketch after this
> >> list).
> >> 2. Reduce incomplete APIs. The APIs proposed for deprecation do not
> >> support the 3-layer namespace now, and it is hard to make them do so
> >> (where would the 3-part names go?).
> >> 3. Deprecation encourages users to migrate their usage of the APIs.
> >> 4. There is existing precedent: we deprecated the createExternalTable
> >> API when adding the createTable API:
> >> https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220
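
A minimal sketch of the migration implied by point 1, using the real
tableExists overloads (the database and table names are hypothetical):

    // Deprecated two-parameter form: session catalog only.
    spark.catalog.tableExists("sales_db", "orders")

    // Replacement: one qualified identifier, which can also carry a catalog name.
    spark.catalog.tableExists("sales_db.orders")
    spark.catalog.tableExists("my_catalog.sales_db.orders")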
> >>
> >>
> >> What do you think?
> >>
> >> Thanks,
> >> Rui Wang
> >>
> >>
>


Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Adam Binford
Dang, I was hoping it was the second one. In our case the data is too large
to run the whole backfill for the aggregation in a single batch (the
shuffle is too big). We currently resort to manually batching (i.e. not
streaming) the backlog (anything older than the watermark) when we need to
reprocess, because we can't really know for sure our batches are processed
in the correct event-time order when starting from scratch.

I'm not against deprecating Trigger.Once, just wanted to chime in that
someone was using it! I'm itching to upgrade and try out the new stuff.

Adam

On Fri, Jul 8, 2022 at 9:16 AM Jungtaek Lim 
wrote:

> Thanks for the input, Adam! Replying inline.
>
> On Fri, Jul 8, 2022 at 8:48 PM Adam Binford  wrote:
>
>> We use Trigger.Once a lot, usually for backfilling data for new streams.
>> I feel like I could see a continuing use case for "ignore trigger limits
>> for this batch" (ignoring the whole issue with re-running the last failed
>> batch vs a new batch), but we haven't actually been able to upgrade yet and
>> try out Trigger.AvailableNow, so that could end up replacing all our use
>> cases.
>>
>> One question I did have is how it does (or is supposed to) handle
>> watermarking. Is the watermark determined for each batch independently like
>> a normal stream, or is it kept constant for all batches in a single
>> AvailableNow run? For example, we have a stateful job that we need to rerun
>> occasionally, and it takes ~6 batches to backfill all the data before
>> catching up to live data. With a Trigger.Once we know we won't accidentally
>> drop any data due to the watermark when backfilling, because it's a single
>> batch with no watermark yet. Would the same hold true if we backfill with
>> Trigger.AvailableNow instead?
>>
>
> The behavior is the former one. Each batch advances the watermark and it's
> immediately reflected in the next batch.
>
> The number of batches Trigger.AvailableNow will execute depends on the
> data source and its source options. For example, if you use the Kafka data
> source with Trigger.AvailableNow without specifying any source option that
> limits the read size, Trigger.AvailableNow will process all newly available
> data as a single microbatch. It may still not be exactly one microbatch -
> it would also first handle any batch already logged in the WAL, as well as
> handle the no-data batch after the run of all microbatches. But I guess
> these additional batches wouldn't hurt your case.
>
> If the data source doesn't allow processing all available data within a
> single microbatch (depending on the implementation of the default read
> limit), you could probably either 1) set the source option for the read
> limit to an unrealistically large value to enforce a single batch, or 2)
> set the watermark delay to an unrealistic value. Both workarounds require
> you to use different source options/watermark configurations for backfill
> vs. normal runs - I agree it wouldn't be a smooth experience.
>
> This proposal does not aim to remove Trigger.Once in the near future. As
> long as we deprecate Trigger.Once, we will get reports of use cases where
> Trigger.Once works better (like yours) over the span of several minor
> releases, and then we can really decide. (IMHO, handling backfill with
> Trigger.Once sounds like a workaround to me. Backfill may warrant its own
> design.)
>
>
>>
>> Adam
>>
>> On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim 
>> wrote:
>>
>>> Bump to get a chance to expose the proposal to wider audiences.
>>>
>>> Given that there are not many active contributors/maintainers in the
>>> area of Structured Streaming, I'd consider the discussion as "lazy
>>> consensus" to avoid getting stuck. I'll give a final reminder early next
>>> week, and move forward if there are no outstanding objections.
>>>
>>> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Hi dev,

 I would like to hear voices about deprecating Trigger.Once, and
 promoting Trigger.AvailableNow as a replacement [1] in Structured Streaming.
 (It doesn't mean we remove Trigger.Once now or in the near future. That
 probably requires another discussion at some point.)

 Rationale:

 The expected behavior of Trigger.Once is to read all data available
 since the last trigger and process it. This holds true when the last run
 terminated gracefully, but there are cases where streaming queries are not
 terminated gracefully. The last run may have written the offset for a new
 batch before termination; a new run of Trigger.Once then only processes
 the data covered by that latest unfinished batch and doesn't process new
 data.

 The behavior is not deterministic from the users' point of view, as end
 users wouldn't know whether the last run wrote the offset or not, unless
 they look into the query's checkpoint by themselves.

 While Trigger.AvailableNow came to solve the scalability issue of
 Trigger.Once, it also ensures that it tries to process all data available
 at the point in time it is triggered, which consistently delivers the
 expected behavior of Trigger.Once.

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Jungtaek Lim
Thanks for the input, Adam! Replying inline.

On Fri, Jul 8, 2022 at 8:48 PM Adam Binford  wrote:

> We use Trigger.Once a lot, usually for backfilling data for new streams. I
> feel like I could see a continuing use case for "ignore trigger limits for
> this batch" (ignoring the whole issue with re-running the last failed batch
> vs a new batch), but we haven't actually been able to upgrade yet and try
> out Trigger.AvailableNow, so that could end up replacing all our use cases.
>
> One question I did have is how it does (or is supposed to) handle
> watermarking. Is the watermark determined for each batch independently like
> a normal stream, or is it kept constant for all batches in a single
> AvailableNow run? For example, we have a stateful job that we need to rerun
> occasionally, and it takes ~6 batches to backfill all the data before
> catching up to live data. With a Trigger.Once we know we won't accidentally
> drop any data due to the watermark when backfilling, because it's a single
> batch with no watermark yet. Would the same hold true if we backfill with
> Trigger.AvailableNow instead?
>

The behavior is the former one. Each batch advances the watermark and it's
immediately reflected in the next batch.

The number of batches Trigger.AvailableNow will execute depends on the data
source and its source options. For example, if you use the Kafka data source
with Trigger.AvailableNow without specifying any source option that limits
the read size, Trigger.AvailableNow will process all newly available data as
a single microbatch. It may still not be exactly one microbatch - it would
also first handle any batch already logged in the WAL, as well as handle the
no-data batch after the run of all microbatches. But I guess these
additional batches wouldn't hurt your case.
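
For illustration, a minimal sketch of the Kafka setup described above,
assuming a SparkSession named spark (the broker, topic, and paths are
hypothetical; maxOffsetsPerTrigger is the Kafka source's read-limit option):

    import org.apache.spark.sql.streaming.Trigger

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      // With no read limit such as maxOffsetsPerTrigger set,
      // Trigger.AvailableNow processes all newly available data in one microbatch.
      .load()

    df.writeStream
      .format("parquet")
      .option("path", "/data/events")               // hypothetical sink path
      .option("checkpointLocation", "/ckpt/events") // hypothetical checkpoint
      .trigger(Trigger.AvailableNow())
      .start()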

If the data source doesn't allow processing all available data within a
single microbatch (depending on the implementation of the default read
limit), you could probably either 1) set the source option for the read
limit to an unrealistically large value to enforce a single batch, or 2) set
the watermark delay to an unrealistic value. Both workarounds require you to
use different source options/watermark configurations for backfill vs.
normal runs - I agree it wouldn't be a smooth experience.

This proposal does not aim to remove Trigger.Once in the near future. As
long as we deprecate Trigger.Once, we will get reports of use cases where
Trigger.Once works better (like yours) over the span of several minor
releases, and then we can really decide. (IMHO, handling backfill with
Trigger.Once sounds like a workaround to me. Backfill may warrant its own
design.)


>
> Adam
>
> On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim 
> wrote:
>
>> Bump to get a chance to expose the proposal to wider audiences.
>>
>> Given that there are not many active contributors/maintainers in the area
>> of Structured Streaming, I'd consider the discussion as "lazy consensus" to
>> avoid getting stuck. I'll give a final reminder early next week, and move
>> forward if there are no outstanding objections.
>>
>> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim 
>> wrote:
>>
>>> Hi dev,
>>>
>>> I would like to hear voices about deprecating Trigger.Once, and
>>> promoting Trigger.AvailableNow as a replacement [1] in Structured Streaming.
>>> (It doesn't mean we remove Trigger.Once now or in the near future. That
>>> probably requires another discussion at some point.)
>>>
>>> Rationale:
>>>
>>> The expected behavior of Trigger.Once is to read all data available since
>>> the last trigger and process it. This holds true when the last run
>>> terminated gracefully, but there are cases where streaming queries are not
>>> terminated gracefully. The last run may have written the offset for a new
>>> batch before termination; a new run of Trigger.Once then only processes the
>>> data covered by that latest unfinished batch and doesn't process new data.
>>>
>>> The behavior is not deterministic from the users' point of view, as end
>>> users wouldn't know whether the last run wrote the offset or not, unless
>>> they look into the query's checkpoint by themselves.
>>>
>>> While Trigger.AvailableNow came to solve the scalability issue of
>>> Trigger.Once, it also ensures that it tries to process all data available
>>> at the point in time it is triggered, which consistently delivers the
>>> expected behavior of Trigger.Once.
>>>
>>> Another issue with Trigger.Once is that it does not trigger a no-data
>>> batch immediately. When the watermark is calculated in batch N, it takes
>>> effect in batch N + 1. If the query is scheduled to run once per day, you
>>> only see the output from the new watermark in the next day's run. Thanks to
>>> its behavior, Trigger.AvailableNow handles the no-data batch as well before
>>> the query terminates.
>>>
>>> Please review and let us know if you have any feedback or concerns on
>>> the proposal.
>>>
>>> Thanks!
>>> Jungtaek Lim
>>>
>>> 1. https://issues.apache.org/jira/browse/SPARK-36533

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Adam Binford
We use Trigger.Once a lot, usually for backfilling data for new streams. I
feel like I could see a continuing use case for "ignore trigger limits for
this batch" (ignoring the whole issue with re-running the last failed batch
vs a new batch), but we haven't actually been able to upgrade yet and try
out Trigger.AvailableNow, so that could end up replacing all our use cases.

One question I did have is how it does (or is supposed to) handle
watermarking. Is the watermark determined for each batch independently like
a normal stream, or is it kept constant for all batches in a single
AvailableNow run? For example, we have a stateful job that we need to rerun
occasionally, and it takes ~6 batches to backfill all the data before
catching up to live data. With a Trigger.Once we know we won't accidentally
drop any data due to the watermark when backfilling, because it's a single
batch with no watermark yet. Would the same hold true if we backfill with
Trigger.AvailableNow instead?

Adam

On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim 
wrote:

> Bump to get a chance to expose the proposal to wider audiences.
>
> Given that there are not many active contributors/maintainers in the area
> of Structured Streaming, I'd consider the discussion as "lazy consensus" to
> avoid getting stuck. I'll give a final reminder early next week, and move
> forward if there are no outstanding objections.
>
> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim 
> wrote:
>
>> Hi dev,
>>
>> I would like to hear voices about deprecating Trigger.Once, and promoting
>> Trigger.AvailableNow as a replacement [1] in Structured Streaming.
>> (It doesn't mean we remove Trigger.Once now or in the near future. That
>> probably requires another discussion at some point.)
>>
>> Rationale:
>>
>> The expected behavior of Trigger.Once is to read all data available since
>> the last trigger and process it. This holds true when the last run
>> terminated gracefully, but there are cases where streaming queries are not
>> terminated gracefully. The last run may have written the offset for a new
>> batch before termination; a new run of Trigger.Once then only processes the
>> data covered by that latest unfinished batch and doesn't process new data.
>>
>> The behavior is not deterministic from the users' point of view, as end
>> users wouldn't know whether the last run wrote the offset or not, unless
>> they look into the query's checkpoint by themselves.
>>
>> While Trigger.AvailableNow came to solve the scalability issue of
>> Trigger.Once, it also ensures that it tries to process all data available
>> at the point in time it is triggered, which consistently delivers the
>> expected behavior of Trigger.Once.
>>
>> Another issue with Trigger.Once is that it does not trigger a no-data
>> batch immediately. When the watermark is calculated in batch N, it takes
>> effect in batch N + 1. If the query is scheduled to run once per day, you
>> only see the output from the new watermark in the next day's run. Thanks to
>> its behavior, Trigger.AvailableNow handles the no-data batch as well before
>> the query terminates.
>>
>> Please review and let us know if you have any feedback or concerns on the
>> proposal.
>>
>> Thanks!
>> Jungtaek Lim
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-36533
>>
>

-- 
Adam Binford


Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Jungtaek Lim
Bump to get a chance to expose the proposal to wider audiences.

Given that there are not many active contributors/maintainers in the area of
Structured Streaming, I'd consider the discussion as "lazy consensus" to
avoid getting stuck. I'll give a final reminder early next week, and move
forward if there are no outstanding objections.

On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim 
wrote:

> Hi dev,
>
> I would like to hear voices about deprecating Trigger.Once, and promoting
> Trigger.AvailableNow as a replacement [1] in Structured Streaming.
> (It doesn't mean we remove Trigger.Once now or in the near future. That
> probably requires another discussion at some point.)
>
> Rationale:
>
> The expected behavior of Trigger.Once is to read all data available since
> the last trigger and process it. This holds true when the last run
> terminated gracefully, but there are cases where streaming queries are not
> terminated gracefully. The last run may have written the offset for a new
> batch before termination; a new run of Trigger.Once then only processes the
> data covered by that latest unfinished batch and doesn't process new data.
>
> The behavior is not deterministic from the users' point of view, as end
> users wouldn't know whether the last run wrote the offset or not, unless
> they look into the query's checkpoint by themselves.
>
> While Trigger.AvailableNow came to solve the scalability issue of
> Trigger.Once, it also ensures that it tries to process all data available
> at the point in time it is triggered, which consistently delivers the
> expected behavior of Trigger.Once.
>
> Another issue with Trigger.Once is that it does not trigger a no-data
> batch immediately. When the watermark is calculated in batch N, it takes
> effect in batch N + 1. If the query is scheduled to run once per day, you
> only see the output from the new watermark in the next day's run. Thanks to
> its behavior, Trigger.AvailableNow handles the no-data batch as well before
> the query terminates.
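
To make the watermark timing concrete, a minimal sketch of a windowed
aggregation where the no-data batch matters ('events' is assumed to be a
streaming DataFrame; the column names, durations, and path are hypothetical):

    import org.apache.spark.sql.functions.{col, window}
    import org.apache.spark.sql.streaming.Trigger

    val counts = events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "5 minutes"))
      .count()

    counts.writeStream
      // Append mode emits a window only after the watermark passes its end,
      // i.e. in the batch *after* the one that advanced the watermark.
      .outputMode("append")
      .option("checkpointLocation", "/ckpt/counts") // hypothetical path
      // Trigger.AvailableNow runs that no-data batch before terminating;
      // Trigger.Once would leave it for the next scheduled run.
      .trigger(Trigger.AvailableNow())
      .start()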
>
> Please review and let us know if you have any feedback or concerns on the
> proposal.
>
> Thanks!
> Jungtaek Lim
>
> 1. https://issues.apache.org/jira/browse/SPARK-36533
>


Re: Apache Spark 3.2.2 Release?

2022-07-08 Thread Dongjoon Hyun
Thank you so much! :)

Dongjoon.

On Thu, Jul 7, 2022 at 6:51 PM Joshua Rosen  wrote:
>
> +1; thanks for coordinating this!
>
> I have a few more correctness bugs to add to the list in your original email 
> (these were originally missing the 'correctness' JIRA label):
>
> - https://issues.apache.org/jira/browse/SPARK-37643 : when 
> charVarcharAsString is true, char datatype partition table query incorrect
> - https://issues.apache.org/jira/browse/SPARK-37865 : Spark should not dedup 
> the groupingExpressions when the first child of Union has duplicate columns
> - https://issues.apache.org/jira/browse/SPARK-38787 : Possible correctness 
> issue on stream-stream join when handling edge case
>
>
> On Thu, Jul 7, 2022 at 6:12 PM Dongjoon Hyun  wrote:
>>
>> Thank you all.
>>
>> I'll check and prepare RC1 for next week.
>>
>> Dongjoon.
