I noticed that the ticket to remove deprecated DataStream#fold() [1] has been created but not yet reach an agreement or assigned.
Actually fold related function and state descriptions has been deprecated for long than 3 years, and I think it's okay to remove such kind of state now. [1] https://issues.apache.org/jira/browse/FLINK-19035 Best, Yun Tang ________________________________ From: Aljoscha Krettek <aljos...@apache.org> Sent: Thursday, August 27, 2020 16:45 To: dev@flink.apache.org <dev@flink.apache.org> Subject: Re: [DISCUSS] Removing deprecated methods from DataStream API Did you consider DataStream.project() yet? In general I think we should remove most of the relational-ish methods from DataStream. More candidates in this set of methods would be the tuple index/expression methods for aggregations like min/max etc... Aljoscha On 25.08.20 20:52, Konstantin Knauf wrote: > I would argue that the guarantees of @Public methods that became > ineffective were broken when they became ineffective (and were deprecated). > > - ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10) > - ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9) > > Removing these methods seems like the better of two evils to me as it shows > users that they have been using no-ops for some time. > > On Thu, Aug 20, 2020 at 10:50 AM Stephan Ewen <se...@apache.org> wrote: > >> We have removed some public methods in the past, after a careful >> deprecation period, if they were not well working any more. >> >> The sentiment I got from users is that careful cleanup is in fact >> appreciated, otherwise things get confusing over time (the deprecated >> methods cause noise in the API). >> Still, we need to be very careful here. >> >> I would suggest to >> - start with the non-public breaking methods >> - remove fold() (very long deprecated) >> - remove split() buggy >> >> Removing the env.socketStream() and env.fileStream() methods would >> probably be good, too. They are very long deprecated and they don't work >> well (with checkpoints) and the sources are the first thing a user needs to >> understand when starting with Flink, so removing noise there is super >> helpful. >> >> >> On Thu, Aug 20, 2020 at 8:53 AM Dawid Wysakowicz <dwysakow...@apache.org> >> wrote: >> >>> Hey Till, >>> >>> You've got a good point here. Removing some of the methods would cause >>> breaking the stability guarantees. I do understand if we decide not to >>> remove them for that reason, let me explain though why I am thinking it >>> might make sense to remove them already. First of all I am a bit afraid it >>> might take a long time before we arrive at the 2.0 version. We have not >>> ever discussed that in the community. At the same time a lot of the methods >>> already don't work or are buggy, and we do not fix them any more. >>> >>> Methods which removing would not break the Public guarantees: >>> >>> - ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9) >>> - RuntimeContext#getAllAccumulators (deprecated in 0.10) >>> - ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7) >>> - >>> StreamExecutionEnvironment#setNumberOfExecutionRetries/getNumberOfExecutionRetries >>> (not the equivalent in the ExecutionConfig) >>> - StreamExecutionEnvironment#setStateBackend(AbstractStateBackend) >>> (deprecated in 1.5) >>> >>> Methods which removing would break the Public guarantees: >>> >>> which have no effect: >>> >>> - ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10) >>> - ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9) >>> >>> which are buggy or discouraged and thus we do not support fixing them: >>> >>> - DataStream#split (deprecated in 1.8) >>> - DataStream#fold and all related classes and methods such as >>> FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated in >>> 1.3/1.4) >>> >>> The methods like: >>> >>> - >>> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...), >>> >>> - methods in (Connected)DataStream that specify keys as either >>> indices or field names >>> - >>> ExecutionConfig#setNumberOfExecutionRetries/getNumberOfExecutionRetries >>> >>> should be working just fine and I feel the least eager to remove those. >>> >>> I'd suggest I will open PRs for removing the methods that will not cause >>> breakage of the Public guarantees as the general feedback was rather >>> positive. For the rest I do understand the resentment to do so and will not >>> do it in the 1.x branch. Still I think it is valuable to have the >>> discussion. >>> >>> Best, >>> >>> Dawid >>> >>> >>> On 18/08/2020 09:26, Till Rohrmann wrote: >>> >>> Having looked at the proposed set of methods to remove I've noticed that >>> some of them are actually annotated with @Public. According to our >>> stability guarantees, only major releases (1.0, 2.0, etc.) can break APIs >>> with this annotation. Hence, I believe that we cannot simply remove them >>> unless the community decides to change the stability guarantees we give or >>> by making the next release a major release (Flink 2.0). >>> >>> Cheers, >>> Till >>> >>> On Tue, Aug 18, 2020 at 5:57 AM Yun Gao <yungao...@aliyun.com> wrote: >>> >>>> +1 for removing the methods that are deprecated for a while & have >>>> alternative methods. >>>> >>>> One specific thing is that if we remove the DataStream#split, do we >>>> consider enabling side-output in more operators in the future ? Currently >>>> it should be only available in ProcessFunctions, but not available to other >>>> commonly used UDF like Source or AsyncFunction[1]. >>>> >>>> One temporary solution occurs to me is to add a ProcessFunction after >>>> the operators want to use side-output. But I think the solution is not very >>>> direct to come up with and if it really works we might add it to the >>>> document of side-output. >>>> >>>> [1] https://issues.apache.org/jira/browse/FLINK-7954 >>>> >>>> Best, >>>> Yun >>>> >>>> ------------------Original Mail ------------------ >>>> *Sender:*Kostas Kloudas <kklou...@gmail.com> >>>> *Send Date:*Tue Aug 18 03:52:44 2020 >>>> *Recipients:*Dawid Wysakowicz <dwysakow...@apache.org> >>>> *CC:*dev <dev@flink.apache.org>, user <u...@flink.apache.org> >>>> *Subject:*Re: [DISCUSS] Removing deprecated methods from DataStream API >>>> >>>>> +1 for removing them. >>>>> >>>>> >>>>> >>>>> From a quick look, most of them (not all) have been deprecated a long >>>>> time ago. >>>>> >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> Kostas >>>>> >>>>> >>>>> >>>>> On Mon, Aug 17, 2020 at 9:37 PM Dawid Wysakowicz wrote: >>>>> >>>>>> >>>>> >>>>>> @David Yes, my idea was to remove any use of fold method and all >>>>> related classes including WindowedStream#fold >>>>> >>>>>> >>>>> >>>>>> @Klou Good idea to also remove the deprecated enableCheckpointing() & >>>>> StreamExecutionEnvironment#readFile and alike. I did another pass over >>>>> some >>>>> of the classes and thought we could also drop: >>>>> >>>>>> >>>>> >>>>>> ExecutionConfig#set/getCodeAnalysisMode >>>>> >>>>>> ExecutionConfig#disable/enableSysoutLogging >>>>> >>>>>> ExecutionConfig#set/isFailTaskOnCheckpointError >>>>> >>>>>> ExecutionConfig#isLatencyTrackingEnabled >>>>> >>>>>> >>>>> >>>>>> As for the `forceCheckpointing` I am not fully convinced to doing it. >>>>> As far as I know iterations still do not participate in checkpointing >>>>> correctly. Therefore it still might make sense to force it. In other words >>>>> there is no real alternative to that method. Unless we only remove the >>>>> methods from StreamExecutionEnvironment and redirect to the setter in >>>>> CheckpointConfig. WDYT? >>>>> >>>>>> >>>>> >>>>>> An updated list of methods I suggest to remove: >>>>> >>>>>> >>>>> >>>>>> ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9) >>>>> >>>>>> ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10) >>>>> >>>>>> ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9) >>>>> >>>>>> ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7) >>>>> >>>>>> ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in >>>>> 1.1?) >>>>> >>>>>> >>>>> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...) >>>>> (deprecated in 1.2) >>>>> >>>>>> RuntimeContext#getAllAccumulators (deprecated in 0.10) >>>>> >>>>>> DataStream#fold and all related classes and methods such as >>>>> FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated in >>>>> 1.3/1.4) >>>>> >>>>>> StreamExecutionEnvironment#setStateBackend(AbstractStateBackend) >>>>> (deprecated in 1.5) >>>>> >>>>>> DataStream#split (deprecated in 1.8) >>>>> >>>>>> Methods in (Connected)DataStream that specify keys as either indices >>>>> or field names such as DataStream#keyBy, DataStream#partitionCustom, >>>>> ConnectedStream#keyBy, .... (deprecated in 1.11) >>>>> >>>>>> >>>>> >>>>>> Bear in mind that majority of the options listed above in >>>>> ExecutionConfig take no effect. They were left there purely to satisfy the >>>>> binary compatibility. Personally I don't see any benefit of leaving a >>>>> method and silently dropping the underlying feature. The only >>>>> configuration >>>>> that is respected is setting the number of execution retries. >>>>> >>>>>> >>>>> >>>>>> I also wanted to make it explicit that most of the changes above >>>>> would result in a binary incompatibility and require additional exclusions >>>>> in the japicmp. Those are: >>>>> >>>>>> >>>>> >>>>>> ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9) >>>>> >>>>>> ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10) >>>>> >>>>>> ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9) >>>>> >>>>>> ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7) >>>>> >>>>>> ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in >>>>> 1.1?) >>>>> >>>>>> DataStream#fold and all related classes and methods such as >>>>> FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated in >>>>> 1.3/1.4) >>>>> >>>>>> DataStream#split (deprecated in 1.8) >>>>> >>>>>> Methods in (Connected)DataStream that specify keys as either indices >>>>> or field names such as DataStream#keyBy, DataStream#partitionCustom, >>>>> ConnectedStream#keyBy, .... (deprecated in 1.11) >>>>> >>>>>> >>>>> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...) >>>>> (deprecated in 1.2) >>>>> >>>>>> >>>>> >>>>>> Looking forward to more opinions on the issue. >>>>> >>>>>> >>>>> >>>>>> Best, >>>>> >>>>>> >>>>> >>>>>> Dawid >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> On 17/08/2020 12:49, Kostas Kloudas wrote: >>>>> >>>>>> >>>>> >>>>>> Thanks a lot for starting this Dawid, >>>>> >>>>>> >>>>> >>>>>> Big +1 for the proposed clean-up, and I would also add the deprecated >>>>> >>>>>> methods of the StreamExecutionEnvironment like: >>>>> >>>>>> >>>>> >>>>>> enableCheckpointing(long interval, CheckpointingMode mode, boolean >>>>> force) >>>>> >>>>>> enableCheckpointing() >>>>> >>>>>> isForceCheckpointing() >>>>> >>>>>> >>>>> >>>>>> readFile(FileInputFormat inputFormat,String >>>>> >>>>>> filePath,FileProcessingMode watchType,long interval, FilePathFilter >>>>> >>>>>> filter) >>>>> >>>>>> readFileStream(...) >>>>> >>>>>> >>>>> >>>>>> socketTextStream(String hostname, int port, char delimiter, long >>>>> maxRetry) >>>>> >>>>>> socketTextStream(String hostname, int port, char delimiter) >>>>> >>>>>> >>>>> >>>>>> There are more, like the (get)/setNumberOfExecutionRetries() that were >>>>> >>>>>> deprecated long ago, but I have not investigated to see if they are >>>>> >>>>>> actually easy to remove. >>>>> >>>>>> >>>>> >>>>>> Cheers, >>>>> >>>>>> Kostas >>>>> >>>>>> >>>>> >>>>>> On Mon, Aug 17, 2020 at 10:53 AM Dawid Wysakowicz >>>>> >>>>>> wrote: >>>>> >>>>>> >>>>> >>>>>> Hi devs and users, >>>>> >>>>>> >>>>> >>>>>> I wanted to ask you what do you think about removing some of the >>>>> deprecated APIs around the DataStream API. >>>>> >>>>>> >>>>> >>>>>> The APIs I have in mind are: >>>>> >>>>>> >>>>> >>>>>> RuntimeContext#getAllAccumulators (deprecated in 0.10) >>>>> >>>>>> DataStream#fold and all related classes and methods such as >>>>> FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated in >>>>> 1.3/1.4) >>>>> >>>>>> StreamExecutionEnvironment#setStateBackend(AbstractStateBackend) >>>>> (deprecated in 1.5) >>>>> >>>>>> DataStream#split (deprecated in 1.8) >>>>> >>>>>> Methods in (Connected)DataStream that specify keys as either indices >>>>> or field names such as DataStream#keyBy, DataStream#partitionCustom, >>>>> ConnectedStream#keyBy, .... (deprecated in 1.11) >>>>> >>>>>> >>>>> >>>>>> I think the first three should be straightforward. They are long >>>>> deprecated. The getAccumulators method is not used very often in my >>>>> opinion. The same applies to the DataStream#fold which additionally is not >>>>> very performant. Lastly the setStateBackend has an alternative with a >>>>> class >>>>> from the AbstractStateBackend hierarchy, therefore it will be still code >>>>> compatible. Moreover if we remove the >>>>> #setStateBackend(AbstractStateBackend) we will get rid off warnings users >>>>> have right now when setting a statebackend as the correct method cannot be >>>>> used without an explicit casting. >>>>> >>>>>> >>>>> >>>>>> As for the DataStream#split I know there were some objections against >>>>> removing the #split method in the past. I still believe the output tags >>>>> can >>>>> replace the split method already. >>>>> >>>>>> >>>>> >>>>>> The only problem in the last set of methods I propose to remove is >>>>> that they were deprecated only in the last release and those method were >>>>> only partially deprecated. Moreover some of the methods were not >>>>> deprecated >>>>> in ConnectedStreams. Nevertheless I'd still be inclined to remove the >>>>> methods in this release. >>>>> >>>>>> >>>>> >>>>>> Let me know what do you think about it. >>>>> >>>>>> >>>>> >>>>>> Best, >>>>> >>>>>> >>>>> >>>>>> Dawid >>>> >>>> >