[jira] [Created] (FLINK-36325) Implement basic restore from checkpoint for ForStStateBackend

2024-09-19 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-36325:
---

 Summary: Implement basic restore from checkpoint for 
ForStStateBackend
 Key: FLINK-36325
 URL: https://issues.apache.org/jira/browse/FLINK-36325
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Feifan Wang


As titled, implement basic restore from checkpoint for ForStStateBackend; 
rescaling will be implemented later.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re:[VOTE] FLIP-444: Native file copy support

2024-06-27 Thread Feifan Wang
+1 (non binding)




——

Best regards,

Feifan Wang




At 2024-06-25 16:58:22, "Piotr Nowojski"  wrote:
>Hi all,
>
>I would like to start a vote for the FLIP-444 [1]. The discussion thread is
>here [2].
>
>The vote will be open for at least 72 hours.
>
>Best,
>Piotrek
>
>[1] https://cwiki.apache.org/confluence/x/rAn9EQ
>[2] https://lists.apache.org/thread/lkwmyjt2bnmvgx4qpp82rldwmtd4516c


Re:[ANNOUNCE] New Apache Flink Committer - Zhongqiang Gong

2024-06-17 Thread Feifan Wang
Congratulations Zhongqiang !


——

Best regards,

Feifan Wang




At 2024-06-17 11:20:30, "Leonard Xu"  wrote:
>Hi everyone,
>On behalf of the PMC, I'm happy to announce that Zhongqiang Gong has become a 
>new Flink Committer!
>
>Zhongqiang has been an active Flink community member since November 2021, 
>contributing numerous PRs to both the Flink and Flink CDC repositories. As a 
>core contributor to Flink CDC, he developed the Oracle and SQL Server CDC 
>Connectors and managed essential website and CI migrations during the donation 
>of Flink CDC to Apache Flink.
>
>Beyond his technical contributions, Zhongqiang actively participates in 
>discussions on the Flink dev mailing list and responds to threads on the user 
>and user-zh mailing lists. As an Apache StreamPark (incubating) Committer, he 
>promotes Flink SQL and Flink CDC technologies at meetups and within the 
>StreamPark community.
>
>Please join me in congratulating Zhongqiang Gong for becoming an Apache Flink 
>committer!
>
>Best,
>Leonard (on behalf of the Flink PMC)


Re:[ANNOUNCE] New Apache Flink Committer - Hang Ruan

2024-06-17 Thread Feifan Wang
Congratulations Hang !


——

Best regards,

Feifan Wang




At 2024-06-17 11:17:13, "Leonard Xu"  wrote:
>Hi everyone,
>On behalf of the PMC, I'm happy to let you know that Hang Ruan has become a 
>new Flink Committer !
>
>Hang Ruan has been continuously contributing to the Flink project since August 
>2021. Since then, he has continuously contributed to Flink, Flink CDC, and 
>various Flink connector repositories, including flink-connector-kafka, 
>flink-connector-elasticsearch, flink-connector-aws, flink-connector-rabbitmq, 
>flink-connector-pulsar, and flink-connector-mongodb. Hang Ruan focuses on the 
>improvements related to connectors and catalogs and initiated FLIP-274. He is 
>most recognized as a core contributor and maintainer for the Flink CDC 
>project, contributing many features such as MySQL CDC newly table addition and 
>the Schema Evolution feature.
>
>Beyond his technical contributions, Hang Ruan is an active member of the Flink 
>community. He regularly engages in discussions on the Flink dev mailing list 
>and the user-zh and user mailing lists, participates in FLIP discussions, 
>assists with user Q&A, and consistently volunteers for release verifications.
>
>Please join me in congratulating Hang Ruan for becoming an Apache Flink 
>committer!
>
>Best,
>Leonard (on behalf of the Flink PMC)


[jira] [Created] (FLINK-35510) Implement basic incremental checkpoint for ForStStateBackend

2024-06-03 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-35510:
---

 Summary: Implement basic incremental checkpoint for 
ForStStateBackend
 Key: FLINK-35510
 URL: https://issues.apache.org/jira/browse/FLINK-35510
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / State Backends
Reporter: Feifan Wang


Use the low-level DB API to implement a basic incremental checkpoint for 
ForStStateBackend, following these steps:
 # db.disableFileDeletions()
 # db.getLiveFiles(true)
 # db.enableFileDeletions(false)
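For illustration, the sequence above can be sketched against a tiny stand-in for 
the RocksDB JNI calls. The `Db` interface and file names below are hypothetical; 
real code would call `org.rocksdb.RocksDB`, whose RocksJava methods carry the 
same names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of the three-step incremental checkpoint above, written against a
 * minimal stand-in for the RocksDB JNI API (hypothetical interface; a real
 * implementation would call org.rocksdb.RocksDB directly).
 */
public class IncrementalCheckpointSketch {

    /** Minimal stand-in for the subset of the RocksDB API used here. */
    interface Db {
        void disableFileDeletions();
        List<String> getLiveFiles(boolean flushMemtable);
        void enableFileDeletions(boolean force);
    }

    /** Lists live files while deletions are paused, re-enabling them afterwards. */
    static List<String> snapshotLiveFiles(Db db) {
        db.disableFileDeletions();          // 1. keep live SST files from being deleted
        try {
            return db.getLiveFiles(true);   // 2. flush the memtable and list live files
        } finally {
            db.enableFileDeletions(false);  // 3. resume deletions without forcing
        }
    }

    /** Runs the sketch against a fake DB and returns the observed call order. */
    static List<String> demoCallOrder() {
        List<String> calls = new ArrayList<>();
        Db fake = new Db() {
            public void disableFileDeletions() { calls.add("disable"); }
            public List<String> getLiveFiles(boolean flush) {
                calls.add("getLiveFiles(" + flush + ")");
                return Arrays.asList("000012.sst", "MANIFEST-000007");
            }
            public void enableFileDeletions(boolean force) {
                calls.add("enable(" + force + ")");
            }
        };
        snapshotLiveFiles(fake);
        return calls;
    }
}
```

The try/finally matters: file deletions must be re-enabled even if the upload of 
the listed files fails, or compaction garbage would accumulate indefinitely.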

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35434) Support pass exception in StateExecutor to runtime

2024-05-23 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-35434:
---

 Summary: Support pass exception in StateExecutor to runtime
 Key: FLINK-35434
 URL: https://issues.apache.org/jira/browse/FLINK-35434
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Feifan Wang


An exception may be thrown when _StateExecutor_ executes a state request, such 
as an IOException. We should pass the exception to the runtime and fail the job 
in this situation.

_InternalStateFuture#completeExceptionally()_ will be added per the [discussion 
here|https://github.com/apache/flink/pull/24739#discussion_r1590633134].
_ForStWriteBatchOperation_ and _ForStGeneralMultiGetOperation_ will then call 
this method when an exception occurs.
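The propagation path described above can be sketched with plain 
java.util.concurrent.CompletableFuture; the method names below are hypothetical 
stand-ins, since the real InternalStateFuture and StateExecutor are 
Flink-internal classes:

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of passing a StateExecutor exception to the runtime via
 * completeExceptionally(), modeled on plain CompletableFuture
 * (hypothetical names; not Flink's actual classes).
 */
public class StateFailurePropagationSketch {

    /** Executes one state request; completes the future exceptionally on I/O failure. */
    static CompletableFuture<byte[]> executeStateRequest(boolean simulateIoFailure) {
        CompletableFuture<byte[]> resultFuture = new CompletableFuture<>();
        try {
            if (simulateIoFailure) {
                throw new IOException("failed to read state file"); // simulated I/O error
            }
            resultFuture.complete(new byte[0]); // normal completion
        } catch (IOException e) {
            // the FLINK-35434 point: surface the cause instead of swallowing it
            resultFuture.completeExceptionally(e);
        }
        return resultFuture;
    }

    /** Returns the failure the runtime would fail the job with, or null on success. */
    static Throwable runtimeObservedFailure(CompletableFuture<byte[]> future) {
        AtomicReference<Throwable> cause = new AtomicReference<>();
        future.whenComplete((result, error) -> cause.set(error)); // runtime hook
        return cause.get();
    }
}
```

The key property is that the original IOException arrives at the runtime side 
intact, so the job can be failed with the real cause rather than a generic error.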



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re:[ANNOUNCE] New Apache Flink PMC Member - Lincoln Lee

2024-04-15 Thread Feifan Wang
Congratulations Lincoln !


——

Best regards,

Feifan Wang




At 2024-04-12 15:59:00, "Jark Wu"  wrote:
>Hi everyone,
>
>On behalf of the PMC, I'm very happy to announce that Lincoln Lee has
>joined the Flink PMC!
>
>Lincoln has been an active member of the Apache Flink community for
>many years. He mainly works on Flink SQL component and has driven
>/pushed many FLIPs around SQL, including FLIP-282/373/415/435 in
>the recent versions. He has a great technical vision of Flink SQL and
>participated in plenty of discussions in the dev mailing list. Besides
>that,
>he is community-minded, such as being the release manager of 1.19,
>verifying releases, managing release syncs, writing the release
>announcement etc.
>
>Congratulations and welcome Lincoln!
>
>Best,
>Jark (on behalf of the Flink PMC)


Re:[ANNOUNCE] New Apache Flink PMC Member - Jing Ge

2024-04-15 Thread Feifan Wang
Congratulations, Jing!


——

Best regards,

Feifan Wang




At 2024-04-12 16:02:01, "Jark Wu"  wrote:
>Hi everyone,
>
>On behalf of the PMC, I'm very happy to announce that Jing Ge has
>joined the Flink PMC!
>
>Jing has been contributing to Apache Flink for a long time. He continuously
>works on SQL, connectors, Source, and Sink APIs, test, and document
>modules while contributing lots of code and insightful discussions. He is
>one of the maintainers of Flink CI infra. He is also willing to help a lot
>in the
>community work, such as being the release manager for both 1.18 and 1.19,
>verifying releases, and answering questions on the mailing list. Besides
>that,
>he is continuously helping with the expansion of the Flink community and
>has
>given several talks about Flink at many conferences, such as Flink Forward
>2022 and 2023.
>
>Congratulations and welcome Jing!
>
>Best,
>Jark (on behalf of the Flink PMC)


Re:[ANNOUNCE] New Apache Flink Committer - Zakelly Lan

2024-04-15 Thread Feifan Wang
Congratulations, Zakelly!

——

Best regards,

Feifan Wang




At 2024-04-15 10:50:06, "Yuan Mei"  wrote:
>Hi everyone,
>
>On behalf of the PMC, I'm happy to let you know that Zakelly Lan has become
>a new Flink Committer!
>
>Zakelly has been continuously contributing to the Flink project since 2020,
>with a focus area on Checkpointing, State as well as frocksdb (the default
>on-disk state db).
>
>He leads several FLIPs to improve checkpoints and state APIs, including
>File Merging for Checkpoints and configuration/API reorganizations. He is
>also one of the main contributors to the recent efforts of "disaggregated
>state management for Flink 2.0" and drives the entire discussion in the
>mailing thread, demonstrating outstanding technical depth and breadth of
>knowledge.
>
>Beyond his technical contributions, Zakelly is passionate about helping the
>community in numerous ways. He spent quite some time setting up the Flink
>Speed Center and rebuilding the benchmark pipeline after the original one
>was out of lease. He helps build frocksdb and tests for the upcoming
>frocksdb release (bump rocksdb from 6.20.3->8.10).
>
>Please join me in congratulating Zakelly for becoming an Apache Flink
>committer!
>
>Best,
>Yuan (on behalf of the Flink PMC)


Re:[ANNOUNCE] New Apache Flink PMC Member - Lincoln Lee

2024-04-12 Thread Feifan Wang
Congratulations, Lincoln!


——

Best regards,

Feifan Wang




At 2024-04-12 15:59:00, "Jark Wu"  wrote:
>Hi everyone,
>
>On behalf of the PMC, I'm very happy to announce that Lincoln Lee has
>joined the Flink PMC!
>
>Lincoln has been an active member of the Apache Flink community for
>many years. He mainly works on Flink SQL component and has driven
>/pushed many FLIPs around SQL, including FLIP-282/373/415/435 in
>the recent versions. He has a great technical vision of Flink SQL and
>participated in plenty of discussions in the dev mailing list. Besides
>that,
>he is community-minded, such as being the release manager of 1.19,
>verifying releases, managing release syncs, writing the release
>announcement etc.
>
>Congratulations and welcome Lincoln!
>
>Best,
>Jark (on behalf of the Flink PMC)


Re:[VOTE] FLIP-441: Show the JobType and remove Execution Mode on Flink WebUI

2024-04-09 Thread Feifan Wang


+1 (non-binding)











——

Best regards,

Feifan Wang




At 2024-04-10 12:36:00, "Rui Fan" <1996fan...@gmail.com> wrote:
>Hi devs,
>
>Thank you to everyone for the feedback on FLIP-441: Show
>the JobType and remove Execution Mode on Flink WebUI[1]
>which has been discussed in this thread [2].
>
>I would like to start a vote for it. The vote will be open for at least 72
>hours unless there is an objection or not enough votes.
>
>[1] https://cwiki.apache.org/confluence/x/agrPEQ
>[2] https://lists.apache.org/thread/0s52w17w24x7m2zo6ogl18t1fy412vcd
>
>Best,
>Rui


Re:Re: [VOTE] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-28 Thread Feifan Wang
+1 (non-binding)




——

Best regards,

Feifan Wang




At 2024-03-28 18:43:36, "Yuan Mei"  wrote:
>+1 (binding)
>
>Best,
>Yuan
>
>On Wed, Mar 27, 2024 at 7:31 PM Jinzhong Li 
>wrote:
>
>> Hi devs,
>>
>>
>> I'd like to start a vote on the FLIP-428: Fault Tolerance/Rescale
>> Integration for Disaggregated State [1]. The discussion thread is here [2].
>>
>>
>> The vote will be open for at least 72 hours unless there is an objection or
>> insufficient votes.
>>
>> [1] https://cwiki.apache.org/confluence/x/UYp3EQ
>>
>> [2] https://lists.apache.org/thread/vr8f91p715ct4lop6b3nr0fh4z5p312b
>>
>> Best,
>>
>> Jinzhong
>>


Re:Re: [VOTE] FLIP-426: Grouping Remote State Access

2024-03-28 Thread Feifan Wang
+1 (non-binding)




——

Best regards,

Feifan Wang




At 2024-03-28 18:43:12, "Yuan Mei"  wrote:
>+1 (binding)
>
>Best,
>Yuan
>
>On Wed, Mar 27, 2024 at 6:56 PM Jinzhong Li 
>wrote:
>
>> Hi devs,
>>
>> I'd like to start a vote on the FLIP-426: Grouping Remote State Access [1].
>> The discussion thread is here [2].
>>
>> The vote will be open for at least 72 hours unless there is an objection or
>> insufficient votes.
>>
>>
>> [1] https://cwiki.apache.org/confluence/x/TYp3EQ
>>
>> [2] https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrf
>>
>>
>> Best,
>>
>> Jinzhong
>>


Re:Re: [VOTE] FLIP-427: Disaggregated state Store

2024-03-28 Thread Feifan Wang
+1 (non-binding)




——

Best regards,

Feifan Wang




At 2024-03-28 19:01:01, "Yuan Mei"  wrote:
>+1 (binding)
>
>Best
>Yuan
>
>
>
>
>On Wed, Mar 27, 2024 at 6:37 PM Hangxiang Yu  wrote:
>
>> Hi devs,
>>
>> Thanks all for your valuable feedback about FLIP-427: Disaggregated state
>> Store [1].
>> I'd like to start a vote on it.  The discussion thread is here [2].
>>
>> The vote will be open for at least 72 hours unless there is an objection or
>> insufficient votes.
>>
>> [1] https://cwiki.apache.org/confluence/x/T4p3EQ
>> [2] https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
>>
>>
>> Best,
>> Hangxiang
>>


Re:Re: [VOTE] FLIP-425: Asynchronous Execution Model

2024-03-28 Thread Feifan Wang
+1 (non-binding)



——

Best regards,

Feifan Wang




At 2024-03-28 20:20:39, "Piotr Nowojski"  wrote:
>+1 (binding)
>
>Piotrek
>
>On Thu, 28 Mar 2024 at 11:44, Yuan Mei wrote:
>
>> +1 (binding)
>>
>> Best,
>> Yuan
>>
>> On Thu, Mar 28, 2024 at 4:33 PM Xuannan Su  wrote:
>>
>> > +1 (non-binding)
>> >
>> > Best regards,
>> > Xuannan
>> >
>> > On Wed, Mar 27, 2024 at 6:28 PM Yanfei Lei  wrote:
>> > >
>> > > Hi everyone,
>> > >
>> > > Thanks for all the feedback about the FLIP-425: Asynchronous Execution
>> > > Model [1]. The discussion thread is here [2].
>> > >
>> > > The vote will be open for at least 72 hours unless there is an
>> > > objection or insufficient votes.
>> > >
>> > > [1] https://cwiki.apache.org/confluence/x/S4p3EQ
>> > > [2] https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0h
>> > >
>> > > Best regards,
>> > > Yanfei
>> >
>>


Re:Re: [VOTE] FLIP-424: Asynchronous State APIs

2024-03-28 Thread Feifan Wang
+1 (non-binding)




——

Best regards,

Feifan Wang




At 2024-03-28 20:21:50, "Jing Ge"  wrote:
>+1 (binding)
>
>Thanks!
>
>Best regards,
>Jing
>
>On Wed, Mar 27, 2024 at 11:23 AM Zakelly Lan  wrote:
>
>> Hi devs,
>>
>> I'd like to start a vote on the FLIP-424: Asynchronous State APIs [1]. The
>> discussion thread is here [2].
>>
>> The vote will be open for at least 72 hours unless there is an objection or
>> insufficient votes.
>>
>> [1] https://cwiki.apache.org/confluence/x/SYp3EQ
>> [2] https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864
>>
>>
>> Best,
>> Zakelly
>>


Re:Re: [VOTE] FLIP-423: Disaggregated State Storage and Management (Umbrella FLIP)

2024-03-28 Thread Feifan Wang
+1 (non-binding)

——

Best regards,

Feifan Wang




At 2024-03-28 20:20:21, "Piotr Nowojski"  wrote:
>+1 (binding)
>
>Piotrek
>
>On Thu, 28 Mar 2024 at 11:43, Yuan Mei wrote:
>
>> Hi devs,
>>
>> I'd like to start a vote on the FLIP-423: Disaggregated State Storage and
>> Management (Umbrella FLIP) [1]. The discussion thread is here [2].
>>
>> The vote will be open for at least 72 hours unless there is an objection or
>> insufficient votes.
>>
>> [1] https://cwiki.apache.org/confluence/x/R4p3EQ
>> [2] https://lists.apache.org/thread/ct8smn6g9y0b8730z7rp9zfpnwmj8vf0
>>
>>
>> Best,
>> Yuan
>>


Re:Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

2024-03-28 Thread Feifan Wang
Congratulations!

——

Best regards,

Feifan Wang




At 2024-03-28 20:02:43, "Yanfei Lei"  wrote:
>Congratulations!
>
>Best,
>Yanfei
>
>On Thu, 28 Mar 2024 at 19:59, Zhanghao Chen wrote:
>>
>> Congratulations!
>>
>> Best,
>> Zhanghao Chen
>> 
>> From: Yu Li 
>> Sent: Thursday, March 28, 2024 15:55
>> To: d...@paimon.apache.org 
>> Cc: dev ; user 
>> Subject: Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project
>>
>> CC the Flink user and dev mailing list.
>>
>> Paimon originated within the Flink community, initially known as Flink
>> Table Store, and all our incubating mentors are members of the Flink
>> Project Management Committee. I am confident that the bonds of
>> enduring friendship and close collaboration will continue to unite the
>> two communities.
>>
>> And congratulations all!
>>
>> Best Regards,
>> Yu
>>
>> On Wed, 27 Mar 2024 at 20:35, Guojun Li  wrote:
>> >
>> > Congratulations!
>> >
>> > Best,
>> > Guojun
>> >
>> > On Wed, Mar 27, 2024 at 5:24 PM wulin  wrote:
>> >
>> > > Congratulations~
>> > >
>> > > > On 27 Mar 2024 at 15:54, 王刚 wrote:
>> > > >
>> > > > Congratulations~
>> > > >
>> > > >> On 26 Mar 2024 at 10:25, Jingsong Li wrote:
>> > > >>
>> > > >> Hi Paimon community,
>> > > >>
>> > > >> I’m glad to announce that the ASF board has approved a resolution to
>> > > >> graduate Paimon into a full Top Level Project. Thanks to everyone for
>> > > >> your help to get to this point.
>> > > >>
>> > > >> I just created an issue to track the things we need to modify [2],
>> > > >> please comment on it if you feel that something is missing. You can
>> > > >> refer to apache documentation [1] too.
>> > > >>
>> > > >> And, we already completed the GitHub repo migration [3], please update
>> > > >> your local git repo to track the new repo [4].
>> > > >>
>> > > >> You can run the following command to complete the remote repo tracking
>> > > >> migration.
>> > > >>
>> > > >> git remote set-url origin https://github.com/apache/paimon.git
>> > > >>
>> > > >> If you have a different name, please change the 'origin' to your 
>> > > >> remote
>> > > name.
>> > > >>
>> > > >> Please join me in celebrating!
>> > > >>
>> > > >> [1]
>> > > https://incubator.apache.org/guides/transferring.html#life_after_graduation
>> > > >> [2] https://github.com/apache/paimon/issues/3091
>> > > >> [3] https://issues.apache.org/jira/browse/INFRA-25630
>> > > >> [4] https://github.com/apache/paimon
>> > > >>
>> > > >> Best,
>> > > >> Jingsong Lee
>> > >
>> > >


Re: [DISCUSS] FLIP-427: Disaggregated State Store

2024-03-27 Thread Feifan Wang
Thanks for your reply, Hangxiang. I totally agree with you about the JNI part.

Hi Yun Tang, I just noticed that FLIP-427 mentions “The life cycle of working 
dir is managed as before local strategy.” IIUC, the working dir will be deleted 
after the TaskManager exits, and I think that's enough for the current stage. 
WDYT?

——

Best regards,

Feifan Wang




At 2024-03-28 12:18:56, "Hangxiang Yu"  wrote:
>Hi, Feifan.
>
>Thanks for your reply.
>
>What if we only use jni to access DFS that needs to reuse Flink FileSystem?
>> And all local disk access through native api. This idea is based on the
>> understanding that jni overhead is not worth mentioning compared to DFS
>> access latency. It might make more sense to consider avoiding jni overhead
>> for faster local disks. Since local disk as secondary is already under
>> consideration [1], maybe we can discuss in that FLIP whether to use native
>> api to access local disk?
>>
>This is a good suggestion. It's reasonable to use native api to access
>local disk cache since it requires lower latency compared to remote access.
>I also believe that the jni overhead is relatively negligible when weighed
>against the latency of remote I/O as mentioned in the FLIP.
>So I think we could just go on proposal 2 and keep proposal 1 as a
>potential future optimization, which could work better when there is a
>higher performance requirement or some native libraries of filesystems have
>significantly higher performance and resource usage compared to their java
>libs.
>
>
>On Thu, Mar 28, 2024 at 11:39 AM Feifan Wang  wrote:
>
>> Thanks for this valuable proposal Hangxiang !
>>
>>
>> > If we need to introduce a JNI call during each filesystem call, that
>> would be N times JNI cost compared with the current RocksDB state-backend's
>> JNI cost.
>> What if we only use jni to access DFS that needs to reuse Flink
>> FileSystem? And all local disk access through native api. This idea is
>> based on the understanding that jni overhead is not worth mentioning
>> compared to DFS access latency. It might make more sense to consider
>> avoiding jni overhead for faster local disks. Since local disk as secondary
>> is already under consideration [1], maybe we can discuss in that FLIP
>> whether to use native api to access local disk?
>>
>>
>> >I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>> >Different disaggregated state storages may have their own semantics about
>> >this configuration, e.g. life cycle, supported file systems or storages.
>> I agree with considering moving this configuration up to the engine level
>> until there are other disaggreated backends.
>>
>>
>> [1] https://cwiki.apache.org/confluence/x/U4p3EQ
>>
>> ——
>>
>> Best regards,
>>
>> Feifan Wang
>>
>>
>>
>>
>> At 2024-03-28 09:55:48, "Hangxiang Yu"  wrote:
>> >Hi, Yun.
>> >Thanks for the reply.
>> >
>> >The JNI cost you considered is right. As replied to Yue, I agreed to leave
>> >space and consider proposal 1 as an optimization in the future, which is
>> >also updated in the FLIP.
>> >
>> >The other question is that the configuration of
>> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt
>> >> state-backend, how would it be if we introduce another disaggregated
>> state
>> >> storage? Thus, I think `state.backend.disaggregated.working-dir` might
>> be a
>> >> better configuration name.
>> >
>> >I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>> >Different disaggregated state storages may have their own semantics about
>> >this configuration, e.g. life cycle, supported file systems or storages.
>> >Maybe it's more suitable to consider it together when we introduce other
>> >disaggregated state storages in the future.
>> >
>> >On Thu, Mar 28, 2024 at 12:02 AM Yun Tang  wrote:
>> >
>> >> Hi Hangxiang,
>> >>
>> >> The design looks good, and I also support leaving space for proposal 1.
>> >>
>> >> As you know, loading index/filter/data blocks for querying across levels
>> >> would introduce high IO access within the LSM tree for old data. If we
>> need
>> >> to introduce a JNI call during each filesystem call, that would be N
>> times
>> >> JNI cost compared with the current RocksDB state-backend's JNI cost.
>> >>
>> >> The other ques

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-27 Thread Feifan Wang
And I think the cleanup of the working dir should be discussed in FLIP-427 [1] 
(this mailing list thread [2])?


[1] https://cwiki.apache.org/confluence/x/T4p3EQ
[2] https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft

——

Best regards,

Feifan Wang




At 2024-03-28 11:56:22, "Feifan Wang"  wrote:
>Hi Jinzhong :
>
>
>> I suggest that we could postpone this topic for now and consider it 
>> comprehensively combined with the TM ownership file management in the future 
>> FLIP.
>
>
>Sorry I still think we should consider the cleanup of the working dir in this 
>FLIP, although we may come up with a better solution in a subsequent flip, I 
>think it is important to maintain the integrity of the current changes. 
>Otherwise we may suffer from wasted DFS space for some time.
>Perhaps we only need a simple cleanup strategy at this stage, such as 
>proactive cleanup when TM exits. While this may fail in the case of a TM 
>crash, it already alleviates the problem.
>
>
>
>
>——
>
>Best regards,
>
>Feifan Wang
>
>
>
>
>At 2024-03-28 11:15:11, "Jinzhong Li"  wrote:
>>Hi Yun,
>>
>>Thanks for your reply.
>>
>>> 1. Why must we have another 'subTask-checkpoint-sub-dir'
>>> under the shared directory? if we don't consider making
>>> TM ownership in this FLIP, this design seems unnecessary.
>>
>> Good catch! We will not change the directory layout of shared directory in
>>this FLIP. I have already removed this part from this FLIP. I think we
>>could revisit this topic in a future FLIP about TM ownership.
>>
>>> 2. This FLIP forgets to mention the cleanup of the remote
>>> working directory in case of the taskmanager crushes,
>>> even though this is an open problem, we can still leave
>>> some space for future optimization.
>>
>>Considering that we have plans to merge TM working dir and checkpoint dir
>>into one directory, I suggest that we could postpone this topic for now and
>>consider it comprehensively combined with the TM ownership file management
>>in the future FLIP.
>>
>>Best,
>>Jinzhong
>>
>>
>>
>>On Wed, Mar 27, 2024 at 11:49 PM Yun Tang  wrote:
>>
>>> Hi Jinzhong,
>>>
>>> The overall design looks good.
>>>
>>> I have two minor questions:
>>>
>>> 1. Why must we have another 'subTask-checkpoint-sub-dir' under the shared
>>> directory? if we don't consider making TM ownership in this FLIP, this
>>> design seems unnecessary.
>>> 2. This FLIP forgets to mention the cleanup of the remote working
>>> directory in case of the taskmanager crushes, even though this is an open
>>> problem, we can still leave some space for future optimization.
>>>
>>> Best,
>>> Yun Tang
>>>
>>> 
>>> From: Jinzhong Li 
>>> Sent: Monday, March 25, 2024 10:41
>>> To: dev@flink.apache.org 
>>> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for
>>> Disaggregated State
>>>
>>> Hi Yue,
>>>
>>> Thanks for your comments.
>>>
>>> The CURRENT is a special file that points to the latest manifest log
>>> file. As Zakelly explained above, we could record the latest manifest
>>> filename during sync phase, and write the filename into CURRENT snapshot
>>> file during async phase.
>>>
>>> Best,
>>> Jinzhong
>>>
>>> On Fri, Mar 22, 2024 at 11:16 PM Zakelly Lan 
>>> wrote:
>>>
>>> > Hi Yue,
>>> >
>>> > Thanks for bringing this up!
>>> >
>>> > The CURRENT FILE is the special one, which should be snapshot during the
>>> > sync phase (temporary load into memory). Thus we can solve this.
>>> >
>>> >
>>> > Best,
>>> > Zakelly
>>> >
>>> > On Fri, Mar 22, 2024 at 4:55 PM yue ma  wrote:
>>> >
>>> > > Hi jinzhong,
>>> > > Thanks for you reply. I still have some doubts about the first
>>> question.
>>> > Is
>>> > > there such a case
>>> > > When you made a snapshot during the synchronization phase, you recorded
>>> > the
>>> > > current and manifest 8, but before asynchronous phase, the manifest
>>> > reached
>>> > > the size threshold and then the CURRENT FILE pointed to the new
>>> manifest
>>> > 9,
>>>

Re:Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-27 Thread Feifan Wang
Hi Jinzhong :


> I suggest that we could postpone this topic for now and consider it 
> comprehensively combined with the TM ownership file management in the future 
> FLIP.


Sorry, I still think we should consider the cleanup of the working dir in this 
FLIP. Although we may come up with a better solution in a subsequent FLIP, I 
think it is important to maintain the integrity of the current changes; 
otherwise we may suffer from wasted DFS space for some time.
Perhaps we only need a simple cleanup strategy at this stage, such as proactive 
cleanup when the TM exits. While this may fail in the case of a TM crash, it 
already alleviates the problem.
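Such a proactive-cleanup-on-exit strategy could be sketched as below. This is 
purely illustrative — the class name, directory layout, and shutdown-hook 
registration are assumptions, not part of the FLIP:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Sketch: best-effort recursive deletion of a TM working directory on exit. */
public class WorkingDirCleanupSketch {

    /** Deletes dir and everything under it; deepest entries first. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()) // children before parents
                 .forEach(p -> {
                     try {
                         Files.delete(p);
                     } catch (IOException e) {
                         // best effort only: a crashed TM skips this anyway
                     }
                 });
        }
    }

    /** Registers the cleanup to run on normal JVM shutdown (not on a crash). */
    static void registerExitCleanup(Path workingDir) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                deleteRecursively(workingDir);
            } catch (IOException ignored) {
                // nothing useful to do during shutdown
            }
        }));
    }
}
```

As the message notes, a shutdown hook never fires on a hard TM crash, so this 
only alleviates (not solves) the orphaned-directory problem.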




——

Best regards,

Feifan Wang




At 2024-03-28 11:15:11, "Jinzhong Li"  wrote:
>Hi Yun,
>
>Thanks for your reply.
>
>> 1. Why must we have another 'subTask-checkpoint-sub-dir'
>> under the shared directory? if we don't consider making
>> TM ownership in this FLIP, this design seems unnecessary.
>
> Good catch! We will not change the directory layout of shared directory in
>this FLIP. I have already removed this part from this FLIP. I think we
>could revisit this topic in a future FLIP about TM ownership.
>
>> 2. This FLIP forgets to mention the cleanup of the remote
>> working directory in case of the taskmanager crushes,
>> even though this is an open problem, we can still leave
>> some space for future optimization.
>
>Considering that we have plans to merge TM working dir and checkpoint dir
>into one directory, I suggest that we could postpone this topic for now and
>consider it comprehensively combined with the TM ownership file management
>in the future FLIP.
>
>Best,
>Jinzhong
>
>
>
>On Wed, Mar 27, 2024 at 11:49 PM Yun Tang  wrote:
>
>> Hi Jinzhong,
>>
>> The overall design looks good.
>>
>> I have two minor questions:
>>
>> 1. Why must we have another 'subTask-checkpoint-sub-dir' under the shared
>> directory? if we don't consider making TM ownership in this FLIP, this
>> design seems unnecessary.
>> 2. This FLIP forgets to mention the cleanup of the remote working
>> directory in case of the taskmanager crushes, even though this is an open
>> problem, we can still leave some space for future optimization.
>>
>> Best,
>> Yun Tang
>>
>> 
>> From: Jinzhong Li 
>> Sent: Monday, March 25, 2024 10:41
>> To: dev@flink.apache.org 
>> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for
>> Disaggregated State
>>
>> Hi Yue,
>>
>> Thanks for your comments.
>>
>> The CURRENT is a special file that points to the latest manifest log
>> file. As Zakelly explained above, we could record the latest manifest
>> filename during sync phase, and write the filename into CURRENT snapshot
>> file during async phase.
>>
>> Best,
>> Jinzhong
>>
>> On Fri, Mar 22, 2024 at 11:16 PM Zakelly Lan 
>> wrote:
>>
>> > Hi Yue,
>> >
>> > Thanks for bringing this up!
>> >
>> > The CURRENT FILE is the special one, which should be snapshot during the
>> > sync phase (temporary load into memory). Thus we can solve this.
>> >
>> >
>> > Best,
>> > Zakelly
>> >
>> > On Fri, Mar 22, 2024 at 4:55 PM yue ma  wrote:
>> >
>> > > Hi jinzhong,
>> > > Thanks for you reply. I still have some doubts about the first
>> question.
>> > Is
>> > > there such a case
>> > > When you made a snapshot during the synchronization phase, you recorded
>> > the
>> > > current and manifest 8, but before asynchronous phase, the manifest
>> > reached
>> > > the size threshold and then the CURRENT FILE pointed to the new
>> manifest
>> > 9,
>> > > and then uploaded the incorrect CURRENT file ?
>> > >
>> > > On Wed, 20 Mar 2024 at 20:13, Jinzhong Li wrote:
>> > >
>> > > > Hi Yue,
>> > > >
>> > > > Thanks for your feedback!
>> > > >
>> > > > > 1. If we choose Option-3 for ForSt , how would we handle Manifest
>> > File
>> > > > > ? Should we take a snapshot of the Manifest during the
>> > synchronization
>> > > > phase?
>> > > >
>> > > > IIUC, the GetLiveFiles() API in Option-3 can also catch the fileInfo
>> of
>> > > > Manifest files, and this api also return the manifest file size,
>> which
>> > > > means this ap

Re:Re: [DISCUSS] FLIP-427: Disaggregated State Store

2024-03-27 Thread Feifan Wang
Thanks for this valuable proposal Hangxiang !


> If we need to introduce a JNI call during each filesystem call, that would be 
> N times JNI cost compared with the current RocksDB state-backend's JNI cost.
What if we only use JNI to access DFS, which needs to reuse the Flink 
FileSystem, and route all local disk access through the native API? This idea 
is based on the understanding that JNI overhead is negligible compared to DFS 
access latency. It might make more sense to consider avoiding JNI overhead for 
the faster local disks. Since using the local disk as secondary storage is 
already under consideration [1], maybe we can discuss in that FLIP whether to 
use the native API to access the local disk?
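The proposed split — JNI-backed Flink FileSystem for DFS, native file API for 
the local disk — boils down to dispatching on the path's scheme. A toy sketch 
under that assumption (all names hypothetical, not the FLIP's design):

```java
import java.net.URI;

/** Toy sketch: route file access by URI scheme, as proposed above. */
public class AccessRouteSketch {

    enum Route { NATIVE_LOCAL, JNI_FLINK_FILESYSTEM }

    /**
     * Local paths bypass the JNI boundary entirely; remote schemes go
     * through Flink's FileSystem abstraction via a JNI call-back.
     */
    static Route routeFor(URI path) {
        String scheme = path.getScheme();
        if (scheme == null || "file".equals(scheme)) {
            return Route.NATIVE_LOCAL;         // fast path: no JNI crossing
        }
        return Route.JNI_FLINK_FILESYSTEM;     // hdfs://, s3://, oss://, ...
    }
}
```

The rationale is the latency argument from the message: a JNI crossing costs on 
the order of nanoseconds, which disappears next to millisecond-scale DFS I/O but 
is measurable against microsecond-scale local SSD reads.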


>I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>Different disaggregated state storages may have their own semantics about
>this configuration, e.g. life cycle, supported file systems or storages.
I agree with deferring moving this configuration up to the engine level until 
there are other disaggregated backends.


[1] https://cwiki.apache.org/confluence/x/U4p3EQ

——————

Best regards,

Feifan Wang




At 2024-03-28 09:55:48, "Hangxiang Yu"  wrote:
>Hi, Yun.
>Thanks for the reply.
>
>The JNI cost you considered is right. As replied to Yue, I agreed to leave
>space and consider proposal 1 as an optimization in the future, which is
>also updated in the FLIP.
>
>The other question is that the configuration of
>> `state.backend.forSt.working-dir` looks too coupled with the ForSt
>> state-backend, how would it be if we introduce another disaggregated state
>> storage? Thus, I think `state.backend.disaggregated.working-dir` might be a
>> better configuration name.
>
>I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>Different disaggregated state storages may have their own semantics about
>this configuration, e.g. life cycle, supported file systems or storages.
>Maybe it's more suitable to consider it together when we introduce other
>disaggregated state storages in the future.
>
>On Thu, Mar 28, 2024 at 12:02 AM Yun Tang  wrote:
>
>> Hi Hangxiang,
>>
>> The design looks good, and I also support leaving space for proposal 1.
>>
>> As you know, loading index/filter/data blocks for querying across levels
>> would introduce high IO access within the LSM tree for old data. If we need
>> to introduce a JNI call during each filesystem call, that would be N times
>> JNI cost compared with the current RocksDB state-backend's JNI cost.
>>
>> The other question is that the configuration of
>> `state.backend.forSt.working-dir` looks too coupled with the ForSt
>> state-backend, how would it be if we introduce another disaggregated state
>> storage? Thus, I think `state.backend.disaggregated.working-dir` might be a
>> better configuration name.
>>
>>
>> Best
>> Yun Tang
>>
>> 
>> From: Hangxiang Yu 
>> Sent: Wednesday, March 20, 2024 11:32
>> To: dev@flink.apache.org 
>> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store
>>
>> Hi, Yue.
>> Thanks for the reply.
>>
>> If we use proposal1, we can easily reuse these optimizations .It is even
>> > possible to discuss and review the solution together in the Rocksdb
>> > community.
>>
>> We also saw these useful optimizations which could be applied to ForSt in
>> the future.
>> But IIUC, it's not binding to proposal 1, right? We could also
>> implement interfaces about temperature and secondary cache to reuse them,
>> or organize a more complex HybridEnv based on proposal 2.
>>
>> My point is whether we should retain the potential of proposal 1 in the
>> > design.
>> >
>> This is a good suggestion. We choose proposal 2 firstly due to its
>> maintainability and scalability, especially because it could leverage all
>> filesystems flink supported conveniently.
>> Given the indelible advantage in performance, I think we could also
>> consider proposal 1 as an optimization in the future.
>> For the interface on the DB side, we could also expose more different Envs
>> in the future.
>>
>>
>> On Tue, Mar 19, 2024 at 9:14 PM yue ma  wrote:
>>
>> > Hi Hangxiang,
>> >
>> > Thanks for bringing this discussion.
>> > I have a few questions about the Proposal you mentioned in the FLIP.
>> >
>> > The current conclusion is to use proposal 2, which is okay for me. My
>> point
>> > is whether we should retain the potential of proposal 1 in the design.
>> > There are the following reasons:
>> > 1.

Re:Re: [ANNOUNCE] Donation Flink CDC into Apache Flink has Completed

2024-03-26 Thread Feifan Wang
Congratulations !


在 2024-03-22 12:04:39,"Hangxiang Yu"  写道:
>Congratulations!
>Thanks for the efforts.
>
>On Fri, Mar 22, 2024 at 10:00 AM Yanfei Lei  wrote:
>
>> Congratulations!
>>
>> Best regards,
>> Yanfei
>>
>> Xuannan Su  于2024年3月22日周五 09:21写道:
>> >
>> > Congratulations!
>> >
>> > Best regards,
>> > Xuannan
>> >
>> > On Fri, Mar 22, 2024 at 9:17 AM Charles Zhang 
>> wrote:
>> > >
>> > > Congratulations!
>> > >
>> > > Best wishes,
>> > > Charles Zhang
>> > > from Apache InLong
>> > >
>> > >
>> > > Jeyhun Karimov  于2024年3月22日周五 04:16写道:
>> > >
>> > > > Great news! Congratulations!
>> > > >
>> > > > Regards,
>> > > > Jeyhun
>> > > >
>> > > > On Thu, Mar 21, 2024 at 2:00 PM Yuxin Tan 
>> wrote:
>> > > >
>> > > > > Congratulations! Thanks for the efforts.
>> > > > >
>> > > > >
>> > > > > Best,
>> > > > > Yuxin
>> > > > >
>> > > > >
>> > > > > Samrat Deb  于2024年3月21日周四 20:28写道:
>> > > > >
>> > > > > > Congratulations !
>> > > > > >
>> > > > > > Bests
>> > > > > > Samrat
>> > > > > >
>> > > > > > On Thu, 21 Mar 2024 at 5:52 PM, Ahmed Hamdy <
>> hamdy10...@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > > Congratulations, great work and great news.
>> > > > > > > Best Regards
>> > > > > > > Ahmed Hamdy
>> > > > > > >
>> > > > > > >
>> > > > > > > On Thu, 21 Mar 2024 at 11:41, Benchao Li > >
>> > > > wrote:
>> > > > > > >
>> > > > > > > > Congratulations, and thanks for the great work!
>> > > > > > > >
>> > > > > > > > Yuan Mei  于2024年3月21日周四 18:31写道:
>> > > > > > > > >
>> > > > > > > > > Thanks for driving these efforts!
>> > > > > > > > >
>> > > > > > > > > Congratulations
>> > > > > > > > >
>> > > > > > > > > Best
>> > > > > > > > > Yuan
>> > > > > > > > >
>> > > > > > > > > On Thu, Mar 21, 2024 at 4:35 PM Yu Li 
>> wrote:
>> > > > > > > > >
>> > > > > > > > > > Congratulations and look forward to its further
>> development!
>> > > > > > > > > >
>> > > > > > > > > > Best Regards,
>> > > > > > > > > > Yu
>> > > > > > > > > >
>> > > > > > > > > > On Thu, 21 Mar 2024 at 15:54, ConradJam <
>> jam.gz...@gmail.com>
>> > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > Congrattulations!
>> > > > > > > > > > >
>> > > > > > > > > > > Leonard Xu  于2024年3月20日周三 21:36写道:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi devs and users,
>> > > > > > > > > > > >
>> > > > > > > > > > > > We are thrilled to announce that the donation of
>> Flink CDC
>> > > > > as a
>> > > > > > > > > > > > sub-project of Apache Flink has completed. We invite
>> you to
>> > > > > > > explore
>> > > > > > > > > > the new
>> > > > > > > > > > > > resources available:
>> > > > > > > > > > > >
>> > > > > > > > > > > > - GitHub Repository:
>> https://github.com/apache/flink-cdc
>> > > > > > > > > > > > - Flink CDC Documentation:
>> > > > > > > > > > > >
>> https://nightlies.apache.org/flink/flink-cdc-docs-stable
>> > > > > > > > > > > >
>> > > > > > > > > > > > After Flink community accepted this donation[1], we
>> have
>> > > > > > > completed
>> > > > > > > > > > > > software copyright signing, code repo migration, code
>> > > > > cleanup,
>> > > > > > > > website
>> > > > > > > > > > > > migration, CI migration and github issues migration
>> etc.
>> > > > > > > > > > > > Here I am particularly grateful to Hang Ruan,
>> Zhongqaing
>> > > > > Gong,
>> > > > > > > > > > Qingsheng
>> > > > > > > > > > > > Ren, Jiabao Sun, LvYanquan, loserwang1024 and other
>> > > > > > contributors
>> > > > > > > > for
>> > > > > > > > > > their
>> > > > > > > > > > > > contributions and help during this process!
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > For all previous contributors: The contribution
>> process has
>> > > > > > > > slightly
>> > > > > > > > > > > > changed to align with the main Flink project. To
>> report
>> > > > bugs
>> > > > > or
>> > > > > > > > > > suggest new
>> > > > > > > > > > > > features, please open tickets
>> > > > > > > > > > > > Apache Jira (https://issues.apache.org/jira).  Note
>> that
>> > > > we
>> > > > > > will
>> > > > > > > > no
>> > > > > > > > > > > > longer accept GitHub issues for these purposes.
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > Welcome to explore the new repository and
>> documentation.
>> > > > Your
>> > > > > > > > feedback
>> > > > > > > > > > and
>> > > > > > > > > > > > contributions are invaluable as we continue to
>> improve
>> > > > Flink
>> > > > > > CDC.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks everyone for your support and happy exploring
>> Flink
>> > > > > CDC!
>> > > > > > > > > > > >
>> > > > > > > > > > > > Best,
>> > > > > > > > > > > > Leonard
>> > > > > > > > > > > > [1]
>> > > > > > > >
>> https://lists.apache.org/thread/cw29fhsp99243yfo95xrkw82s5s418ob
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > --
>> > > > > > > > > > > Best
>> > > > > > > > > > >
>> > > > > > > > > > > ConradJam
>> > > > > > > > > >
>> > > > > > > >
>> > > > > > > 

[jira] [Created] (FLINK-33734) Merge unaligned checkpoint

2023-12-03 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-33734:
---

 Summary: Merge unaligned checkpoint 
 Key: FLINK-33734
 URL: https://issues.apache.org/jira/browse/FLINK-33734
 Project: Flink
  Issue Type: Improvement
Reporter: Feifan Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32908) Fix wrong materialization id setting when switching from non-log-based checkpoint to log-based checkpoint

2023-08-22 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-32908:
---

 Summary: Fix wrong materialization id setting when switching from 
non-log-based checkpoint to log-based checkpoint
 Key: FLINK-32908
 URL: https://issues.apache.org/jira/browse/FLINK-32908
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Feifan Wang


The initial materialization ID should be set to the checkpointID when switching 
from non-log-based checkpoint to log-based checkpoint. Currently the initial ID 
is set to 0, which causes the incremental snapshot of the inner backend to go 
wrong.

PTAL [~Yanfei Lei] , [~roman] .





[jira] [Created] (FLINK-32901) Should set initial

2023-08-21 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-32901:
---

 Summary: Should set initial
 Key: FLINK-32901
 URL: https://issues.apache.org/jira/browse/FLINK-32901
 Project: Flink
  Issue Type: Bug
Reporter: Feifan Wang








Re: [ANNOUNCE] New Apache Flink PMC Member - Matthias Pohl

2023-08-07 Thread Feifan Wang
Congrats Matthias!



——
Name: Feifan Wang
Email: zoltar9...@163.com


 Replied Message 
| From | Matthias Pohl |
| Date | 08/7/2023 16:16 |
| To |  |
| Subject | Re: [ANNOUNCE] New Apache Flink PMC Member - Matthias Pohl |
Thanks everyone. :)

On Mon, Aug 7, 2023 at 3:18 AM Andriy Redko  wrote:

Congrats Matthias, well deserved!!

DC> Congrats Matthias!

DC> Very well deserved, thankyou for your continuous, consistent
contributions.
DC> Welcome.

DC> Thanks,
DC> Danny

DC> On Fri, Aug 4, 2023 at 9:30 AM Feng Jin  wrote:

Congratulations, Matthias!

Best regards

Feng

On Fri, Aug 4, 2023 at 4:29 PM weijie guo 
wrote:

Congratulations, Matthias!

Best regards,

Weijie


Wencong Liu  于2023年8月4日周五 15:50写道:

Congratulations, Matthias!

Best,
Wencong Liu

















At 2023-08-04 11:18:00, "Xintong Song" 
wrote:
Hi everyone,

On behalf of the PMC, I'm very happy to announce that Matthias Pohl
has
joined the Flink PMC!

Matthias has been consistently contributing to the project since
Sep
2020,
and became a committer in Dec 2021. He mainly works in Flink's
distributed
coordination and high availability areas. He has worked on many
FLIPs
including FLIP195/270/285. He helped a lot with the release
management,
being one of the Flink 1.17 release managers and also very active
in
Flink
1.18 / 2.0 efforts. He also contributed a lot to improving the
build
stability.

Please join me in congratulating Matthias!

Best,

Xintong (on behalf of the Apache Flink PMC)







Re: [ANNOUNCE] New Apache Flink Committer - Weihua Hu

2023-08-07 Thread Feifan Wang
Congratulations, Weihua!



——
Name: Feifan Wang
Email: zoltar9...@163.com


 Replied Message 
| From | Ming Li |
| Date | 08/7/2023 16:52 |
| To |  |
| Subject | Re: [ANNOUNCE] New Apache Flink Committer - Weihua Hu |
Congratulations, Weihua!

Best,
Ming Li


Matthias Pohl  于2023年8月7日周一 16:18写道:

Congratulations, Weihua.

On Mon, Aug 7, 2023 at 3:50 AM Yuepeng Pan  wrote:




Congratulations, Weihua!

Best,
Yuepeng Pan





在 2023-08-07 09:17:41,"yh z"  写道:
Congratulations, Weihua!

Best,
Yunhong Zheng (Swuferhong)

Runkang He  于2023年8月5日周六 21:34写道:

Congratulations, Weihua!

Best,
Runkang He

Kelu Tao  于2023年8月4日周五 18:21写道:

Congratulations!

On 2023/08/04 08:35:49 Danny Cranmer wrote:
Congrats and welcome to the team, Weihua!

Thanks,
Danny

On Fri, Aug 4, 2023 at 9:30 AM Feng Jin 
wrote:

Congratulations Weihua!

Best regards,

Feng

On Fri, Aug 4, 2023 at 4:28 PM weijie guo <
guoweijieres...@gmail.com

wrote:

Congratulations Weihua!

Best regards,

Weijie


Lijie Wang  于2023年8月4日周五 15:28写道:

Congratulations, Weihua!

Best,
Lijie

yuxia  于2023年8月4日周五 15:14写道:

Congratulations, Weihua!

Best regards,
Yuxia

- 原始邮件 -
发件人: "Yun Tang" 
收件人: "dev" 
发送时间: 星期五, 2023年 8 月 04日 下午 3:05:30
主题: Re: [ANNOUNCE] New Apache Flink Committer - Weihua Hu

Congratulations, Weihua!


Best
Yun Tang

From: Jark Wu 
Sent: Friday, August 4, 2023 15:00
To: dev@flink.apache.org 
Subject: Re: [ANNOUNCE] New Apache Flink Committer -
Weihua
Hu

Congratulations, Weihua!

Best,
Jark

On Fri, 4 Aug 2023 at 14:48, Yuxin Tan <
tanyuxinw...@gmail.com

wrote:

Congratulations Weihua!

Best,
Yuxin


Junrui Lee  于2023年8月4日周五 14:28写道:

Congrats, Weihua!
Best,
Junrui

Geng Biao  于2023年8月4日周五 14:25写道:

Congrats, Weihua!
Best,
Biao Geng

发送自 Outlook for iOS<https://aka.ms/o0ukef>

发件人: 周仁祥 
发送时间: Friday, August 4, 2023 2:23:42 PM
收件人: dev@flink.apache.org 
抄送: Weihua Hu 
主题: Re: [ANNOUNCE] New Apache Flink Committer -
Weihua Hu

Congratulations, Weihua~

2023年8月4日 14:21,Sergey Nuyanzin <
snuyan...@gmail.com>
写道:

Congratulations, Weihua!

On Fri, Aug 4, 2023 at 8:03 AM Chen Zhanghao <
zhanghao.c...@outlook.com>
wrote:

Congratulations, Weihua!

Best,
Zhanghao Chen

发件人: Xintong Song 
发送时间: 2023年8月4日 11:18
收件人: dev 
抄送: Weihua Hu 
主题: [ANNOUNCE] New Apache Flink Committer -
Weihua
Hu

Hi everyone,

On behalf of the PMC, I'm very happy to announce
Weihua
Hu
as
a
new
Flink
Committer!

Weihua has been consistently contributing to the
project
since
May
2022. He
mainly works in Flink's distributed coordination
areas.
He
is
the
main
contributor of FLIP-298 and many other
improvements in
large-scale
job
scheduling and improvements. He is also quite
active
in
mailing
lists,
participating discussions and answering user
questions.

Please join me in congratulating Weihua!

Best,

Xintong (on behalf of the Apache Flink PMC)



--
Best regards,
Sergey















Re: [ANNOUNCE] New Apache Flink Committer - Hangxiang Yu

2023-08-07 Thread Feifan Wang
Congratulations Hangxiang! :) 





——
Name: Feifan Wang
Email: zoltar9...@163.com


 Replied Message 
| From | Mang Zhang |
| Date | 08/7/2023 18:56 |
| To |  |
| Subject | Re:Re: [ANNOUNCE] New Apache Flink Committer - Hangxiang Yu |
Congratulations--

Best regards,
Mang Zhang





在 2023-08-07 18:18:08,"Yuxin Tan"  写道:
Congrats, Hangxiang!

Best,
Yuxin


weijie guo  于2023年8月7日周一 17:59写道:

Congrats, Hangxiang!

Best regards,

Weijie


Biao Geng  于2023年8月7日周一 17:04写道:

Congrats, Hangxiang!
Best,
Biao Geng


发送自 Outlook for iOS<https://aka.ms/o0ukef>

发件人: Qingsheng Ren 
发送时间: Monday, August 7, 2023 4:23:11 PM
收件人: dev@flink.apache.org 
主题: Re: [ANNOUNCE] New Apache Flink Committer - Hangxiang Yu

Congratulations and welcome aboard, Hangxiang!

Best,
Qingsheng

On Mon, Aug 7, 2023 at 4:19 PM Matthias Pohl 
wrote:

Congratulations, Hangxiang! :)

On Mon, Aug 7, 2023 at 10:01 AM Junrui Lee 
wrote:

Congratulations, Hangxiang!

Best,
Junrui

Yun Tang  于2023年8月7日周一 15:19写道:

Congratulations, Hangxiang!

Best
Yun Tang

From: Danny Cranmer 
Sent: Monday, August 7, 2023 15:11
To: dev 
Subject: Re: [ANNOUNCE] New Apache Flink Committer - Hangxiang Yu

Congrats Hangxiang! Welcome to the team.

Danny.

On Mon, 7 Aug 2023, 08:04 Rui Fan, <1996fan...@gmail.com> wrote:

Congratulations Hangxiang!

Best,
Rui

On Mon, Aug 7, 2023 at 2:58 PM Yuan Mei 
wrote:

On behalf of the PMC, I'm happy to announce Hangxiang Yu as a
new
Flink
Committer.

Hangxiang has been active in the Flink community for more than
1.5
years
and has played an important role in developing and maintaining
State
and
Checkpoint related features/components, including Generic
Incremental
Checkpoints (take great efforts to make the feature
prod-ready).
Hangxiang
is also the main driver of the FLIP-263: Resolving schema
compatibility.

Hangxiang is passionate about the Flink community. Besides the
technical
contribution above, he is also actively promoting Flink: talks
about
Generic
Incremental Checkpoints in Flink Forward and Meet-up. Hangxiang
also
spent
a good amount of time supporting users, participating in
Jira/mailing
list
discussions, and reviewing code.

Please join me in congratulating Hangxiang for becoming a Flink
Committer!

Thanks,
Yuan Mei (on behalf of the Flink PMC)









Re: [ANNOUNCE] New Apache Flink Committer - Yanfei Lei

2023-08-07 Thread Feifan Wang
Congratulations Yanfei! :)



——
Name: Feifan Wang
Email: zoltar9...@163.com


 Replied Message 
| From | Matt Wang |
| Date | 08/7/2023 19:40 |
| To | dev@flink.apache.org |
| Subject | Re: [ANNOUNCE] New Apache Flink Committer - Yanfei Lei |
Congratulations Yanfei!


--

Best,
Matt Wang


 Replied Message 
| From | Mang Zhang |
| Date | 08/7/2023 18:56 |
| To |  |
| Subject | Re:Re: [ANNOUNCE] New Apache Flink Committer - Yanfei Lei |
Congratulations--

Best regards,
Mang Zhang





在 2023-08-07 18:17:58,"Yuxin Tan"  写道:
Congrats, Yanfei!

Best,
Yuxin


weijie guo  于2023年8月7日周一 17:59写道:

Congrats, Yanfei!

Best regards,

Weijie


Biao Geng  于2023年8月7日周一 17:03写道:

Congrats, Yanfei!
Best,
Biao Geng

发送自 Outlook for iOS<https://aka.ms/o0ukef>

发件人: Qingsheng Ren 
发送时间: Monday, August 7, 2023 4:23:52 PM
收件人: dev@flink.apache.org 
主题: Re: [ANNOUNCE] New Apache Flink Committer - Yanfei Lei

Congratulations and welcome, Yanfei!

Best,
Qingsheng

On Mon, Aug 7, 2023 at 4:19 PM Matthias Pohl 
wrote:

Congratulations, Yanfei! :)

On Mon, Aug 7, 2023 at 10:00 AM Junrui Lee 
wrote:

Congratulations Yanfei!

Best,
Junrui

Yun Tang  于2023年8月7日周一 15:19写道:

Congratulations, Yanfei!

Best
Yun Tang

From: Danny Cranmer 
Sent: Monday, August 7, 2023 15:10
To: dev 
Subject: Re: [ANNOUNCE] New Apache Flink Committer - Yanfei Lei

Congrats Yanfei! Welcome to the team.

Danny

On Mon, 7 Aug 2023, 08:03 Rui Fan, <1996fan...@gmail.com> wrote:

Congratulations Yanfei!

Best,
Rui

On Mon, Aug 7, 2023 at 2:56 PM Yuan Mei 
wrote:

On behalf of the PMC, I'm happy to announce Yanfei Lei as a new
Flink
Committer.

Yanfei has been active in the Flink community for almost two
years
and
has
played an important role in developing and maintaining State
and
Checkpoint
related features/components, including RocksDB Rescaling
Performance
Improvement and Generic Incremental Checkpoints.

Yanfei also helps improve community infrastructure in many
ways,
including
migrating the Flink Daily performance benchmark to the Apache
Flink
slack
channel. She is the maintainer of the benchmark and has
improved
its
detection stability significantly. She is also one of the major
maintainers
of the FrocksDB Repo and released FRocksDB 6.20.3 (part of
Flink
1.17
release). Yanfei is a very active community member, supporting
users
and
participating
in tons of discussions on the mailing lists.

Please join me in congratulating Yanfei for becoming a Flink
Committer!

Thanks,
Yuan Mei (on behalf of the Flink PMC)









[jira] [Created] (FLINK-32769) PeriodicMaterializationManager pass descriptionFormat with invalid placeholder to MailboxExecutor#execute()

2023-08-07 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-32769:
---

 Summary: PeriodicMaterializationManager pass descriptionFormat 
with invalid placeholder to MailboxExecutor#execute()
 Key: FLINK-32769
 URL: https://issues.apache.org/jira/browse/FLINK-32769
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Feifan Wang


The descriptionFormat in _MailboxExecutor#execute(ThrowingRunnable command, String descriptionFormat, Object... descriptionArgs)_ is 
passed to _String.format()_, which does not accept placeholders like "{}". But 
PeriodicMaterializationManager passes a descriptionFormat with the invalid 
placeholder "{}".





[jira] [Created] (FLINK-32761) Use md5 sum of PhysicalStateHandleID as SharedStateRegistryKey ChangelogStateHandleStreamImpl

2023-08-05 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-32761:
---

 Summary: Use md5 sum of PhysicalStateHandleID as 
SharedStateRegistryKey ChangelogStateHandleStreamImpl
 Key: FLINK-32761
 URL: https://issues.apache.org/jira/browse/FLINK-32761
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Reporter: Feifan Wang


_ChangelogStateHandleStreamImpl#getKey()_ uses 
_System.identityHashCode(stateHandle)_ as the _SharedStateRegistryKey_ when the 
stateHandle is neither a _FileStateHandle_ nor a {_}ByteStreamStateHandle{_}. 
That can easily lead to collisions, although with the current code paths it 
only affects the test code.

In FLINK-29913, we used the md5 sum of the PhysicalStateHandleID as the 
SharedStateRegistryKey in IncrementalRemoteKeyedStateHandle; we can reuse this 
approach in ChangelogStateHandleStreamImpl.





Re: [ANNOUNCE] New Apache Flink Committer - Yong Fang

2023-07-25 Thread Feifan Wang
Congratulations, Yong Fang!

Best,
Feifan Wang


| |
Feifan Wang
|
|
zoltar9...@163.com
|


 Replied Message 
| From | Yuxin Tan |
| Date | 07/26/2023 10:25 |
| To |  |
| Subject | Re: Re: [ANNOUNCE] New Apache Flink Committer - Yong Fang |
Congratulations, Yong Fang!

Best,
Yuxin


Yanfei Lei  于2023年7月26日周三 10:18写道:

Congratulations!

Best regards,
Yanfei

weijie guo  于2023年7月26日周三 10:10写道:

Congrats, Yong Fang!

Best regards,

Weijie


Danny Cranmer  于2023年7月26日周三 03:34写道:

Congrats and welcome!

Danny.

On Tue, 25 Jul 2023, 16:48 Matthias Pohl, 
wrote:

Congratulations :)

On Tue, Jul 25, 2023 at 5:13 PM Jing Ge 
wrote:

Congrats, Yong Fang!

Best regards,
Jing

On Tue, Jul 25, 2023 at 7:35 PM Yu Li  wrote:

Congrats, Yong!

Best Regards,
Yu


On Tue, 25 Jul 2023 at 18:03, Sergey Nuyanzin <
snuyan...@gmail.com>
wrote:

Congratulations, Yong Fang!

On Tue, Jul 25, 2023 at 7:53 AM ConradJam  于2023年7月25日周二 12:08写道:

Congratulations, Yong Fang!


--

Best regards,
Mang Zhang





在 2023-07-25 10:30:24,"Jark Wu"  写道:
Congratulations, Yong Fang!

Best,
Jark

On Mon, 24 Jul 2023 at 22:11, Wencong Liu <
liuwencle...@163.com

wrote:

Congratulations!

Best,
Wencong Liu















在 2023-07-24 11:03:30,"Paul Lam" 

Re: [DISCUSS] FLIP 333 - Redesign Apache Flink website

2023-07-11 Thread Feifan Wang
+1 , the new design looks more attractive and is well organized

|
|
Feifan Wang
|
|
zoltar9...@163.com
|


 Replied Message 
| From | Leonard Xu |
| Date | 07/11/2023 16:34 |
| To | dev |
| Subject | Re: [DISCUSS] FLIP 333 - Redesign Apache Flink website |
+1 for the redesigning, the new website looks cool.


Best,
Leonard

On Jul 11, 2023, at 7:55 AM, Mohan, Deepthi  wrote:

Hi,

I’m opening this thread to discuss a proposal to redesign the Apache Flink 
website: https://flink.apache.org. The approach and a few initial mockups are 
included in FLIP 333 - Redesign Apache Flink 
website.<https://cwiki.apache.org/confluence/display/FLINK/FLIP-333%3A+Redesign+Apache+Flink+website>

The goal is to modernize the website design to help existing and new users 
easily understand Flink’s value proposition and make Flink attractive to new 
users. As suggested in a previous thread, there are no proposed changes to 
Flink documentation.

I look forward to your feedback and the discussion.

Thanks,
Deepthi




[jira] [Created] (FLINK-32141) SharedStateRegistry print too much info log

2023-05-20 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-32141:
---

 Summary: SharedStateRegistry print too much info log
 Key: FLINK-32141
 URL: https://issues.apache.org/jira/browse/FLINK-32141
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.17.0
Reporter: Feifan Wang
 Fix For: 1.17.1
 Attachments: image-2023-05-21-00-26-20-026.png

FLINK-29095 added some logs to SharedStateRegistry for troubleshooting. Among 
them, an info log is printed when the newHandle is equal to the registered one:

[https://github.com/apache/flink/blob/release-1.17.0/flink-runtime/src/main/java/org/apache/flink/runtime/state/SharedStateRegistryImpl.java#L117]

!image-2023-05-21-00-26-20-026.png|width=775,height=126!

But this case cannot be considered a possible bug, because 
FsStateChangelogStorage directly uses the FileStateHandle of the previous 
checkpoint instead of a PlaceholderStreamStateHandle.

In our tests, JobManager printed so much of this log that useful information 
was overwhelmed.

So I suggest changing this log level to trace. WDYT [~Yanfei Lei], [~klion26] ?





[jira] [Created] (FLINK-32135) FRocksDB fix log file create failed caused by file name too long

2023-05-19 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-32135:
---

 Summary: FRocksDB fix log file create failed caused by file name 
too long
 Key: FLINK-32135
 URL: https://issues.apache.org/jira/browse/FLINK-32135
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Feifan Wang


RocksDB uses the instance path as the log file name when a log path is 
specified, but if the instance path is so long that it exceeds the filesystem's 
limit, log file creation will fail.

We disabled log relocating when the RocksDB instance path is too long in 
FLINK-31743, but that's just a hotfix. This ticket proposes fixing the problem 
in FRocksDB.





[jira] [Created] (FLINK-32130) previous checkpoint will be broken by the subsequent incremental checkpoint

2023-05-18 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-32130:
---

 Summary: previous checkpoint will be broken by the subsequent 
incremental checkpoint
 Key: FLINK-32130
 URL: https://issues.apache.org/jira/browse/FLINK-32130
 Project: Flink
  Issue Type: Bug
Reporter: Feifan Wang


Currently, _SharedStateRegistryImpl_ discards the old entry when a new state is 
registered under the same key:
{code:java}
// Old entry is not in a confirmed checkpoint yet, and the new one differs.
// This might result from (omitted KG range here for simplicity):
// 1. Flink recovers from a failure using a checkpoint 1
// 2. State Backend is initialized to UID xyz and a set of SST: { 01.sst }
// 3. JM triggers checkpoint 2
// 4. TM sends handle: "xyz-002.sst"; JM registers it under "xyz-002.sst"
// 5. TM crashes; everything is repeated from (2)
// 6. TM recovers from CP 1 again: backend UID "xyz", SST { 01.sst }
// 7. JM triggers checkpoint 3
// 8. TM sends NEW state "xyz-002.sst"
// 9. JM discards it as duplicate
// 10. checkpoint completes, but a wrong SST file is used
// So we use a new entry and discard the old one:
LOG.info(
"Duplicated registration under key {} of a new state: {}. "
+ "This might happen during the task failover if state backend 
creates different states with the same key before and after the failure. "
+ "Discarding the OLD state and keeping the NEW one which is 
included into a completed checkpoint",
registrationKey,
newHandle);
scheduledStateDeletion = entry.stateHandle;
entry.stateHandle = newHandle; {code}
But if _execution.checkpointing.max-concurrent-checkpoints_ > 1, the following 
case will fail (taking _RocksDBStateBackend_ as an example):
 # cp1 triggers: 1.sst is uploaded to file-1 and registered as <1.sst,file-1>; 
cp1 references file-1
 # before cp1 completes, cp2 triggers: 1.sst is uploaded to file-2 and 
<1.sst,file-2> is registered; SharedStateRegistry discards file-1
 # cp1 completes and cp2 fails, but cp1 is broken (file-1 has been deleted)

I think we should allow registering multiple state objects under the same key. 
WDYT [~pnowojski], [~roman] ?
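A minimal sketch of that idea (class and method names here are hypothetical, not the actual SharedStateRegistryImpl API): the registry keeps every handle registered under a key instead of discarding the old one, so the still-pending cp1 keeps file-1 alive:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a registry that retains all handles per key so a
// concurrent checkpoint registering file-2 cannot invalidate file-1,
// which an earlier, still-pending checkpoint references.
public class MultiHandleRegistrySketch {
    private final Map<String, List<String>> entries = new HashMap<>();

    String register(String key, String handle) {
        // nothing is scheduled for deletion on a duplicate key
        entries.computeIfAbsent(key, k -> new ArrayList<>()).add(handle);
        return handle;
    }

    List<String> handlesFor(String key) {
        return entries.getOrDefault(key, List.of());
    }

    public static void main(String[] args) {
        MultiHandleRegistrySketch r = new MultiHandleRegistrySketch();
        r.register("1.sst", "file-1"); // cp1
        r.register("1.sst", "file-2"); // concurrent cp2
        System.out.println(r.handlesFor("1.sst")); // [file-1, file-2]
    }
}
```

A real implementation would of course still need reference counting per handle so that files are eventually discarded once no checkpoint references them.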





[jira] [Created] (FLINK-31900) Fix some typo in java doc, comments and assertion message

2023-04-23 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-31900:
---

 Summary: Fix some typo in java doc, comments and assertion message
 Key: FLINK-31900
 URL: https://issues.apache.org/jira/browse/FLINK-31900
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Reporter: Feifan Wang


As the title.





[jira] [Created] (FLINK-31414) exceptions in the alignment timer are ignored

2023-03-13 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-31414:
---

 Summary: exceptions in the alignment timer are ignored
 Key: FLINK-31414
 URL: https://issues.apache.org/jira/browse/FLINK-31414
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Reporter: Feifan Wang


The alignment timer task in alternating aligned checkpoints runs as a future 
task in the mailbox thread, so exceptions thrown there 
([SingleCheckpointBarrierHandler#registerAlignmentTimer()|https://github.com/apache/flink/blob/65ab8e820a3714d2134dfb4c9772a10c998bd45a/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/checkpointing/SingleCheckpointBarrierHandler.java#L327])
 are ignored.

see : 
[BarrierAlignmentUtil#createRegisterTimerCallback()|https://github.com/apache/flink/blob/65ab8e820a3714d2134dfb4c9772a10c998bd45a/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/checkpointing/BarrierAlignmentUtil.java#L50]
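The failure mode can be illustrated with a plain future (a simplified stand-in, not the actual Flink code): the exception is captured by the future object and never surfaces unless someone inspects the future:

```java
import java.util.concurrent.CompletableFuture;

// Minimal illustration: an exception thrown inside a task wrapped in a
// future is captured by the future rather than rethrown on any thread,
// so it is silently lost unless the future's outcome is inspected.
public class SwallowedExceptionDemo {
    static CompletableFuture<Void> runTimerTask() {
        return CompletableFuture.runAsync(() -> {
            throw new IllegalStateException("alignment timer failed");
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Void> f = runTimerTask();
        f.handle((v, t) -> null).join(); // wait for completion, ignore outcome
        System.out.println("exception captured: " + f.isCompletedExceptionally());
        // prints: exception captured: true
    }
}
```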





[jira] [Created] (FLINK-31139) not upload empty state changelog file

2023-02-20 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-31139:
---

 Summary: not upload empty state changelog file
 Key: FLINK-31139
 URL: https://issues.apache.org/jira/browse/FLINK-31139
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Feifan Wang
 Fix For: 1.16.2
 Attachments: image-2023-02-20-19-51-34-397.png

h1. Problem

*_BatchingStateChangeUploadScheduler_* uploads many empty changelog files 
(file size == 1, containing only the compressed flag).

!image-2023-02-20-19-51-34-397.png|width=1062,height=188!

These files are not referenced by any checkpoint, are never cleaned up, and 
grow in number as the job runs. Taking one of our big jobs as an example, 2292 
such files were generated within 7 hours; at that rate, the number of files in 
the changelog directory will exceed a million in about 4 months.
h1. Problem causes

This problem is caused by *_BatchingStateChangeUploadScheduler#drainAndSave_* 
not checking whether the task collection is empty. The data in the scheduled 
queue may have already been uploaded by the time 
_*BatchingStateChangeUploadScheduler#drainAndSave*_ is executed.

 

So we should check whether the task collection is empty in 
*_BatchingStateChangeUploadScheduler#drainAndSave_* . WDYT [~roman] , [~Yanfei 
Lei] ?
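A minimal sketch of the proposed check (names are hypothetical; this is not the actual scheduler code): skip the upload entirely when draining yields no tasks, so no empty file is ever written:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Queue;

// Hypothetical sketch of the fix: drain the scheduled queue, but return
// early when nothing was drained instead of writing an empty changelog file.
public class DrainAndSaveSketch {
    static final Queue<String> scheduled = new ArrayDeque<>();
    static int uploads = 0;

    static void drainAndSave() {
        Collection<String> tasks = new ArrayList<>();
        while (!scheduled.isEmpty()) {
            tasks.add(scheduled.poll());
        }
        if (tasks.isEmpty()) {
            return; // proposed check: nothing to upload, write no file
        }
        uploads++; // stand-in for actually uploading the drained changes
    }

    public static void main(String[] args) {
        drainAndSave();           // queue already drained elsewhere -> no upload
        scheduled.add("change");
        drainAndSave();           // real work -> one upload
        System.out.println(uploads); // prints: 1
    }
}
```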





[jira] [Created] (FLINK-30792) clean up not uploaded state changes after materialization complete

2023-01-25 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-30792:
---

 Summary: clean up not uploaded state changes after materialization 
complete
 Key: FLINK-30792
 URL: https://issues.apache.org/jira/browse/FLINK-30792
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.16.0
Reporter: Feifan Wang


We should clean up state changes that have not been uploaded after 
materialization completes; otherwise (the status quo):
 # subsequent checkpoints contain wrong state changes from before the completed 
materialization
 # a FileNotFound exception may occur when recovering from such a problematic 
checkpoint, because the state change files from before the completed 
materialization may have been deleted when the checkpoint was subsumed.

Since state changes from before the completed materialization in 
FsStateChangelogWriter#notUploaded will not be used in any subsequent 
checkpoint, I suggest cleaning them up while handling the materialization 
result.

What do you think about this ? [~ym] , [~roman] 
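A minimal sketch of the suggested cleanup (names are hypothetical; the real FsStateChangelogWriter tracks richer state): when a materialization up to some sequence number completes, drop the pending changes strictly below it:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: pending (not yet uploaded) state changes keyed by
// sequence number. Once materialization covers everything below `upTo`,
// those entries can never be needed by a later checkpoint, so drop them.
public class MaterializationCleanupSketch {
    final NavigableMap<Long, String> notUploaded = new TreeMap<>();

    void handleMaterializationResult(long upTo) {
        // proposed cleanup: remove all changes with sequence number < upTo
        notUploaded.headMap(upTo, false).clear();
    }

    public static void main(String[] args) {
        MaterializationCleanupSketch w = new MaterializationCleanupSketch();
        w.notUploaded.put(1L, "changeA");
        w.notUploaded.put(5L, "changeB");
        w.handleMaterializationResult(3L); // materialized everything below 3
        System.out.println(w.notUploaded.keySet()); // prints: [5]
    }
}
```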





[jira] [Created] (FLINK-30561) ChangelogStreamHandleReaderWithCache causes FileNotFoundException

2023-01-04 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-30561:
---

 Summary: ChangelogStreamHandleReaderWithCache causes 
FileNotFoundException
 Key: FLINK-30561
 URL: https://issues.apache.org/jira/browse/FLINK-30561
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.16.0
Reporter: Feifan Wang


When a job with state changelog enabled restarts repeatedly, the following 
exception may occur:
{code:java}
java.lang.RuntimeException: java.io.FileNotFoundException: /data1/hadoop/yarn/nm-local-dir/usercache/hadoop-rt/appcache/application_1671689962742_192/dstl-cache-file/dstl6215344559415829831.tmp (No such file or directory)
    at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321)
    at org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:87)
    at org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.hasNext(StateChangelogHandleStreamHandleReader.java:69)
    at org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.readBackendHandle(ChangelogBackendRestoreOperation.java:107)
    at org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.restore(ChangelogBackendRestoreOperation.java:78)
    at org.apache.flink.state.changelog.ChangelogStateBackend.restore(ChangelogStateBackend.java:94)
    at org.apache.flink.state.changelog.AbstractChangelogStateBackend.createKeyedStateBackend(AbstractChangelogStateBackend.java:136)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:336)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:353)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:165)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:265)
    at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:726)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:702)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:669)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:904)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: /data1/hadoop/yarn/nm-local-dir/usercache/hadoop-rt/appcache/application_1671689962742_192/dstl-cache-file/dstl6215344559415829831.tmp (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.flink.changelog.fs.ChangelogStreamHandleReaderWithCache.openAndSeek(ChangelogStreamHandleReaderWithCache.java:158)
    at org.apache.flink.changelog.fs.ChangelogStreamHandleReaderWithCache.openAndSeek(ChangelogStreamHandleReaderWithCache.java:95)
    at org.apache.flink.changelog.fs.StateChangeIteratorImpl.read(StateChangeIteratorImpl.java:42)
    at org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:85)
    ... 21 more {code}
*Problem causes:*
 # *_ChangelogStreamHandleReaderWithCache_* uses RefCountedFile to manage the 
local cache file. The reference count is incremented when an input stream is 
opened from the cache file, and decremented by one when the input stream is 
closed. So the input stream must be closed exactly once.
 # _*StateChangelogHandleStreamHandleReader#getChanges()*_ may cause the input 
stream to be closed twice. This happens when changeIterator.read(tuple2.f0, 
tuple2.f1) throws an exception (for example, when the task is canceled for 
other reasons during the restore process): the current state change iterator 
is then closed twice.

{code:java}
private void advance() {
while (!current.h
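
One way to make a double close harmless, regardless of how advance() unwinds, is an idempotent close guard. The sketch below is illustrative only and is not Flink's actual fix; CloseOnceStream and releaseRef are hypothetical names.

```java
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: the reference count is released at most once,
// even if close() is called twice during exception unwinding.
class CloseOnceStream extends InputStream {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final Runnable releaseRef; // e.g. the RefCountedFile decrement

    CloseOnceStream(Runnable releaseRef) {
        this.releaseRef = releaseRef;
    }

    @Override
    public int read() {
        return -1; // stream contents elided; only the close path matters here
    }

    @Override
    public void close() {
        if (closed.compareAndSet(false, true)) {
            releaseRef.run(); // runs exactly once
        }
    }

    public static void main(String[] args) {
        int[] refCount = {1};
        CloseOnceStream s = new CloseOnceStream(() -> refCount[0]--);
        s.close();
        s.close(); // second close is a no-op
        System.out.println(refCount[0]); // 0, not -1
    }
}
```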

[jira] [Created] (FLINK-29822) Fix wrong description in comments of StreamExecutionEnvironment#setMaxParallelism()

2022-10-31 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-29822:
---

 Summary: Fix wrong description in comments of 
StreamExecutionEnvironment#setMaxParallelism()
 Key: FLINK-29822
 URL: https://issues.apache.org/jira/browse/FLINK-29822
 Project: Flink
  Issue Type: Bug
Reporter: Feifan Wang


The upper limit (inclusive) of max parallelism is Short.MAX_VALUE + 1, not 
Short.MAX_VALUE.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29526) Java doc mistake in SequenceNumberRange#contains()

2022-10-05 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-29526:
---

 Summary: Java doc mistake in SequenceNumberRange#contains()
 Key: FLINK-29526
 URL: https://issues.apache.org/jira/browse/FLINK-29526
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Feifan Wang
 Attachments: image-2022-10-06-10-50-16-927.png

!image-2022-10-06-10-50-16-927.png|width=554,height=106!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29244) Add metric lastMaterializationDuration to ChangelogMaterializationMetricGroup

2022-09-09 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-29244:
---

 Summary: Add metric lastMaterializationDuration to  
ChangelogMaterializationMetricGroup
 Key: FLINK-29244
 URL: https://issues.apache.org/jira/browse/FLINK-29244
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Reporter: Feifan Wang


Materialization duration can help us evaluate the efficiency of materialization 
and the impact on the job.

 

What do you think? [~roman] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re:[DISCUSS] Introduce multi delete API to Flink's FileSystem class

2022-06-30 Thread Feifan Wang
Thanks a lot for the proposal  @Yun Tang ! It sounds great and I can't find any 
reason not to make this improvement.


——
Name: Feifan Wang
Email: zoltar9...@163.com


 Replied Message 
| From | Yun Tang |
| Date | 06/30/2022 16:56 |
| To | dev@flink.apache.org |
| Subject | [DISCUSS] Introduce multi delete API to Flink's FileSystem class |
Hi guys,

As more and more teams move to cloud-based environments. Cloud object storage 
has become the factual technical standard for big data ecosystems.
From our experience, the performance of writing/deleting objects in object 
storage can vary per call. The FLIP for the changelog state backend ran 
experiments to verify the performance of writing the same data multiple times 
[1], and they show that p999 latency can be 8x the p50 latency. This is also 
true for delete operations.

Currently, after the introduction of the checkpoint backpressure mechanism [2], 
a newly triggered checkpoint can be delayed when old checkpoints are not 
cleaned up fast enough [3].
Moreover, Flink's checkpoint cleanup mechanism cannot leverage a delete-folder 
API to speed up the procedure with incremental checkpoints [4].
This is extremely obvious in cloud object storage, and all most all object 
storage SDKs have multi-delete API to accelerate the performance, e.g. AWS S3 
[5], Aliyun OSS [6], and Tencentyun COS [7].
A simple experiment shows that deleting 1000 objects with each 5MB size, will 
cost 39494ms with for-loop single delete operations, and the result will drop 
to 1347ms if using multi-delete API in Tencent Cloud.

However, Flink's FileSystem API follows HDFS's FileSystem API and lacks such a 
multi-delete API, which is somewhat outdated in today's cloud-based 
environments.
Thus I suggest adding such a multi-delete API to Flink's FileSystem [8] class; 
file systems that do not support multi-delete will fall back to a for-loop of 
single deletes.
By doing so, we can at least accelerate the speed of discarding checkpoints in 
cloud environments.
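
The fallback behavior could look roughly like the sketch below. The interface and the method name deleteAll are hypothetical shapes for the proposal, not Flink's actual FileSystem API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical API shape: a default method falls back to a for-loop of
// single deletes; object-store implementations (S3, OSS, COS) would
// override it with their native multi-delete call.
interface MultiDeleteFileSystem {
    boolean delete(Path path) throws IOException;

    default void deleteAll(List<Path> paths) throws IOException {
        for (Path p : paths) {
            delete(p); // single-delete fallback
        }
    }
}

// A trivial local-filesystem implementation that uses the fallback.
class LocalFs implements MultiDeleteFileSystem {
    @Override
    public boolean delete(Path path) throws IOException {
        return Files.deleteIfExists(path);
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("multidelete", ".tmp");
        Path b = Files.createTempFile("multidelete", ".tmp");
        new LocalFs().deleteAll(List.of(a, b));
        System.out.println(Files.exists(a) || Files.exists(b)); // false
    }
}
```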

WDYT?


[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints#FLIP158:Generalizedincrementalcheckpoints-DFSwritelatency
[2] https://issues.apache.org/jira/browse/FLINK-17073
[3] https://issues.apache.org/jira/browse/FLINK-26590
[4] 
https://github.com/apache/flink/blob/1486fee1acd9cd1e340f6d2007f723abd20294e5/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpoint.java#L315
[5] 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-multiple-objects.html
[6] 
https://www.alibabacloud.com/help/en/object-storage-service/latest/delete-objects-8#section-v6n-zym-tax
[7] 
https://intl.cloud.tencent.com/document/product/436/44018#delete-objects-in-batch
[8] 
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java


Best
Yun Tang



[jira] [Created] (FLINK-28178) Show the delegated StateBackend and whether changelog is enabled in the UI

2022-06-21 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-28178:
---

 Summary: Show the delegated StateBackend and whether changelog is 
enabled in the UI
 Key: FLINK-28178
 URL: https://issues.apache.org/jira/browse/FLINK-28178
 Project: Flink
  Issue Type: Improvement
Reporter: Feifan Wang


If changelog is enabled, the StateBackend shown in the Web UI is always 
'ChangelogStateBackend'. I think ChangelogStateBackend should not be exposed to 
users; we should show the delegated StateBackend in this place. We should also 
add a row indicating whether changelog is enabled.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-28172) Scatter dstl files into separate directories by job id

2022-06-21 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-28172:
---

 Summary: Scatter dstl files into separate directories by job id
 Key: FLINK-28172
 URL: https://issues.apache.org/jira/browse/FLINK-28172
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Affects Versions: 1.15.0
Reporter: Feifan Wang


In the current implementation of {_}FsStateChangelogStorage{_}, dstl files from 
all jobs are put into the same directory (configured via 
{_}dstl.dfs.base-path{_}). Everything is fine if it is an object store like 
S3. But if it is a file system like HDFS, there are some problems.

First, there may be an upper limit to the number of files in a single 
directory. Increasing this threshold will greatly reduce the performance of the 
distributed file system.

Second, dstl file management becomes difficult because the user cannot tell 
which job a dstl file belongs to, especially when retained checkpoints are 
enabled.
h3. Propose
 # create a subdirectory named with the job id under the _dstl.dfs.base-path_ 
directory when the job starts
 # all dstl files upload to the subdirectory

( Going a step further, we can even create two levels of subdirectories under 
the _dstl.dfs.base-path_ directory, like _base-path/\{jobId}/dstl ._ This way, 
if the user configures the same dstl.dfs.base-path as state.checkpoints.dir, 
all files needed for job recovery will be in the same directory and well 
organized. )
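
The proposed layout amounts to the following path construction. This is a sketch with illustrative names, using java.nio paths rather than Flink's own Path class.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: one subdirectory per job id under dstl.dfs.base-path, with the
// optional "dstl" level below it as described in the proposal.
class DstlPaths {
    static Path dstlDirFor(Path basePath, String jobId) {
        return basePath.resolve(jobId).resolve("dstl");
    }

    public static void main(String[] args) {
        Path dir = dstlDirFor(Paths.get("/checkpoints"), "1b080b6e710aabbef8993ab18c6de98b");
        System.out.println(dir); // /checkpoints/1b080b6e710aabbef8993ab18c6de98b/dstl
    }
}
```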



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [ANNOUNCE] New Apache Flink Committers: Qingsheng Ren, Shengkai Fang

2022-06-20 Thread Feifan Wang
Congratulations, Qingsheng and ShengKai.



Best,
Feifan


——
Name: Feifan Wang
Email: zoltar9...@163.com


 Replied Message 
| From | Zakelly Lan |
| Date | 06/20/2022 16:51 |
| To |  |
| Subject | Re: [ANNOUNCE] New Apache Flink Committers: Qingsheng Ren, Shengkai 
Fang |
Congrats, Qingsheng and Shengkai.

Best,
Zakelly

On Mon, Jun 20, 2022 at 4:50 PM Geng Biao  wrote:

Congrats to Qingsheng and Shengkai!
Best,
Biao Geng

Get Outlook for iOS<https://aka.ms/o0ukef>

From: Zhilong Hong 
Sent: Monday, June 20, 2022 4:05:48 PM
To: dev@flink.apache.org 
Cc: Qingsheng Ren ; Shengkai Fang 
Subject: Re: [ANNOUNCE] New Apache Flink Committers: Qingsheng Ren, Shengkai
Fang

Congratulations, Qingsheng and ShengKai!

Best,
Zhilong.

On Mon, Jun 20, 2022 at 4:00 PM Lijie Wang 
wrote:

Congratulations, Qingsheng and ShengKai.

Best,
Lijie

Paul Lam  wrote on Mon, Jun 20, 2022 at 15:58:

Congrats, Qingsheng and Shengkai!

Best,
Paul Lam

On Jun 20, 2022 at 15:57, Martijn Visser  wrote:

Congratulations to both of you, this is very much deserved!

On Mon, Jun 20, 2022 at 09:53, Jark Wu  wrote:

Hi everyone,

On behalf of the PMC, I'm very happy to announce two new Flink
committers:
Qingsheng Ren and Shengkai Fang.

Qingsheng is the core contributor and maintainer of the Kafka
connector.
He continuously improved the existing connectors, debugged many
connector
testability issues, and worked on the connector testing framework.
Recently,
he is driving the work of FLIP-221 (caching lookup connector), which
is
crucial for SQL connectors.

Shengkai has been continuously contributing to the Flink project for
two
years.
He mainly works on Flink SQL parts and drives several important
FLIPs,
e.g. FLIP-149 (upsert-kafka), FLIP-163 (SQL CLI Improvements),
FLIP-91 (SQL Gateway), and FLIP-223 (HiveServer2 Endpoint).
He is very active and helps many users on the mailing list.

Please join me in welcoming them as committers!

Cheers,
Jark Wu







Re: [DISCUSS ] Make state.backend.incremental as true by default

2022-06-15 Thread Feifan Wang
Thanks for bringing this up.
Strongly +1




——
Name: Feifan Wang
Email: zoltar9...@163.com


 Replied Message 
| From | Yuan Mei |
| Date | 06/15/2022 11:41 |
| To | dev ,
 |
| Subject | Re: [DISCUSS ] Make state.backend.incremental as true by default |
Thanks for bringing this up.

I am +1 on making incremental checkpoints by default for RocksDB, but not
universally for all state backends.

Besides being widely used in prod, enabling incremental checkpoint for
RocksDB by default is also a pre-requisite when enabling task-local by
default FLINK-15507 <https://issues.apache.org/jira/browse/FLINK-15507>

The incremental checkpoint for the hashmap statebackend is under review
right now. CC @ro...@ververica.com  , which is not a
good idea being enabled by default in the first version.

Best,

Yuan

On Tue, Jun 14, 2022 at 7:33 PM Jiangang Liu 
wrote:

+1 for the suggestion. We have use the incremental checkpoint in our
production for a long time.

Hangxiang Yu  wrote on Tue, Jun 14, 2022 at 15:41:

+1
It's basically enabled in most scenarios in production environments.
For HashMapStateBackend, it will adopt a full checkpoint even if we
enable
incremental checkpoint. It will also support incremental checkpoint after
[1]. It's compatible.
BTW, I think we may also need to improve the documentation of incremental
checkpoints which users usually ask. There are some tickets like [2][3].

Best,
Hangxiang.

[1] https://issues.apache.org/jira/browse/FLINK-21648
[2] https://issues.apache.org/jira/browse/FLINK-22797
[3] https://issues.apache.org/jira/browse/FLINK-7449

On Mon, Jun 13, 2022 at 7:48 PM Rui Fan <1996fan...@gmail.com> wrote:

Strongly +1

Best,
Rui Fan

On Mon, Jun 13, 2022 at 7:35 PM Martijn Visser <
martijnvis...@apache.org

wrote:

BTW, from my knowledge, nothing would happen for
HashMapStateBackend,
which does not support incremental checkpoint yet, when enabling
incremental checkpoints.

Thanks Yun, if no errors would occur then definitely +1 to enable it
by
default

On Mon, Jun 13, 2022 at 12:42, Alexander Fedulov <
alexan...@ververica.com> wrote:

+1

From my experience, it is actually hard to come up with use cases
where
incremental checkpoints should explicitly not be enabled with the
RocksDB
state backend. If the state is so small that the full snapshots do
not
have any negative impact, one should consider using
HashMapStateBackend
anyway.

Best,
Alexander Fedulov


On Mon, Jun 13, 2022 at 12:26 PM Jing Ge 
wrote:

+1

Glad to see the kickoff of this discussion. Thanks Lihe for
driving
this!

We have actually already discussed it internally a few months
ago.
After
considering some corner cases, all agreed on enabling the
incremental
checkpoint as default.

Best regards,
Jing

On Mon, Jun 13, 2022 at 12:17 PM Yun Tang 
wrote:

Strongly +1 for making incremental checkpoints as default. Many
users
have
ever been asking why this configuration is not enabled by
default.

BTW, from my knowledge, nothing would happen for
HashMapStateBackend,
which does not support incremental checkpoint yet, when
enabling
incremental checkpoints.


Best
Yun Tang

From: Martijn Visser 
Sent: Monday, June 13, 2022 18:05
To: dev@flink.apache.org 
Subject: Re: [DISCUSS ] Make state.backend.incremental as true
by
default

Hi Lihe,

What happens if we enable incremental checkpoints by default
while
the
used
memory backend is HashMapStateBackend, which doesn't support
incremental
checkpoints?

Best regards,

Martijn

On Mon, Jun 13, 2022 at 11:59, Lihe Ma  wrote:

Hi, Everyone,

I would like to open a discussion on setting incremental
checkpoint
as
default behavior.

Currently, the configuration of state.backend.incremental is
set
as
false
by default. Incremental checkpoint has been adopted widely in
industry
community for many years , and it is also well-tested from
the
feedback
in
the community discussion. Incremental checkpointing is more
light-weighted:
shorter checkpoint duration, less uploaded data and less
resource
consumption.

In terms of backward compatibility, enable incremental
checkpointing
would
not make any data loss no matter restoring from a full
checkpoint/savepoint
or an incremental checkpoint.

FLIP-193 (Snapshot ownership)[1] has been released in 1.15,
incremental
checkpoint no longer depends on a previous restored
checkpoint
in
default
NO_CLAIM mode, which makes the checkpoint lineage much
cleaner,
it
is a
good chance to change the configuration
state.backend.incremental
to
true
as default.

Thus, based on the above discussion, I suggest to make
state.backend.incremental as true by default. What do you
think
of
this
proposal?

[1]







https://cwiki.apache.org/confluence/display/FLINK/FLIP-193%3A+Snapshots+ownership

Best regards,
Lihe Ma










[jira] [Created] (FLINK-27841) RocksDB cache miss increase in 1.15

2022-05-30 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-27841:
---

 Summary: RocksDB cache miss increase in 1.15
 Key: FLINK-27841
 URL: https://issues.apache.org/jira/browse/FLINK-27841
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Affects Versions: 1.15.0
Reporter: Feifan Wang
 Attachments: image-2022-05-31-00-21-49-338.png, 
image-2022-05-31-00-22-45-123.png

I ran the same job with 1.12.2 and 1.15.0 and found that CPU usage is much 
higher than in 1.12. After careful comparison, I found a higher cache miss 
rate in 1.15, although the block cache size is the same. Is this expected?

I know the RocksDB version in 1.15 is 6.20.3 and in 1.12 it is 5.17.2; is 
this just caused by the different RocksDB versions?

!image-2022-05-31-00-21-49-338.png|width=599,height=146!

!image-2022-05-31-00-22-45-123.png|width=594,height=175!

test information:

job type : regular join

parallelism : 160

TaskManager memory : 32G

num of TaskManager : 10



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27734) Not showing checkpoint interval properly in WebUI when checkpoint is disabled

2022-05-22 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-27734:
---

 Summary: Not showing checkpoint interval properly  in WebUI when 
checkpoint is disabled
 Key: FLINK-27734
 URL: https://issues.apache.org/jira/browse/FLINK-27734
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Web Frontend
Reporter: Feifan Wang
 Attachments: image-2022-05-22-23-42-46-365.png

The checkpoint interval is not shown properly in the Web UI when checkpointing is disabled.

!image-2022-05-22-23-42-46-365.png|width=1019,height=362!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27187) The attemptsPerUpload metric may be lower than it actually is

2022-04-11 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-27187:
---

 Summary: The attemptsPerUpload metric may be lower than it 
actually is
 Key: FLINK-27187
 URL: https://issues.apache.org/jira/browse/FLINK-27187
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Feifan Wang


The attemptsPerUpload metric in ChangelogStorageMetricGroup indicates the 
distribution of the number of attempts per upload.

In the current implementation, each successful attempt tries to update 
attemptsPerUpload with its attemptNumber.

But consider this case: 
 # attempt 1 times out, then attempt 2 is scheduled
 # attempt 1 completes before attempt 2 and updates attemptsPerUpload with 1

In fact there were two attempts, but attemptsPerUpload was updated with 1.

So, I think we should add an "actionAttemptsCount" field to 
RetryExecutor.RetriableActionAttempt, shared across all attempts of the same 
upload action and representing the number of upload attempts. The completed 
attempt should then use this field to update attemptsPerUpload.
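
The shared counter could be sketched as below. Only the field name actionAttemptsCount comes from the proposal; the surrounding class is hypothetical and is not Flink's actual RetryExecutor.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: every attempt of the same upload action bumps one shared
// counter, and whichever attempt completes reports that shared total
// instead of its own attemptNumber.
class UploadAction {
    final AtomicInteger actionAttemptsCount = new AtomicInteger();

    int startAttempt() {
        return actionAttemptsCount.incrementAndGet();
    }

    // Value to record in the attemptsPerUpload histogram on completion.
    int attemptsToReport() {
        return actionAttemptsCount.get();
    }

    public static void main(String[] args) {
        UploadAction action = new UploadAction();
        action.startAttempt(); // attempt 1 (times out)
        action.startAttempt(); // attempt 2 scheduled after the timeout
        // Attempt 1 completes first, but two attempts were actually made:
        System.out.println(action.attemptsToReport()); // 2
    }
}
```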

 

How do you think about ? [~ym] , [~roman] 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27155) Reduce multiple reads to the same Changelog file in the same taskmanager during restore

2022-04-09 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-27155:
---

 Summary: Reduce multiple reads to the same Changelog file in the 
same taskmanager during restore
 Key: FLINK-27155
 URL: https://issues.apache.org/jira/browse/FLINK-27155
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing, Runtime / State Backends
Reporter: Feifan Wang


h3. Background

In the current implementation, State changes of different operators in the same 
taskmanager may be written to the same changelog file, which effectively 
reduces the number of files and requests to DFS.

But on the other hand, the current implementation also reads the same changelog 
file multiple times on recovery. More specifically, the number of times the 
same changelog file is accessed is related to the number of ChangeSets 
contained in it. And since each read needs to skip the preceding bytes, this 
network traffic is also wasted.

The result is a lot of unnecessary requests to DFS when there are multiple 
slots and keyed states in the same taskmanager.
h3. Proposal

We can reduce multiple reads to the same changelog file in the same taskmanager 
during restore.

One possible approach is to read the changelog file all at once and cache it in 
memory or local file for a period of time when reading the changelog file.
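
A minimal sketch of the caching idea follows, assuming an unbounded in-memory map. A real implementation would bound the cache, evict entries after the "period of time", and read actual bytes from DFS instead of synthesizing them; all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: the first read of a changelog file pulls it from DFS (simulated
// here by synthesizing bytes); later reads of the same file, e.g. for
// other ChangeSets stored in it, hit the cache.
class ChangelogFileCache {
    private final Map<String, byte[]> cache = new HashMap<>();
    final AtomicInteger dfsReads = new AtomicInteger();

    byte[] readAll(String path) {
        return cache.computeIfAbsent(path, p -> {
            dfsReads.incrementAndGet(); // one DFS access per file
            return ("contents-of-" + p).getBytes();
        });
    }

    public static void main(String[] args) {
        ChangelogFileCache c = new ChangelogFileCache();
        // Three ChangeSets located in the same changelog file:
        c.readAll("changelog-1");
        c.readAll("changelog-1");
        c.readAll("changelog-1");
        System.out.println(c.dfsReads.get()); // 1 instead of 3
    }
}
```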

I think this could be a subtask of [v2 FLIP-158: Generalized incremental 
checkpoints|https://issues.apache.org/jira/browse/FLINK-25842] .

Hi [~ym] , [~roman] , what do you think?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27105) wrong metric type

2022-04-06 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-27105:
---

 Summary: wrong metric type
 Key: FLINK-27105
 URL: https://issues.apache.org/jira/browse/FLINK-27105
 Project: Flink
  Issue Type: Bug
Reporter: Feifan Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26799) StateChangeFormat#read not seek to offset correctly

2022-03-22 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-26799:
---

 Summary: StateChangeFormat#read not seek to offset correctly
 Key: FLINK-26799
 URL: https://issues.apache.org/jira/browse/FLINK-26799
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Feifan Wang


StateChangeFormat#read must seek to the offset before reading; the current 
implementation is as follows:

 
{code:java}
FSDataInputStream stream = handle.openInputStream();
DataInputViewStreamWrapper input = wrap(stream);
if (stream.getPos() != offset) {
    LOG.debug("seek from {} to {}", stream.getPos(), offset);
    input.skipBytesToRead((int) offset);
}{code}
But the if condition is incorrect: stream.getPos() returns the position of the 
underlying stream, which is different from the position of input.

By the way, because the underlying stream is wrapped in a BufferedInputStream, 
its position is always at n*bufferSize or at the end of the file. 

Actually, input is always at position 0 at the beginning, so I think we can 
skip to the offset directly.
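
The proposed fix boils down to skipping unconditionally from position 0. The sketch below uses plain java.io streams in place of Flink's FSDataInputStream / DataInputViewStreamWrapper, so the names and types are stand-ins.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch: the wrapped input view always starts at position 0, so skip
// 'offset' bytes unconditionally instead of comparing against the
// underlying stream's position (which reflects read-ahead buffering).
class SeekToOffset {
    static DataInputStream openAtOffset(byte[] data, long offset) throws IOException {
        DataInputStream input = new DataInputStream(new ByteArrayInputStream(data));
        input.skipBytes((int) offset); // no position check needed
        return input;
    }

    public static void main(String[] args) throws IOException {
        DataInputStream in = openAtOffset(new byte[] {10, 20, 30, 40}, 2);
        System.out.println(in.read()); // 30: the first byte after the offset
    }
}
```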

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26766) Mistake in ChangelogStateHandleStreamImpl#getIntersection

2022-03-21 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-26766:
---

 Summary: Mistake in ChangelogStateHandleStreamImpl#getIntersection
 Key: FLINK-26766
 URL: https://issues.apache.org/jira/browse/FLINK-26766
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Feifan Wang


 

Maybe mistake in ChangelogStateHandleStreamImpl :
{code:java}
public KeyedStateHandle getIntersection(KeyGroupRange keyGroupRange) {
    KeyGroupRange offsets = keyGroupRange.getIntersection(keyGroupRange);
    // ..
} {code}
I guess it should be:

KeyGroupRange offsets = 
{color:#de350b}this{color}.keyGroupRange.getIntersection(keyGroupRange);

 

Hi [~roman] , can you help confirm that ?

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-24562) YarnResourceManagerDriverTest should not use ContainerStatusPBImpl.newInstance

2021-10-14 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-24562:
---

 Summary: YarnResourceManagerDriverTest should not use 
ContainerStatusPBImpl.newInstance
 Key: FLINK-24562
 URL: https://issues.apache.org/jira/browse/FLINK-24562
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / YARN
Affects Versions: 1.14.0
Reporter: Feifan Wang


In YarnResourceManagerDriverTest, we create ContainerStatus with the static 
method ContainerStatusPBImpl{{.newInstance}}, which is annotated as private and 
unstable.

Although this method is still available in the latest version of yarn, some 
third-party versions of yarn may modify it. In fact, this method was modified 
in the internal version provided by our yarn team, which caused flink-1.14.0 to 
fail to compile.

Moreover, there is already an org.apache.flink.yarn.TestingContainerStatus; I 
think we should use it directly.
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24384) Count checkpoints failed in trigger phase into numberOfFailedCheckpoints

2021-09-27 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-24384:
---

 Summary: Count checkpoints failed in trigger phase into 
numberOfFailedCheckpoints
 Key: FLINK-24384
 URL: https://issues.apache.org/jira/browse/FLINK-24384
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Reporter: Feifan Wang


h1. *Problem*

In the current implementation, checkpoints that fail in the trigger phase are 
not counted in the metric 'numberOfFailedCheckpoints', so users cannot tell 
from this metric that checkpointing has stopped.

Admittedly, users can alert on rules like _*'numberOfCompletedCheckpoints' has 
not increased in the past few minutes*_ (maybe checkpoint interval + timeout), 
but I think that is roundabout and cannot alert in a timely manner.

 
h1. *Proposal*

As in the title: count checkpoints that fail in the trigger phase into 
'numberOfFailedCheckpoints'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24274) Wrong parameter order in documentation of State Processor API

2021-09-13 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-24274:
---

 Summary: Wrong parameter order in documentation of State Processor 
API 
 Key: FLINK-24274
 URL: https://issues.apache.org/jira/browse/FLINK-24274
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Reporter: Feifan Wang
 Attachments: image-2021-09-14-02-09-44-334.png, 
image-2021-09-14-02-11-12-034.png

Wrong order of parameters path and stateBackend in example code of [State 
Processor Api # 
modifying-savepoints|https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/libs/state_processor_api/#modifying-savepoints]

!image-2021-09-14-02-09-44-334.png|width=489,height=126!

!image-2021-09-14-02-11-12-034.png|width=478,height=222!

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24159) document of entropy injection may mislead users

2021-09-05 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-24159:
---

 Summary: document of entropy injection may mislead users
 Key: FLINK-24159
 URL: https://issues.apache.org/jira/browse/FLINK-24159
 Project: Flink
  Issue Type: Improvement
  Components: Documentation, Runtime / Checkpointing
Reporter: Feifan Wang


FLINK-9061 introduced entropy injection into S3 paths for better scalability, 
but the documentation of 
[entropy-injection-for-s3-file-systems|https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/filesystems/s3/#entropy-injection-for-s3-file-systems]
 uses an example with the checkpoint directory 
"{color:#FF}s3://my-bucket/checkpoints/_entropy_/dashboard-job/{color}"; 
with this configuration every checkpoint key still starts with the constant 
checkpoints/ prefix, which actually reduces scalability.

Thanks to dmtolpeko for describing this issue in his blog ( 
[flink-and-s3-entropy-injection-for-checkpoints 
|http://cloudsqale.com/2021/01/02/flink-and-s3-entropy-injection-for-checkpoints/]).
h3. Proposal

Alter the checkpoint directory in the documentation of 
[entropy-injection-for-s3-file-systems|https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/filesystems/s3/#entropy-injection-for-s3-file-systems]
 to "{color:#FF}s3://my-bucket/_entropy_/checkpoints/dashboard-job/{color}" 
(put the entropy key at the start of the keys).

 

If this proposal is appropriate, I would be glad to submit a PR to modify the 
document. Any other ideas?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24149) Make checkpoint relocatable

2021-09-03 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-24149:
---

 Summary: Make checkpoint relocatable
 Key: FLINK-24149
 URL: https://issues.apache.org/jira/browse/FLINK-24149
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Reporter: Feifan Wang


h3. Background

FLINK-5763 proposed making savepoints relocatable; checkpoints have similar 
requirements. For example, migrating jobs to other HDFS clusters can be 
achieved through a savepoint, but we prefer to use persistent checkpoints, 
especially since RocksDBStateBackend incremental checkpoints have better 
snapshot and restore performance than savepoints.

 

FLINK-8531 standardized directory layout :

 
{code:java}
/user-defined-checkpoint-dir
    |
    + 1b080b6e710aabbef8993ab18c6de98b (job's ID)
        |
        + --shared/
        + --taskowned/
        + --chk-1/
        + --chk-2/
        + --chk-3/
        ...
{code}
 * State backend will create a subdirectory with the job's ID that will contain 
the actual checkpoints, such as: 
user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/
 * Each checkpoint individually will store all its files in a subdirectory that 
includes the checkpoint number, such as: 
user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/chk-3/
 * Files shared between checkpoints will be stored in the shared/ directory in 
the same parent directory as the separate checkpoint directory, such as: 
user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/shared/
 * Similar to shared files, files owned strictly by tasks will be stored in the 
taskowned/ directory in the same parent directory as the separate checkpoint 
directory, such as: 
user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/taskowned/

h3. Proposal

Since an individual checkpoint directory does not contain the complete state 
data, we cannot make it relocatable, but its parent directory can be. The only 
work left is to make the metadata file reference relative file paths.

I propose the following changes to _*FsCheckpointStateOutputStream*_ :
 * introduce a _*checkpointDirectory*_ field
 * introduce an *_entropyInjecting_* field
 * make *_closeAndGetHandle()_* return a _*RelativeFileStateHandle*_ with a 
relative path based on _*checkpointDirectory*_ (except for entropy-injecting 
file systems)
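The relative-path idea can be illustrated with a small, self-contained sketch. The helper name `relativize` and the class are hypothetical, not Flink's actual API: the handle would record only the portion of the file path below the job's checkpoint directory, so the metadata stays valid after the whole directory is moved.

```java
import java.net.URI;

public class RelativePathSketch {

    // Hypothetical helper: strip the checkpoint base directory from an
    // absolute state-file path, leaving a path that survives relocation.
    static String relativize(String checkpointDirectory, String filePath) {
        String base = checkpointDirectory.endsWith("/")
                ? checkpointDirectory
                : checkpointDirectory + "/";
        return URI.create(base).relativize(URI.create(filePath)).toString();
    }

    public static void main(String[] args) {
        String dir = "hdfs:///ckpts/1b080b6e710aabbef8993ab18c6de98b";
        String file = dir + "/shared/3f1a.sst";
        // Only "shared/3f1a.sst" would be written into the metadata, so the
        // whole job directory can be copied to another cluster and restored.
        System.out.println(relativize(dir, file));
    }
}
```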

[~yunta], [~trohrmann] , I verified this in our environment, and I will submit 
a pull request to implement this feature. Please help evaluate whether it is 
appropriate.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23949) first incremental checkpoint after a savepoint will degenerate into a full checkpoint

2021-08-24 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-23949:
---

 Summary: first incremental checkpoint after a savepoint will 
degenerate into a full checkpoint
 Key: FLINK-23949
 URL: https://issues.apache.org/jira/browse/FLINK-23949
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Affects Versions: 1.13.2, 1.12.5, 1.11.4
Reporter: Feifan Wang
 Attachments: image-2021-08-25-00-59-05-779.png

In RocksIncrementalSnapshotStrategy we record the uploaded RocksDB files 
corresponding to each checkpoint id, and clean them up in 
_CheckpointListener#notifyCheckpointComplete_ :
{code:java}
@Override
public void notifyCheckpointComplete(long completedCheckpointId) {
    synchronized (materializedSstFiles) {
        if (completedCheckpointId > lastCompletedCheckpointId) {
            materializedSstFiles
                    .keySet()
                    .removeIf(checkpointId -> checkpointId < completedCheckpointId);
            lastCompletedCheckpointId = completedCheckpointId;
        }
    }
}{code}
 

This works well without savepoints, but when a savepoint completes, it also 
cleans up the _materializedSstFiles_ of the previous checkpoints. As a result, 
the first checkpoint after the savepoint must re-upload all files in RocksDB.

!image-2021-08-25-00-59-05-779.png|width=1640,height=225!

Solving the problem is simple: I propose changing 
CheckpointListener#notifyCheckpointComplete to the following form:

 
{code:java}
@Override
public void notifyCheckpointComplete(long completedCheckpointId) {
    synchronized (materializedSstFiles) {
        if (completedCheckpointId > lastCompletedCheckpointId
                && materializedSstFiles.keySet().contains(completedCheckpointId)) {
            materializedSstFiles
                    .keySet()
                    .removeIf(checkpointId -> checkpointId < completedCheckpointId);
            lastCompletedCheckpointId = completedCheckpointId;
        }
    }
}
{code}
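A minimal, self-contained simulation of the guarded cleanup (class and method names here are illustrative; the real bookkeeping lives in RocksIncrementalSnapshotStrategy). A completed savepoint id was never registered in the map, so with the guard in place it no longer wipes the previous checkpoint's uploaded-file records:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MaterializedSstFilesSketch {

    // Simplified model of the snapshot strategy's bookkeeping:
    // checkpoint id -> sst files already uploaded for that checkpoint.
    final Map<Long, Set<String>> materializedSstFiles = new HashMap<>();
    long lastCompletedCheckpointId = -1;

    void registerCheckpoint(long id, Set<String> files) {
        materializedSstFiles.put(id, files);
    }

    // Proposed fix: only prune when the completed id belongs to a checkpoint
    // this backend actually tracked; a completed savepoint id is ignored.
    void notifyCheckpointComplete(long completedCheckpointId) {
        if (completedCheckpointId > lastCompletedCheckpointId
                && materializedSstFiles.containsKey(completedCheckpointId)) {
            materializedSstFiles.keySet().removeIf(id -> id < completedCheckpointId);
            lastCompletedCheckpointId = completedCheckpointId;
        }
    }

    public static void main(String[] args) {
        MaterializedSstFilesSketch s = new MaterializedSstFilesSketch();
        s.registerCheckpoint(1, new HashSet<>(Arrays.asList("a.sst", "b.sst")));
        // Savepoint 2 completes; it was never registered, so checkpoint 1's
        // uploaded files remain known and the next incremental checkpoint
        // can still skip re-uploading them.
        s.notifyCheckpointComplete(2);
        System.out.println(s.materializedSstFiles.containsKey(1L)); // true
    }
}
```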
 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21986) taskmanager native memory not release timely after restart

2021-03-25 Thread Feifan Wang (Jira)
Feifan Wang created FLINK-21986:
---

 Summary: taskmanager native memory not release timely after restart
 Key: FLINK-21986
 URL: https://issues.apache.org/jira/browse/FLINK-21986
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.12.1
 Environment: flink version:1.12.1
run :yarn session
job type:mock source -> regular join
 
checkpoint interval: 3m
Taskmanager memory : 16G
 
Reporter: Feifan Wang
 Attachments: image-2021-03-25-15-53-44-214.png, 
image-2021-03-25-16-07-29-083.png, image-2021-03-26-11-46-06-828.png, 
image-2021-03-26-11-47-21-388.png

I ran a regular join job with Flink 1.12.1 and found that taskmanager native 
memory is not released timely after a restart caused by exceeding the 
checkpoint tolerable failure threshold.

*Problem job information:*
 # The job first restarted because the checkpoint tolerable failure threshold was exceeded.
 # Then taskmanagers were killed by YARN many times.
 # In this case, the TM heap was set to 7.68G, but every TM's heap usage stayed under 4.2G
!image-2021-03-25-15-53-44-214.png|width=496,height=103!
 # Non-heap size increased after the restart, but stayed under 160M.
!https://km.sankuai.com/api/file/cdn/706284607/716474606?contentType=1&isNewContent=false&isNewContent=false|width=493,height=102!
 # Taskmanager process memory increased by 3-4G after the restart (this figure shows one of the taskmanagers)
!image-2021-03-25-16-07-29-083.png|width=493,height=107!

*My guess:*

The [RocksDB 
wiki|https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management]
 mentions: "Many of the Java Objects used in the RocksJava API will be backed 
by C++ objects for which the Java Objects have ownership. As C++ has no notion 
of automatic garbage collection for its heap in the way that Java does, we must 
explicitly free the memory used by the C++ objects when we are finished with 
them."

 

So, is it possible that RocksDBStateBackend does not call 
AbstractNativeReference#close() to release the memory used by RocksDB C++ 
objects?
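For illustration, here is the ownership pattern the RocksDB wiki describes, using a hypothetical stand-in class instead of a real RocksJava object such as org.rocksdb.Options: closing the Java wrapper is what frees the native allocation, deterministically and without waiting for GC.

```java
// Hypothetical stand-in for a RocksJava object whose close() frees the
// backing C++ allocation (real RocksJava classes extend
// AbstractNativeReference and implement AutoCloseable).
class NativeBacked implements AutoCloseable {
    static int liveNativeObjects = 0;

    NativeBacked() {
        liveNativeObjects++; // simulate a native allocation
    }

    @Override
    public void close() {
        liveNativeObjects--; // simulate freeing the native memory
    }
}

public class CloseSketch {
    public static void main(String[] args) {
        // try-with-resources releases the native memory deterministically,
        // instead of leaving it to finalization after a later GC cycle.
        try (NativeBacked options = new NativeBacked()) {
            // ... use the object ...
        }
        System.out.println(NativeBacked.liveNativeObjects); // 0
    }
}
```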

*I made a change:*

        Actively call System.gc() and System.runFinalization() every minute.
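The workaround above can be sketched as a daemon thread that triggers collection periodically (illustrative only; this was a diagnostic change, not a recommended production fix, and the class and method names are hypothetical):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicGcSketch {

    // Schedule System.gc() + System.runFinalization() every minute so that
    // unreachable RocksJava wrappers run their finalizers and free the
    // backing C++ memory promptly.
    public static ScheduledExecutorService start() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "periodic-gc");
                    t.setDaemon(true);
                    return t;
                });
        scheduler.scheduleAtFixedRate(() -> {
            System.gc();
            System.runFinalization();
        }, 1, 1, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = start();
        // The scheduler keeps firing until it is shut down.
        scheduler.shutdownNow();
    }
}
```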

 *And ran this test again:*
 # Taskmanager process memory showed no obvious increase
!image-2021-03-26-11-46-06-828.png|width=495,height=93!
 # The job ran for several days and restarted many times, but no taskmanager was killed by 
YARN as before



*Summary:*
 # First, there is some native memory that is not released timely after a restart in 
this situation.
 # I guess it may be RocksDB C++ objects, but I have not verified this in the source code 
of RocksDBStateBackend.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)