[jira] [Created] (FLINK-22022) Reduce the ExecNode scan scope to improve performance when converting json plan to ExecNodeGraph
godfrey he created FLINK-22022: -- Summary: Reduce the ExecNode scan scope to improve performance when converting json plan to ExecNodeGraph Key: FLINK-22022 URL: https://issues.apache.org/jira/browse/FLINK-22022 Project: Flink Issue Type: Sub-task Components: Table SQL / Planner Reporter: godfrey he Fix For: 1.13.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22021) PushFilterIntoLegacyTableSourceScanRule fails to deal with INTERVAL types
Caizhi Weng created FLINK-22021: --- Summary: PushFilterIntoLegacyTableSourceScanRule fails to deal with INTERVAL types Key: FLINK-22021 URL: https://issues.apache.org/jira/browse/FLINK-22021 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.13.0 Reporter: Caizhi Weng Fix For: 1.13.0 Add the following test case to {{PushFilterIntoLegacyTableSourceScanRuleTest}} to reproduce this bug: {code:scala} @Test def testWithInterval(): Unit = { val schema = TableSchema .builder() .field("a", DataTypes.STRING) .field("b", DataTypes.STRING) .build() val data = List(Row.of("2021-03-30 10:00:00", "2021-03-30 11:00:00")) TestLegacyFilterableTableSource.createTemporaryTable( util.tableEnv, schema, "MTable", isBounded = true, data, List("a", "b")) util.verifyRelPlan( """ |SELECT * FROM MTable |WHERE | TIMESTAMPADD(HOUR, 1, TO_TIMESTAMP(a)) >= TO_TIMESTAMP(b) | OR | TIMESTAMPADD(YEAR, 1, TO_TIMESTAMP(b)) >= TO_TIMESTAMP(a) |""".stripMargin) } {code} The exception stack is {code} org.apache.flink.table.api.ValidationException: Data type 'INTERVAL SECOND(3) NOT NULL' with conversion class 'java.time.Duration' does not support a value literal of class 'java.math.BigDecimal'. at org.apache.flink.table.expressions.ValueLiteralExpression.validateValueDataType(ValueLiteralExpression.java:286) at org.apache.flink.table.expressions.ValueLiteralExpression.(ValueLiteralExpression.java:79) at org.apache.flink.table.expressions.ApiExpressionUtils.valueLiteral(ApiExpressionUtils.java:251) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitLiteral(RexNodeExtractor.scala:451) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitLiteral(RexNodeExtractor.scala:359) at org.apache.calcite.rex.RexLiteral.accept(RexLiteral.java:1173) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitCall(RexNodeExtractor.scala:458) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitCall(RexNodeExtractor.scala:359) at org.apache.calcite.rex.RexCall.accept(RexCall.java:174) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitCall(RexNodeExtractor.scala:458) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitCall(RexNodeExtractor.scala:359) at org.apache.calcite.rex.RexCall.accept(RexCall.java:174) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at
[jira] [Created] (FLINK-22020) Reflections ClassNotFound when trying to deserialize a json plan in deployment envrionment
Wenlong Lyu created FLINK-22020: --- Summary: Reflections ClassNotFound when trying to deserialize a json plan in deployment envrionment Key: FLINK-22020 URL: https://issues.apache.org/jira/browse/FLINK-22020 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.13.0 Reporter: Wenlong Lyu -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS] Releasing Stateful Functions 3.0.0
Wow, the feature list sounds really exciting! No concerns from my side! On Thu, Mar 25, 2021 at 1:57 PM Konstantin Knauf wrote: > Hi Gordon, > > Thank you for the update. +1 for a timely release. For existing Statefun > users, is there already something in the documentation that describes the > breaking changes/migration in more detail in order to prepare? > > Cheers, > > Konstantin > > On Thu, Mar 25, 2021 at 9:27 AM Tzu-Li (Gordon) Tai > wrote: > > > Hi everyone, > > > > We'd like to prepare to release StateFun 3.0.0 over the next few days, > > ideally starting the first release candidate early next week. > > > > This is a version bump from 2.x to 3.x, with some major new features and > > reworks: > > > >- New request-reply protocol: > >the protocol has been reworked as StateFun is moving forward towards a > >"remote functions first" design. The new protocol enhances StateFun > > apps to > >be hot upgraded without restarting the StateFun runtime, including > >registering new state for functions, and adding new functions to the > > app. > >- Cross-language type system: > >The new protocol also enables a much more ergonomic, cross-language > type > >system. This makes it much easier and natural for users to send > > messages of > >various types around between their functions (primitive types, or > custom > >types such as JSON messages). > >- Java SDK for remote functions: Going remote first, we've now also > >added a new Java SDK for remote functions. > > > > These are some major features that users would benefit from greatly. > > Since this release also contains breaking changes, it's nice if we can > get > > this out earlier so that new users would not be onboarded with APIs that > > are going to be immediately deprecated. > > > > We're in the final stages of preparing documentation and examples [1] to > go > > with the release, and would like to kick off the release candidates early > > next week. > > > > Please let us know if you have any concerns. > > > > Thanks, > > Gordon > > > > [1] https://github.com/apache/flink-statefun-playground > > > > > -- > > Konstantin Knauf > > https://twitter.com/snntrable > > https://github.com/knaufk >
Re: Glob support on file access
Hi Etienne, In general, any small PR on this subject is very welcome. I don't think that the community as a whole will invest much into FileInputFormat as the whole DataSet API is phasing out. Afaik SQL and Table API are only using InputFormat for the legacy compatibility layer (e.g. when it comes to translating into DataSet). All the new batchy stuff is based on BulkFormat and unified source/sink interface. I'm CC'ing Timo who can correct me if I'm wrong. So if you just want to add glob support on FileInputFormat /only/ for SQL and Table API, I don't think it's worth the effort. It would be more interesting to see if the new FileSource does support it properly and rather add it there. On Mon, Mar 29, 2021 at 4:57 PM Etienne Chauchot wrote: > But still this workaround would only work when you have access to the > underlying /FileInputFormat/. For//SQL and Table APIs, you don't so > you'll be unable to apply this workaround. So what we could do is make a > PR to support glob at the FileInputFormat level to profit for all APIs. > > I'm gonna do it if everyone agrees. > > Best > > Etienne Chauchot > > On 25/03/2021 13:12, Etienne Chauchot wrote: > > > > Hi all, > > > > In case it is useful to some of you: > > > > I have a big batch that needs to use globs (*.parquet for example) to > > read input files. It seems that globs do not work out of the box (see > > https://issues.apache.org/jira/browse/FLINK-6417) > > > > But there is a workaround: > > > > > > final FileInputFormat inputFormat =new FileInputFormat(new > Path(extractDir(filePath)));/* or any subclass of FileInputFormat*/ > /*extact parent dir*/ > > inputFormat.setFilesFilter(new > GlobFilePathFilter(Collections.singletonList(filePath), > Collections.emptyList()));/*filePath contains glob, the whole path needs to > be provided to > > GlobFilePathFilter*/ > > inputFormat.setNestedFileEnumeration(true); > > > > Hope, it helps some people > > > > Etienne Chauchot > > > > >
Re: Re: Re: [DISCUSS] FLIP-147: Support Checkpoints After Tasks Finished
Thanks a lot for all your input. To sum up the discussion so far: ## Final checkpoints We currently agree on favouring a single final checkpoint which can shut down the topology. In order to support this we need to be able to create a checkpoint after an operator has finished producing results. If we want to send checkpoint barriers through the topology this means that a task must not close the network connection when it sees "logical end of data". Instead, on the "logical end of data" the contained operator should flush all of its records. This means that we need to introduce a new event "logical end of data" and API calls to signal an operator that it should flush its data and that it should shut down. Given the available methods, `endInput` could be used for signalling the "logical end of data" and `dispose` for shutting the operator down. A task will only shut down and send an "EndOfPartitionEvent" which closes the TCP connection if all of its inputs have shut down and if it has completed a final checkpoint. ## Global commits Now a somewhat related but also orthogonal issue is how to support a global commit. A global commit is a commit where the external artefacts of a checkpoint become visible all at once. The global commit should be supported for streaming as well as batch executions (it is probably even more important for batch executions). In general, there could be different ways of implementing the global commit mechanism: 1. Collect global commit handles on the JM and run the global commit action on the JM 2. Collect global commit handles in a parallelism 1 operator which performs the global commit action Approach 2. would probably require to be able to send records from the snapshotState() method which would be the global commit handles. Both approaches would have to persist some kind of information in the checkpoint which allows redoing the global commit operation in case of a failover. Therefore, for approach 1. it would be required that we send the global commit handles to the JM from the snapshotState() method and not the notifyCheckpointComplete(). A related question is in which order to execute the local and global commit actions: 1. Unspecified order 2. First local commits and then global commits Option 1. would be easier to implement and might already be good enough for most sinks. I would suggest treating final checkpoints and global commits as two related but separate things. I think it is fine to first try to solve the final checkpoints problem and then to tackle the global commits. This will help to decrease the scope of each individual feature. Cheers, Till On Fri, Mar 5, 2021 at 5:12 AM Yun Gao wrote: > Hi Piotr, > > Very thanks for the suggestions and thoughts! > > > Is this a problem? Pipeline would be empty, so EndOfPartitionEvent would > be traveling very quickly. > > No, this is not a problem, sorry I have some wrong thoughts here, > initially in fact I'm thinking on this issue raised by > @kezhu: > > > Besides this, will FLIP-147 eventually need some ways to decide whether > an operator need final checkpoint > @Yun @Guowei ? @Arvid mentions this in earlier mail. > > For this issue itself, I'm still lean towards we might still need it, for > example, suppose we have a job that > do not need to commit anything on finished, then it do not need to wait > for checkpoint at all for normal > finish case. > > > Yes, but aren't we doing it right now anyway? > `StreamSource#advanceToEndOfEventTime`? > > Yes, we indeed have advancedEndOfEventTime for both legacy and new > sources, sorry for the overlook. > > > Is this the plan? That upon recovery we are restarting all operators, > even those that have already finished? > Certainly it's one of the possibilities. > > For the first version we would tend to use this way since it is easier to > implement, and we should always need > to consider the case that tasks are started but operators are finished > since there might be also tasks with part > of operators finished. For the long run I think we could continue to > optimize it via not restart the finished tasks > at all. > > > Keep in mind that those are two separate things, as I mentioned in a > previous e-mail: > > > II. When should the `GlobalCommitHandle` be created? Should it be > returned from `snapshotState()`, `notifyCheckpointComplete()` or somewhere > else? > > > III. What should be the ordering guarantee between global commit and > local commit, if any? Actually the easiest to implement would be undefined, > but de facto global commit happening before local commits (first invoke > > `notifyCheckpointComplete()` > on the `OperatorCoordinator` and either after or in parallel send > `notifyCheckpointComplete()` RPCs). As far as I can tell, undefined order > should work for the use cases that I'm aware of. > > > > We could create the `GlobalCommitHandle` in > `StreamOperator#snapshotState()`, while we could also ensure that > `notifyCheckpointComplete()` is
[jira] [Created] (FLINK-22019) UnalignedCheckpointRescaleITCase hangs on azure
Dawid Wysakowicz created FLINK-22019: Summary: UnalignedCheckpointRescaleITCase hangs on azure Key: FLINK-22019 URL: https://issues.apache.org/jira/browse/FLINK-22019 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.13.0 Reporter: Dawid Wysakowicz https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15658=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=9347 -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Glob support on file access
But still this workaround would only work when you have access to the underlying /FileInputFormat/. For//SQL and Table APIs, you don't so you'll be unable to apply this workaround. So what we could do is make a PR to support glob at the FileInputFormat level to profit for all APIs. I'm gonna do it if everyone agrees. Best Etienne Chauchot On 25/03/2021 13:12, Etienne Chauchot wrote: Hi all, In case it is useful to some of you: I have a big batch that needs to use globs (*.parquet for example) to read input files. It seems that globs do not work out of the box (see https://issues.apache.org/jira/browse/FLINK-6417) But there is a workaround: final FileInputFormat inputFormat =new FileInputFormat(new Path(extractDir(filePath)));/* or any subclass of FileInputFormat*/ /*extact parent dir*/ inputFormat.setFilesFilter(new GlobFilePathFilter(Collections.singletonList(filePath), Collections.emptyList()));/*filePath contains glob, the whole path needs to be provided to GlobFilePathFilter*/ inputFormat.setNestedFileEnumeration(true); Hope, it helps some people Etienne Chauchot
[jira] [Created] (FLINK-22018) JdbcBatchingOutputFormat should not log SQLException on retry
Nicolas Deslandes created FLINK-22018: - Summary: JdbcBatchingOutputFormat should not log SQLException on retry Key: FLINK-22018 URL: https://issues.apache.org/jira/browse/FLINK-22018 Project: Flink Issue Type: Bug Components: Connectors/ RabbitMQ Affects Versions: 1.7.2, 1.8.0, 1.8.2 Reporter: Nicolas Deslandes Assignee: Nicolas Deslandes Fix For: 1.8.3, 1.9.2, 1.10.0 RabbitMQ connector do not close consumers and channel on closing This potentially leaves idle consumer on the queue that prevent any other consumer on the same queue to get message, this happens the most when a job is stop/cancel and redeploy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22017) Regions may never be scheduled when there are cross-region blocking edges
Zhilong Hong created FLINK-22017: Summary: Regions may never be scheduled when there are cross-region blocking edges Key: FLINK-22017 URL: https://issues.apache.org/jira/browse/FLINK-22017 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Zhilong Hong Attachments: Illustration.jpg For the topology with cross-region blocking edges, there are regions that may never be scheduled. The case is illustrated in the figure below. !Illustration.jpg! Let's denote the vertices with layer_number. It's clear that the edge connects v2_2 and v3_2 crosses region 1 and region 2. Since region 1 has no blocking edges connected to other regions, it will be scheduled first. When vertex2_2 is finished, PipelinedRegionSchedulingStrategy will trigger {{onExecutionStateChange}} for it. As expected, region 2 will be scheduled since all its consumer partitions are consumable. But in fact region 2 won't be scheduled, because the result partition of vertex2_2 is not tagged as consumable. Whether it is consumable or not is determined by its IntermediateDataSet. However, an IntermediateDataSet is consumable if and only if all the producers of its IntermediateResultPartitions are finished. This IntermediateDataSet will never be consumable since vertex2_3 is not scheduled. All in all, this forms a deadlock that a region will never be scheduled because it's not scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22016) PushFilterIntoLegacyTableSourceScanRule fails to deal with NULLs
Caizhi Weng created FLINK-22016: --- Summary: PushFilterIntoLegacyTableSourceScanRule fails to deal with NULLs Key: FLINK-22016 URL: https://issues.apache.org/jira/browse/FLINK-22016 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.13.0 Reporter: Caizhi Weng Fix For: 1.13.0 Add the following test case to {{PushFilterIntoLegacyTableSourceScanRuleTest}} to reproduce this bug: {code:scala} @Test def myTest(): Unit = { val schema = TableSchema .builder() .field("a", DataTypes.STRING) .field("b", DataTypes.STRING) .build() val data = List(Row.of("foo", "bar")) TestLegacyFilterableTableSource.createTemporaryTable( util.tableEnv, schema, "MTable", isBounded = true, data, List("a", "b")) util.verifyRelPlan( """ |WITH MView AS (SELECT CASE | WHEN a = b THEN a | ELSE CAST(NULL AS STRING) | END AS a | FROM MTable) |SELECT a FROM MView WHERE a IS NOT NULL |""".stripMargin) } {code} The exception stack is {code} org.apache.flink.table.api.ValidationException: Data type 'STRING NOT NULL' does not support null values. at org.apache.flink.table.expressions.ValueLiteralExpression.validateValueDataType(ValueLiteralExpression.java:272) at org.apache.flink.table.expressions.ValueLiteralExpression.(ValueLiteralExpression.java:79) at org.apache.flink.table.expressions.ApiExpressionUtils.valueLiteral(ApiExpressionUtils.java:251) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitLiteral(RexNodeExtractor.scala:451) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitLiteral(RexNodeExtractor.scala:359) at org.apache.calcite.rex.RexLiteral.accept(RexLiteral.java:1173) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitCall(RexNodeExtractor.scala:458) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitCall(RexNodeExtractor.scala:359) at org.apache.calcite.rex.RexCall.accept(RexCall.java:174) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter$$anonfun$8.apply(RexNodeExtractor.scala:459) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitCall(RexNodeExtractor.scala:458) at org.apache.flink.table.planner.plan.utils.RexNodeToExpressionConverter.visitCall(RexNodeExtractor.scala:359) at org.apache.calcite.rex.RexCall.accept(RexCall.java:174) at org.apache.flink.table.planner.plan.utils.RexNodeExtractor$$anonfun$extractConjunctiveConditions$1.apply(RexNodeExtractor.scala:136) at org.apache.flink.table.planner.plan.utils.RexNodeExtractor$$anonfun$extractConjunctiveConditions$1.apply(RexNodeExtractor.scala:135) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at
[jira] [Created] (FLINK-22015) SQL filter containing OR and IS NULL will produce an incorrect result.
Caizhi Weng created FLINK-22015: --- Summary: SQL filter containing OR and IS NULL will produce an incorrect result. Key: FLINK-22015 URL: https://issues.apache.org/jira/browse/FLINK-22015 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.13.0 Reporter: Caizhi Weng Fix For: 1.13.0 Add the following test case to {{CalcITCase}} to reproduce this bug. {code:scala} @Test def myTest(): Unit = { checkResult( """ |WITH myView AS (SELECT a, CASE | WHEN a = 1 THEN '1' | ELSE CAST(NULL AS STRING) | END AS s |FROM SmallTable3) |SELECT a FROM myView WHERE s = '2' OR s IS NULL |""".stripMargin, Seq(row(2), row(3))) } {code} However if we remove the {{s = '2'}} the result will be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup
Mikalai Lushchytski created FLINK-22014: --- Summary: Flink JobManager failed to restart after failure in kubernetes HA setup Key: FLINK-22014 URL: https://issues.apache.org/jira/browse/FLINK-22014 Project: Flink Issue Type: Bug Components: Deployment / Kubernetes Affects Versions: 1.12.2 Reporter: Mikalai Lushchytski After the JobManager pod failed and the new one started, it was not able to recover jobs due to the absence of recovery data in storage - config map pointed at not existing file. Due to this the JobManager pod entered into the `CrashLoopBackOff`state and was not able to recover - each attempt failed with the same error so the whole cluster became unrecoverable and not operating. I had to manually delete the config map and start the jobs again without the save point. If I tried to emulate the failure further by deleting job manager pod manually, the new pod every time recovered well and issue was not reproducible anymore artificially. Below is the failure log: ``` 2021-03-26 08:22:57,925 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager. 2021-03-26 08:22:57,928 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='stellar-flink-cluster-dispatcher-leader'}. 2021-03-26 08:22:57,931 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'} 2021-03-26 08:22:57,933 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6. 2021-03-26 08:22:58,029 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess. 2021-03-26 08:28:22,677 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore. 2021-03-26 08:28:22,681 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint. java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6. at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?] at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?] at java.lang.Thread.run(Unknown Source) [?:?] Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6. at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144) ~[flink-dist_2.12-1.12.2.jar:1.12.2] at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2] at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2] at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2] ... 4 more Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:171) ~[flink-dist_2.12-1.12.2.jar:1.12.2] at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2] at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2] at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2] at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
[GitHub] [flink-ml] becketqin commented on pull request #1: [Flink-21976] Move Flink ML pipeline API and library code from apache/flink to apache/flink-ml
becketqin commented on pull request #1: URL: https://github.com/apache/flink-ml/pull/1#issuecomment-809247359 @lindong28 Thanks for the patch. Merged to master. 08d058046f34b711128e0646ffbdc7e384c22064 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [DISCUSS] Feature freeze date for 1.13
+1 for the 31st of March for the feature freeze. Cheers, Till On Mon, Mar 29, 2021 at 10:12 AM Robert Metzger wrote: > +1 for March 31st for the feature freeze. > > > > On Fri, Mar 26, 2021 at 3:39 PM Dawid Wysakowicz > wrote: > > > Thank you Thomas! I'll definitely check the issue you linked. > > > > Best, > > > > Dawid > > > > On 23/03/2021 20:35, Thomas Weise wrote: > > > Hi Dawid, > > > > > > Thanks for the heads up. > > > > > > Regarding the "Rebase and merge" button. I find that merge option > useful, > > > especially for small simple changes and for backports. The following > > should > > > help to safeguard from the issue encountered previously: > > > https://github.com/jazzband/pip-tools/issues/1085 > > > > > > Thanks, > > > Thomas > > > > > > > > > On Tue, Mar 23, 2021 at 4:58 AM Dawid Wysakowicz < > dwysakow...@apache.org > > > > > > wrote: > > > > > >> Hi devs, users! > > >> > > >> 1. *Feature freeze date* > > >> > > >> We are approaching the end of March which we agreed would be the time > > for > > >> a Feature Freeze. From the knowledge I've gather so far it still seems > > to > > >> be a viable plan. I think it is a good time to agree on a particular > > date, > > >> when it should happen. We suggest *(end of day CEST) March 31st* > > >> (Wednesday next week) as the feature freeze time. > > >> > > >> Similarly as last time, we want to create RC0 on the day after the > > feature > > >> freeze, to make sure the RC creation process is running smoothly, and > to > > >> have a common testing reference point. > > >> > > >> Having said that let us remind after Robert & Dian from the previous > > >> release what it a Feature Freeze means: > > >> > > >> *B) What does feature freeze mean?*After the feature freeze, no new > > >> features are allowed to be merged to master. Only bug fixes and > > >> documentation improvements. > > >> The release managers will revert new feature commits after the feature > > >> freeze. > > >> Rational: The goal of the feature freeze phase is to improve the > system > > >> stability by addressing known bugs. New features tend to introduce new > > >> instabilities, which would prolong the release process. > > >> If you need to merge a new feature after the freeze, please open a > > >> discussion on the dev@ list. If there are no objections by a PMC > member > > >> within 48 (workday)hours, the feature can be merged. > > >> > > >> 2. *Merge PRs from the command line* > > >> > > >> In the past releases it was quite frequent around the Feature Freeze > > date > > >> that we ended up with a broken main branch that either did not compile > > or > > >> there were failing tests. It was often due to concurrent merges to the > > main > > >> branch via the "Rebase and merge" button. To overcome the problem we > > would > > >> like to suggest only ever merging PRs from a command line. Thank you > > >> Stephan for the idea! The suggested workflow would look as follows: > > >> > > >>1. Pull the change and rebase on the current main branch > > >>2. Build the project (e.g. from IDE, which should be faster than > > >>building entire project from cmd) -> this should ensure the project > > compiles > > >>3. Run the tests in the module that the change affects -> this > should > > >>greatly minimize the chances of failling tests > > >>4. Push the change to the main branch > > >> > > >> Let us know what you think! > > >> > > >> Best, > > >> > > >> Guowei & Dawid > > >> > > >> > > >> > > > > >
[jira] [Created] (FLINK-22013) Add CI to run tests in Azure DevOps pipeline for every flink-ml pull request
Dong Lin created FLINK-22013: Summary: Add CI to run tests in Azure DevOps pipeline for every flink-ml pull request Key: FLINK-22013 URL: https://issues.apache.org/jira/browse/FLINK-22013 Project: Flink Issue Type: Improvement Reporter: Dong Lin We should add CI to run tests in Azure DevOps pipeline for every flink-ml pull requestl. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22012) Add nightly-build pipeline to test flink-ml repo with the latest flink repo snapshot
Dong Lin created FLINK-22012: Summary: Add nightly-build pipeline to test flink-ml repo with the latest flink repo snapshot Key: FLINK-22012 URL: https://issues.apache.org/jira/browse/FLINK-22012 Project: Flink Issue Type: Improvement Reporter: Dong Lin We should add nightly-build pipeline to test flink-ml repo with the latest Flink repo snapshot so that, if there is any issue in the Flink repo that breaks the flink-ml repo, we can catch the issue within 24 hours. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22011) Support local global optimization for window aggregation in runtime
Jark Wu created FLINK-22011: --- Summary: Support local global optimization for window aggregation in runtime Key: FLINK-22011 URL: https://issues.apache.org/jira/browse/FLINK-22011 Project: Flink Issue Type: Sub-task Components: Table SQL / Runtime Reporter: Jark Wu Assignee: Jark Wu -- This message was sent by Atlassian Jira (v8.3.4#803005)
[RESULT] [VOTE] Apache Flink Jira Process (& Bot)
Hi everyone, I am happy to announce that the proposal has been accepted. We received five +1 votes, four of which are binding, and no vetoes: Till (binding) Roman (binding) Arvid (binding) Robert (binding) Matthias (non-binding) I will share an implementation proposal soon. Thanks, Konstantin -- Konstantin Knauf https://twitter.com/snntrable https://github.com/knaufk
Re: [DISCUSS] Feature freeze date for 1.13
+1 for March 31st for the feature freeze. On Fri, Mar 26, 2021 at 3:39 PM Dawid Wysakowicz wrote: > Thank you Thomas! I'll definitely check the issue you linked. > > Best, > > Dawid > > On 23/03/2021 20:35, Thomas Weise wrote: > > Hi Dawid, > > > > Thanks for the heads up. > > > > Regarding the "Rebase and merge" button. I find that merge option useful, > > especially for small simple changes and for backports. The following > should > > help to safeguard from the issue encountered previously: > > https://github.com/jazzband/pip-tools/issues/1085 > > > > Thanks, > > Thomas > > > > > > On Tue, Mar 23, 2021 at 4:58 AM Dawid Wysakowicz > > > wrote: > > > >> Hi devs, users! > >> > >> 1. *Feature freeze date* > >> > >> We are approaching the end of March which we agreed would be the time > for > >> a Feature Freeze. From the knowledge I've gather so far it still seems > to > >> be a viable plan. I think it is a good time to agree on a particular > date, > >> when it should happen. We suggest *(end of day CEST) March 31st* > >> (Wednesday next week) as the feature freeze time. > >> > >> Similarly as last time, we want to create RC0 on the day after the > feature > >> freeze, to make sure the RC creation process is running smoothly, and to > >> have a common testing reference point. > >> > >> Having said that let us remind after Robert & Dian from the previous > >> release what it a Feature Freeze means: > >> > >> *B) What does feature freeze mean?*After the feature freeze, no new > >> features are allowed to be merged to master. Only bug fixes and > >> documentation improvements. > >> The release managers will revert new feature commits after the feature > >> freeze. > >> Rational: The goal of the feature freeze phase is to improve the system > >> stability by addressing known bugs. New features tend to introduce new > >> instabilities, which would prolong the release process. > >> If you need to merge a new feature after the freeze, please open a > >> discussion on the dev@ list. If there are no objections by a PMC member > >> within 48 (workday)hours, the feature can be merged. > >> > >> 2. *Merge PRs from the command line* > >> > >> In the past releases it was quite frequent around the Feature Freeze > date > >> that we ended up with a broken main branch that either did not compile > or > >> there were failing tests. It was often due to concurrent merges to the > main > >> branch via the "Rebase and merge" button. To overcome the problem we > would > >> like to suggest only ever merging PRs from a command line. Thank you > >> Stephan for the idea! The suggested workflow would look as follows: > >> > >>1. Pull the change and rebase on the current main branch > >>2. Build the project (e.g. from IDE, which should be faster than > >>building entire project from cmd) -> this should ensure the project > compiles > >>3. Run the tests in the module that the change affects -> this should > >>greatly minimize the chances of failling tests > >>4. Push the change to the main branch > >> > >> Let us know what you think! > >> > >> Best, > >> > >> Guowei & Dawid > >> > >> > >> > >
[jira] [Created] (FLINK-22010) when flink executed union all opeators,exception occured
zhou created FLINK-22010: Summary: when flink executed union all opeators,exception occured Key: FLINK-22010 URL: https://issues.apache.org/jira/browse/FLINK-22010 Project: Flink Issue Type: Bug Components: API / Core Affects Versions: 1.12.1 Reporter: zhou *when I executed job on 1.11.2,the job no exception,when I executed job on 1.12.1 or 1.12.2 ,the job occured some exception.* *code as the following:* {quote}result = result1.union_all(result2) result = result.union_all(result3) # .union_all(result4).union_all(result5).union_all(result5).union_all(result6).union_all(result7) result.execute().print() {quote} above the code, when i comment the code as the following,the code also no exception on flink 1.12.1 : {quote}result = result1.union_all(result2) #result = result.union_all(result3) # .union_all(result4).union_all(result5).union_all(result5).union_all(result6).union_all(result7) result.execute().print() {quote} I dont know how to solve the problems, May be someone could help me? Excepion as the following: {quote}py4j.protocol.Py4JJavaError: An error occurred while calling o340.print. : java.lang.RuntimeException: Failed to fetch next result at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:109) at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80) at org.apache.flink.table.planner.sinks.SelectTableSinkBase$RowIteratorWrapper.hasNext(SelectTableSinkBase.java:117) at org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.hasNext(TableResultImpl.java:350) at org.apache.flink.table.utils.PrintUtils.printAsTableauForm(PrintUtils.java:149) at org.apache.flink.table.api.internal.TableResultImpl.print(TableResultImpl.java:154) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282) at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79) at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Failed to fetch job execution result at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.getAccumulatorResults(CollectResultFetcher.java:169) at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:118) at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:106) ... 16 more Caused by: java.util.concurrent.ExecutionException: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 9ba6325f27c97192e42e76bd52d05db8) at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.getAccumulatorResults(CollectResultFetcher.java:167) ... 18 more Caused by: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 9ba6325f27c97192e42e76bd52d05db8) at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:125) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) at org.apache.flink.client.program.rest.RestClusterClient.lambda$pollResourceAsync$22(RestClusterClient.java:665) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:394) at
[jira] [Created] (FLINK-22009) when data type is map,we cannot union
ying zhang created FLINK-22009: -- Summary: when data type is map,we cannot union Key: FLINK-22009 URL: https://issues.apache.org/jira/browse/FLINK-22009 Project: Flink Issue Type: Bug Affects Versions: 1.12.0 Reporter: ying zhang Attachments: image-2021-03-29-15-45-48-627.png I create 2 tables which contains data type map,but i flind i cannot union two tables,is there any solutions to solve this problem? !image-2021-03-29-15-45-48-627.png! this is the create table statements: CREATE TABLE `map_string_string1`( `query` string, `wid` string, `index` int, `page` string, `hc_cid1` string, `hc_cid2` string, `hc_cid3` string, `cid1` string, `cid2` string, `cid3` string, `ts` bigint, `number_feature` map) PARTITIONED BY ( `dt` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' CREATE TABLE `map_string_string2`( `query` string, `wid` string, `index` int, `page` string, `hc_cid1` string, `hc_cid2` string, `hc_cid3` string, `cid1` string, `cid2` string, `cid3` string, `ts` bigint, `number_feature` map) PARTITIONED BY ( `dt` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22008) writing metadata is not an atomic operation, we should add a commit logic
xiaogang zhou created FLINK-22008: - Summary: writing metadata is not an atomic operation, we should add a commit logic Key: FLINK-22008 URL: https://issues.apache.org/jira/browse/FLINK-22008 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Affects Versions: 1.12.2 Reporter: xiaogang zhou writing metadata is not an atomic operation, some logic can cause there is a metadata file in the checkpoint dir, but the data is corrupted if the jobmanager crash while writing the metadata. So we should consider to add commit operation in the checkpoint storage location -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink-ml] lindong28 commented on pull request #1: [Flink-21976] Move Flink ML pipeline API and library code from apache/flink to apache/flink-ml
lindong28 commented on pull request #1: URL: https://github.com/apache/flink-ml/pull/1#issuecomment-809144123 @becketqin Could you review this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-ml] lindong28 opened a new pull request #1: [Flink-21976] Move Flink ML pipeline API and library code from apache/flink to apache/flink-ml
lindong28 opened a new pull request #1: URL: https://github.com/apache/flink-ml/pull/1 ## What is the purpose of the change Move Flink ML pipeline API and library code from apache/flink to apache/flink-ml ## Brief change log - Move files under flink/flink-ml-parent to flink-ml repo - Add CODE_OF_CONDUCT.md, LICENSE and .gitignore - Add files needed for checkstyle under tools/maven - Update pom.xml to include plugins from apache/flink/pom.xml that are needed to build and release this flink-ml repo. ## Verifying this change 1) This PR could run and pass all unit tests. 2) Run `mvn install` in both `apache/flink-ml` and `apache/flink/flink-ml-parent` and verify that they generate the same set of files (e.g. *.pom files and *.jar files) at the same path under `~/.m2/repository/org/apache/flink/` 3) `mvn install` generates `flink-ml-api-1.13-SNAPSHOT.jar`, `flink-ml-lib_2.11-1.13-SNAPSHOT.jar` and `flink-ml-uber_2.11-1.13-SNAPSHOT.jar`. I used IntellIj to compare the jar files with those jar files generated by `apache/flink/flink-ml-parent` and verified that they contains the same class files. The only difference in the jar files is that the jar file from this repo has NOTICE that says "Copyright 2019-2020 The Apache Software Foundation", whereas the jar file from apache/flink has NOTICE that says "Copyright 2014-2021 The Apache Software Foundation". That is not clear to me where that difference comes from. I believe this is not a blocking issue. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [BULK]Re: [SURVEY] Remove Mesos support
+1 On Mon, Mar 29, 2021 at 5:44 AM Yangze Guo wrote: > +1 > > Best, > Yangze Guo > > On Mon, Mar 29, 2021 at 11:31 AM Xintong Song > wrote: > > > > +1 > > It's already a matter of fact for a while that we no longer port new > features to the Mesos deployment. > > > > Thank you~ > > > > Xintong Song > > > > > > > > On Fri, Mar 26, 2021 at 10:37 PM Till Rohrmann > wrote: > >> > >> +1 for officially deprecating this component for the 1.13 release. > >> > >> Cheers, > >> Till > >> > >> On Thu, Mar 25, 2021 at 1:49 PM Konstantin Knauf > wrote: > >>> > >>> Hi Matthias, > >>> > >>> Thank you for following up on this. +1 to officially deprecate Mesos > in the code and documentation, too. It will be confusing for users if this > diverges from the roadmap. > >>> > >>> Cheers, > >>> > >>> Konstantin > >>> > >>> On Thu, Mar 25, 2021 at 12:23 PM Matthias Pohl > wrote: > > Hi everyone, > considering the upcoming release of Flink 1.13, I wanted to revive the > discussion about the Mesos support ones more. Mesos is also already > listed > as deprecated in Flink's overall roadmap [1]. Maybe, it's time to > align the > documentation accordingly to make it more explicit? > > What do you think? > > Best, > Matthias > > [1] https://flink.apache.org/roadmap.html#feature-radar > > On Wed, Oct 28, 2020 at 9:40 AM Till Rohrmann > wrote: > > > Hi Oleksandr, > > > > yes you are right. The biggest problem is at the moment the lack of > test > > coverage and thereby confidence to make changes. We have some e2e > tests > > which you can find here [1]. These tests are, however, quite coarse > grained > > and are missing a lot of cases. One idea would be to add a Mesos > e2e test > > based on Flink's end-to-end test framework [2]. I think what needs > to be > > done there is to add a Mesos resource and a way to submit jobs to a > Mesos > > cluster to write e2e tests. > > > > [1] https://github.com/apache/flink/tree/master/flink-jepsen > > [2] > > > https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common > > > > Cheers, > > Till > > > > On Tue, Oct 27, 2020 at 12:29 PM Oleksandr Nitavskyi < > > o.nitavs...@criteo.com> wrote: > > > >> Hello Xintong, > >> > >> Thanks for the insights and support. > >> > >> Browsing the Mesos backlog and didn't identify anything critical, > which > >> is left there. > >> > >> I see that there are were quite a lot of contributions to the > Flink Mesos > >> in the recent version: > >> https://github.com/apache/flink/commits/master/flink-mesos. > >> We plan to validate the current Flink master (or release 1.12 > branch) our > >> Mesos setup. In case of any issues, we will try to propose changes. > >> My feeling is that our test results shouldn't affect the Flink 1.12 > >> release cycle. And if any potential commits will land into the > 1.12.1 it > >> should be totally fine. > >> > >> In the future, we would be glad to help you guys with any > >> maintenance-related questions. One of the highest priorities > around this > >> component seems to be the development of the full e2e test. > >> > >> Kind Regards > >> Oleksandr Nitavskyi > >> > >> From: Xintong Song > >> Sent: Tuesday, October 27, 2020 7:14 AM > >> To: dev ; user > >> Cc: Piyush Narang > >> Subject: [BULK]Re: [SURVEY] Remove Mesos support > >> > >> Hi Piyush, > >> > >> Thanks a lot for sharing the information. It would be a great > relief that > >> you are good with Flink on Mesos as is. > >> > >> As for the jira issues, I believe the most essential ones should > have > >> already been resolved. You may find some remaining open issues > here [1], > >> but not all of them are necessary if we decide to keep Flink on > Mesos as is. > >> > >> At the moment and in the short future, I think helps are mostly > needed on > >> testing the upcoming release 1.12 with Mesos use cases. The > community is > >> currently actively preparing the new release, and hopefully we > could come > >> up with a release candidate early next month. It would be greatly > >> appreciated if you fork as experienced Flink on Mesos users can > help with > >> verifying the release candidates. > >> > >> > >> Thank you~ > >> > >> Xintong Song > >> > >> [1] > >> > https://issues.apache.org/jira/browse/FLINK-17402?jql=project%20%3D%20FLINK%20AND%20component%20%3D%20%22Deployment%20%2F%20Mesos%22%20AND%20status%20%3D%20Open > >> < > >> >
[jira] [Created] (FLINK-22007) PartitionReleaseInBatchJobBenchmarkExecutor seems to be failing
Piotr Nowojski created FLINK-22007: -- Summary: PartitionReleaseInBatchJobBenchmarkExecutor seems to be failing Key: FLINK-22007 URL: https://issues.apache.org/jira/browse/FLINK-22007 Project: Flink Issue Type: Bug Components: Benchmarks, Runtime / Coordination Affects Versions: 1.13.0 Reporter: Piotr Nowojski Fix For: 1.13.0 Travis CI is failing: https://travis-ci.com/github/apache/flink-benchmarks/builds/221290042 While there is also some problem with the Jenkins builds for the same benchmark. http://codespeed.dak8s.net:8080/job/flink-scheduler-benchmarks/232 It would be also interesting for the future to understand why the Jenkins build is green and try to fix it (ideally, if some benchmarks fail, partial results should be still uploaded but the Jenkins build should be marked as failed). Otherwise issues like that can remain unnoticed for quite a bit of time. CC [~Thesharing] [~zhuzh] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22006) Could not run more than 20 jobs in a native K8s session with K8s HA enabled
Yang Wang created FLINK-22006: - Summary: Could not run more than 20 jobs in a native K8s session with K8s HA enabled Key: FLINK-22006 URL: https://issues.apache.org/jira/browse/FLINK-22006 Project: Flink Issue Type: Bug Affects Versions: 1.12.2, 1.13.0 Reporter: Yang Wang Attachments: image-2021-03-24-18-08-42-116.png Currently, if we start a native K8s session cluster with K8s HA enabled, we could not run more than 20 streaming jobs. The latest job is always initializing, and the previous one is created and waiting to be assigned. It seems that some internal resources have been exhausted, e.g. okhttp thread pool , tcp connections or something else. !image-2021-03-24-18-08-42-116.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)